GCP VM instance not terminating after timeout #834

eemelipa · 2021-12-04T19:17:40Z

Similar issue to #678

I'm starting a self hosted runner via Gitlab CICD to GCP:

deploy-runner:
  stage: start runner
  image: iterativeai/cml:0-dvc2-base1
  resource_group: all
  script:
    - cml runner --cloud=gcp --cloud-region=eu-north --cloud-type=c2-standard-4 --labels=cml-runner --reuse --idle-timeout 600

After the timeout the VM instance is not shutting down.

journalctl --unit cml --no-pager command shows

-- Logs begin at Sat 2021-12-04 18:40:23 UTC, end at Sat 2021-12-04 19:02:12 UTC. --
Dec 04 18:43:29 cml-4ejd6b8lzc systemd[1]: Started cml.service.
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Preparing workdir /tmp/tmp.rtgpwktKf5/.cml/cml-4ejd6b8lzc..."}
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Launching gitlab runner"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"level":"warn","message":"SpotNotifier can not be started."}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.453Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.454Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"ready"}
Dec 04 18:43:42 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:42.276Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_started"}
Dec 04 18:44:38 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:44:38.904Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:46 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:49 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.303Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_ended","success":false}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.649Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.952Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.953Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"runner status","reason":"timeout:600","status":"terminated"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Dec 04 18:56:48 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Unregistering runner cml-4ejd6b8lzc..."}
Dec 04 18:56:49 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"\tSuccess"}
Dec 04 18:56:50 cml-4ejd6b8lzc systemd[1]: cml.service: Succeeded.

The runner picks up a job correctly and the runner deregisters itself from the Gitlab project. The VM instance just does not shutdown.

On Azure similar config worked ok and the instances were shutting down

The text was updated successfully, but these errors were encountered:

eemelipa · 2021-12-04T19:50:32Z

Tried in different region and instance type: --cloud-region=europe-west1-b --cloud-type=n1-standard-1 but that did not help

DavidGOrtega · 2021-12-06T11:49:54Z

@eemelipa thanks a lot for the issue.

dacbd · 2021-12-10T22:28:12Z

hmm 🤔 the opposite of mine #808

@eemelipa can you post a screenshot for me? After you run your pipeline go to the GCP dashboard for the project then the Activity tab and take a screenshot?

If it's in a busy project that has other stuff, search for the "Logs Explorer" and narrow the time to when your pipeline ran, and look for any "severity" high than "notice"

eemelipa · 2021-12-11T09:33:48Z

@dacbd here's the screenshot ^ There weren't any warn/error logs 🤔 The bootCounter: 2 latebootreportevent seems to happen correctly at 5min mark which was the idle-timeout in this case. I don't know if the logs are missing any rows though

The credentials have "Compute Admin" privilege

dacbd · 2021-12-12T17:10:45Z

@eemelipa there should be some firewall/network API calls. When I had the teardowns fail these calls failed and so the instance remained.

eemelipa · 2021-12-12T19:04:40Z

@eemelipa there should be some firewall/network API calls. When I had the teardowns fail these calls failed and so the instance remained.

Ok, I adjusted the log filtering and now the firewall inserts are visible:

No errors/warnings though. Any thoughts on what to look for next? I tried giving the service account owner access to the project (i.e., all privileges) but that didn't help

dacbd · 2021-12-12T19:21:06Z

An error here would have made this an easy fix 😞

There is something more complicated going on, "Compute Admin" is a sufficient role.

dacbd · 2021-12-12T19:46:03Z

I'm not sure how much help I can provide without digging into the setup more, or building a custom cml image with more debug logging for the terraform provider

Here you can see a full successful logging lifecycle in gcp

dacbd · 2021-12-12T19:59:22Z

We might be able to get more info out of journalctl by adding a start-up script that creates a custom environment over-ride for the cml.service /etc/systemd/system/cml.service.d/debug.conf

I think something like this should work:
debug.conf

[Service]
Environment="TF_LOG=DEBUG"

eemelipa · 2021-12-13T22:35:59Z

Spend good time fiddling manually on the VM instance and got a step forward

Looks like the problem is that the CML VM instance did not get any GCP service account:

When I put that in place manually and restarted the instance things worked! The instance got deleted after the idle-timeout.

Sounds like something should fail if the instance does not have a service account. Here's couple options that came to my mind (obviously you guys know better what's under the hood):

The iterative terraform provider should fail when fetching credentials in case service account is not in place (maybe somewhere here https://github.com/iterative/terraform-provider-iterative/blob/550c698677b94644a80fa019fab6f62b26787543/iterative/gcp/provider.go#L295)
The GCP service account is specified with --cloud-permission-set cml cli argument. If the --cloud=gcp but the service account is missing then the cml runner command should fail

So missing service account is one problem and we seem to also have a second problem. When I give service account to cml runner command it creates the VM instance correctly with the account but it does not give correct Cloud API access scopes:

When I gave manually the API scopes the idle-timeout deletion worked ok. Sounds like some changes might be needed to the CML instance creation

dacbd · 2021-12-13T23:19:15Z

Hmm, it sounds like some documentation clarification might be required?

Under the hood, cml runner adds the GOOGLE_APPLICATION_CREDENTIALS_DATA that cml was invoked with into the systemd service unit as those should be the credentials used for the creation of the instance and thus also should be used for the teardown of the instance.

The --cloud-permission-set takes (in GCP's case) the service account email to attach to the instance, the intent behind that is for the application or ML model to use to access other services from the cloud provider like s3/object storage.

Are you saying it looks like terraform tried to use those (the --cloud-permission-set) creds instead of the original cml runner ones? That is definitely not intended.

This should be easy for me to reproduce and I'll try to get it fixed soon, if you are on discord and willing to test out a patch I can let you know when I have something working (dabarnes on discord)

dacbd · 2021-12-13T23:34:29Z

Can you share more of the yml from gitlab-ci?

I think I misunderstood some of your last reply.

Can you share the permissions list or GCP managed role for the service account whose key should be set in the GOOGLE_APPLICATION_CREDENTIALS_DATA environment variable when you invoke your cml runner command and the same for the one you are trying to set with the --cloud-permission-set ?

~~Also, cat out the /etc/systemd/system/cml.service file and be sure to check and redact any sensitive values that might be there~~

edit:
In the same screen as your first screenshot there's a metadata section. Can you verify GOOGLE_APPLICATION_CREDENTIALS_DATA json is your service account you invoked cml runner with, in the startup-script?

dacbd · 2021-12-14T23:10:14Z

Can be closed with TPI fix: iterative/terraform-provider-iterative#333

casperdcl · 2021-12-16T13:33:23Z

fixed by iterative/terraform-provider-iterative#333; thanks @dacbd

DavidGOrtega added p0-critical Max priority (ASAP) cml-runner Subcommand cloud-gcp Google Cloud labels Dec 6, 2021

casperdcl linked a pull request Dec 7, 2021 that will close this issue

Bullet proof runner termination #689

Merged

dacbd mentioned this issue Dec 14, 2021

GCP auth logic iterative/terraform-provider-iterative#333

Merged

casperdcl closed this as completed Dec 16, 2021

hopeai mentioned this issue Jun 9, 2023

GCP VM instance does not terminate after idle timeout #1386

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GCP VM instance not terminating after timeout #834

GCP VM instance not terminating after timeout #834

eemelipa commented Dec 4, 2021 •

edited

Loading

eemelipa commented Dec 4, 2021

DavidGOrtega commented Dec 6, 2021

dacbd commented Dec 10, 2021

eemelipa commented Dec 11, 2021

dacbd commented Dec 12, 2021

eemelipa commented Dec 12, 2021

dacbd commented Dec 12, 2021

dacbd commented Dec 12, 2021

dacbd commented Dec 12, 2021

eemelipa commented Dec 13, 2021

dacbd commented Dec 13, 2021

dacbd commented Dec 13, 2021 •

edited

Loading

dacbd commented Dec 14, 2021

casperdcl commented Dec 16, 2021

GCP VM instance not terminating after timeout #834

GCP VM instance not terminating after timeout #834

Comments

eemelipa commented Dec 4, 2021 • edited Loading

eemelipa commented Dec 4, 2021

DavidGOrtega commented Dec 6, 2021

dacbd commented Dec 10, 2021

eemelipa commented Dec 11, 2021

dacbd commented Dec 12, 2021

eemelipa commented Dec 12, 2021

dacbd commented Dec 12, 2021

dacbd commented Dec 12, 2021

dacbd commented Dec 12, 2021

eemelipa commented Dec 13, 2021

dacbd commented Dec 13, 2021

dacbd commented Dec 13, 2021 • edited Loading

dacbd commented Dec 14, 2021

casperdcl commented Dec 16, 2021

eemelipa commented Dec 4, 2021 •

edited

Loading

dacbd commented Dec 13, 2021 •

edited

Loading