
GCP cloud runner not terminating #678

Closed
a3lem opened this issue Jul 28, 2021 · 7 comments
Labels: cml-runner Subcommand, p0-critical Max priority (ASAP)

a3lem commented Jul 28, 2021

This is a repeat of #661, which was supposedly fixed in #653. Unfortunately, I'm not seeing any changes in the shutdown behavior of my GCP compute instances. That is, they keep running past the timeout interval.

I'm using the same workflow as before (in #661):

name: 'Train-in-the-cloud-GCP'
on: 
  workflow_dispatch:

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: 'Deploy runner on GCP'
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          # Notice use of `GOOGLE_APPLICATION_CREDENTIALS_DATA` instead of
          # `GOOGLE_APPLICATION_CREDENTIALS`. Contrary to what docs suggest, the
          # latter causes problems for terraform.
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
        run: |
          cml-runner \
          --cloud gcp \
          --cloud-region europe-west1-b \
          --cloud-type=n1-standard-1 \
          --labels=cml-runner
          
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: 'Train my dummy model'
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          echo "Training a super awesome model"
          sleep 5
          echo "Training complete"

Anyway, this seems to contradict the tests, as @DavidGOrtega explains in the comments under #653:

[...] tests with TPI indicates that the instances are disposed after the expected time.

Any idea what I might be doing wrong?

DavidGOrtega self-assigned this Jul 28, 2021
DavidGOrtega added the p0-critical and cml-runner labels Jul 28, 2021
casperdcl mentioned this issue Jul 28, 2021
DavidGOrtega (Contributor) commented Jul 29, 2021

I can say again that GCP is terminating as expected. Can you please get into the machine, run the following, and let me know what it says?

journalctl --unit cml --no-pager
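
Something along these lines should get you into the instance (the instance name here is an example; substitute your actual one, e.g. from `gcloud compute instances list`, and the zone from your workflow):

# SSH into the CML runner instance to inspect its logs
gcloud compute ssh cml-36s36ywc7z --zone=europe-west1-b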

@lemontheme

a3lem (Author) commented Jul 29, 2021

Sure thing. Here's what I get:

-- Logs begin at Thu 2021-07-29 14:22:56 UTC, end at Thu 2021-07-29 14:38:42 UTC. --
Jul 29 14:26:39 cml-36s36ywc7z systemd[1]: Started cml.service.
Jul 29 14:26:46 cml-36s36ywc7z cml.sh[17975]: Preparing workdir /tmp/tmp.b7BwstF7kJ/.cml/cml-07toknujbd...
Jul 29 14:26:46 cml-36s36ywc7z cml.sh[17975]: Launching github runner
Jul 29 14:27:10 cml-36s36ywc7z cml.sh[17975]: SpotNotifier can not be started.
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.452Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","message":""}
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.453Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","message":"√ "}
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.454Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","message":"Connected to GitHub"}
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.454Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","message":""}
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.995Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","status":"ready","message":"Listening for Jobs"}
Jul 29 14:27:22 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:22.333Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","job":3192860335,"status":"job_started","message":"Running job: model-training"}
Jul 29 14:34:29 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:34:29.721Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","job":"","status":"job_ended","success":true,"message":"Job model-training completed with result: Succeeded"}

DavidGOrtega (Contributor) commented Jul 29, 2021

And it still doesn't terminate? As far as I can see, the timeout is not happening...
What a weird thing.

DavidGOrtega (Contributor) commented

I can see that the runner is terminating itself properly on idle timeout; however, when I destroy it using the Terraform provider, GCP sometimes does not send the graceful shutdown.

[screenshot attached to the original comment]

However, this does reflect the issue here, where it seems that the timer might not be working.
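
In the meantime, a stuck instance can always be removed forcefully (instance name and zone as in the examples above; --quiet skips the confirmation prompt):

# manually delete a runner instance that failed to self-terminate
gcloud compute instances delete cml-36s36ywc7z --zone=europe-west1-b --quiet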

dacbd (Contributor) commented Oct 14, 2021

@lemontheme I believe this issue is resolved; can you confirm your workflow is functional without any workarounds?

a3lem (Author) commented Oct 19, 2021

Hi @dacbd, sorry to keep you waiting. Been a while since I looked at this.

Anyway, I'm happy to confirm that instances are now indeed stopped and deleted as expected! :) That's using the exact same workflow as above. Great to see you've made progress with this. Thanks!

0x2b3bfa0 (Member) commented

Thank you very much, @dacbd for the fix and @lemontheme for confirming the resolution!
