Option to gracefully terminate runner #1029
Yes and no, I think. The Lambda is executed in case the GitLab Runner (which started the workers) dies. In this case the workers can continue with the current job, but they are not able to upload the logs, artifacts, etc. to GitLab, as this needs the GitLab Runner which is no longer there. As the job might access external resources, it makes sense to wait until it is finished and kill the worker then.
Some thoughts which popped up while checking the procedure you described above:
If you could share your implementation, it would be wonderful.
You're right, this wouldn't help the situation where the runner dies. We were intending this for situations where the runner is modified and requires a refresh. Here is a slimmed-down version of our implementation; I added it to the examples folder: https://github.com/long-wan-ep/terraform-aws-gitlab-runner/tree/graceful-terminate-example/examples/graceful-terminate
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.
Hi @kayman-mk, noticed this issue was auto-closed, could we re-open it? Does our solution look ok? Or any other ideas we could try?
Re-read everything ;-) Let's give it a try. The […] Could you please propose a PR? Would be a good idea to make the […]
It seems that the […]
Sounds good, we'll start working on a PR soon.
#1117 will resolve this issue.
My bad, I meant to comment here, but it somehow got lost. Original intent: Hey @kayman-mk, @long-wan-ep I actually started implementing the proposal discussed in #1067: #1117. This MR still needs some polish, but based on my initial testing it seems to work. It basically makes the Runner Manager a bit smarter and aware of its own desired state, and it acts accordingly. @long-wan-ep definitely not meant to steal your thunder, but I do think #1117, if working, makes the implementation a bit simpler. Hope you do not mind! ❤️
@tmeijn I don't mind at all, your implementation looks great, thanks for opening the PR.
## Description

Based on the discussion in #1067:

1. Moves the EventBridge rule that triggers the Lambda from `TERMINATING` to `TERMINATE`. The Lambda now functions as an "after-the-fact" cleanup instead of being responsible for cleanup _during_ termination.
2. Introduces a shell script managed by systemd that monitors the target lifecycle of the instance and initiates GitLab Runner graceful shutdown.
3. Makes the heartbeat timeout of the ASG terminating hook configurable, with a default of the maximum job timeout + 5 minutes, capped at `7200` (2 hours).
4. Introduces a launching lifecycle hook, allowing the new instance to provision itself and GitLab Runner to provision its set capacity before the current instance is terminated.

## Migrations required

No, except that if the previous default behavior of immediately terminating all Workers + Manager is desired, the `runner_worker_graceful_terminate_timeout_duration` variable should be set to 30 (the minimum allowed).

## Verification

### Graceful terminate

1. Deploy this version of the module.
2. Start a long-running GitLab job.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the job keeps running and has output. Verify from the instance logs that the GitLab Runner service is still running.
5. Once remaining jobs have been completed, observe that the GitLab Runner service is terminated and the instance is put into `Terminating:Proceed` status.

### Zero Downtime deployment

1. Deploy this version of the module.
2. Start multiple long-running GitLab jobs, twice the capacity of the GitLab Runner.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the jobs keep running and have output. Verify from the instance logs that the GitLab Runner service is still running.
5. Verify a new instance gets spun up while the current instance stays `InService`.
6. Verify the new instance is able to provision its set capacity.
7. Verify the new instance starts picking up GitLab jobs from the queue before the current instance gets terminated.
8. Observe that there is zero downtime.
9. Once remaining jobs have been completed, observe that the GitLab Runner service is terminated and the current instance is put into `Terminating:Proceed` status.

Closes #1029

---------

Co-authored-by: Matthias Kay <[email protected]>
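For illustration, here is a minimal sketch of the lifecycle-monitoring idea from point 2 above. The module implements this as a systemd-managed shell script; the Python below is only an approximation, assuming IMDSv2 is enabled and that the GitLab Runner service drains jobs on SIGQUIT.

```python
# Hypothetical sketch of the instance-side lifecycle monitor (the module's real
# implementation is a systemd-managed shell script, not this Python).
import subprocess
import time
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_token() -> str:
    # IMDSv2 session token, valid for 6 hours.
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()


def imds(path: str, token: str) -> str:
    req = urllib.request.Request(f"{IMDS}/{path}", headers={"X-aws-ec2-metadata-token": token})
    return urllib.request.urlopen(req, timeout=2).read().decode()


def main() -> None:
    while True:
        # "InService" until the ASG decides to terminate this instance.
        state = imds("meta-data/autoscaling/target-lifecycle-state", imds_token())
        if state != "InService":
            # Assumption: SIGQUIT triggers GitLab Runner's graceful shutdown,
            # i.e. it stops accepting new jobs and waits for running jobs to finish.
            subprocess.run(["systemctl", "kill", "--signal=SIGQUIT", "gitlab-runner"], check=False)
            # Wait until the unit is inactive, i.e. all jobs have drained; the
            # terminating hook's heartbeat timeout bounds how long this can take.
            while subprocess.run(["systemctl", "is-active", "--quiet", "gitlab-runner"]).returncode == 0:
                time.sleep(10)
            break
        time.sleep(10)


if __name__ == "__main__":
    main()
```

With the EventBridge rule moved to `TERMINATE` (point 1), the Lambda no longer races this drain; it only cleans up whatever the instance left behind.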
Describe the solution you'd like

When the `terminate-agent-hook` runs, workers are terminated and running jobs are interrupted. We would like an option to gracefully terminate runners, so that the running jobs are given a chance to complete.

Describe alternatives you've considered

We previously disabled the creation of `terminate-agent-hook` and used our own hook + lambda to handle graceful termination, but `terminate-agent-hook` was made mandatory, so we can no longer do this.

Suggest a solution

We suggest adding an option to gracefully terminate runners in the `terminate-agent-hook` lambda. We can contribute our graceful termination logic to `terminate-agent-hook` if it works for you. Here is a brief summary of our solution:

a. If the gitlab-runner service stopped successfully, the lambda completes the lifecycle hook.
b. If the gitlab-runner service has not stopped successfully, an error is thrown and the SQS message goes back to the queue to be retried in the next run.
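A rough sketch of that retry flow, assuming the terminating lifecycle-hook notification reaches the Lambda via SQS and that SSM Run Command is used to check the service on the instance; the event shape and names below are illustrative, not the project's actual terminate-agent-hook code.

```python
# Illustrative only: event shape, SSM usage and names are assumptions, not the
# project's actual terminate-agent-hook implementation.
import json

import boto3

ssm = boto3.client("ssm")
autoscaling = boto3.client("autoscaling")


def handler(event, context):
    for record in event["Records"]:  # SQS batch carrying the terminating lifecycle-hook event
        detail = json.loads(record["body"])["detail"]
        instance_id = detail["EC2InstanceId"]

        if not runner_stopped(instance_id):
            # Raising returns the SQS message to the queue, so the check is
            # retried on a later invocation until the runner has drained its jobs.
            raise RuntimeError(f"gitlab-runner still running on {instance_id}")

        # Runner stopped cleanly: let the ASG finish terminating the instance.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=detail["LifecycleHookName"],
            AutoScalingGroupName=detail["AutoScalingGroupName"],
            InstanceId=instance_id,
            LifecycleActionResult="CONTINUE",
        )


def runner_stopped(instance_id: str) -> bool:
    # Ask the instance, via SSM Run Command, whether the service is still active.
    command_id = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["systemctl is-active gitlab-runner || true"]},
    )["Command"]["CommandId"]
    ssm.get_waiter("command_executed").wait(CommandId=command_id, InstanceId=instance_id)
    output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
    return output["StandardOutputContent"].strip() != "active"
```

Whether this check runs in the Lambda (as proposed here) or on the instance itself via a systemd-managed script (as #1117 later does) is the main design difference between the two approaches.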