Option to gracefully terminate runner #1029
Yes and no, I think. The Lambda is executed in case the GitLab Runner (which started the workers) dies. In this case the workers can continue with the current job, but they are not able to upload the logs, artifacts, etc. to GitLab, as this needs the GitLab Runner which is no longer there. As the job might access external resources, it makes sense to wait until it is finished and kill the worker then.
Some thoughts which popped up while checking the procedure you described above:
If you could share your implementation, it would be wonderful.
You're right, this wouldn't help the situation where the runner dies. We were intending this for situations where the runner is modified and requires a refresh. Here is a slimmed-down version of our implementation; I added it to the examples folder: https://github.com/long-wan-ep/terraform-aws-gitlab-runner/tree/graceful-terminate-example/examples/graceful-terminate
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.
Hi @kayman-mk, noticed this issue was auto-closed, could we re-open it? Does our solution look ok? Or any other ideas we could try?
Re-read everything ;-) Let's give it a try. The […] Could you please propose a PR? Would be a good idea to make the […]
It seems that the […]
Sounds good, we'll start working on a PR soon.
#1117 will resolve this issue.
My bad, I meant to comment here, but it somehow got lost. Original intent: Hey @kayman-mk, @long-wan-ep I actually started implementing the proposal discussed in #1067: #1117. This MR still needs some polish, but based on my initial testing it seems to work. It basically makes the Runner Manager a bit smarter and aware of its own desired state, and it acts accordingly. @long-wan-ep definitely not meant to steal your thunder, but I do think #1117, if working, makes the implementation a bit simpler. Hope you do not mind! ❤️
@tmeijn I don't mind at all, your implementation looks great, thanks for opening the PR.
## Description

Based on the discussion in #1067:

1. Moves the EventBridge rule that triggers the Lambda from `TERMINATING` to `TERMINATE`. The Lambda now functions as an "after-the-fact" cleanup instead of being responsible for cleanup _during_ termination.
2. Introduces a shell script managed by systemd that monitors the target lifecycle of the instance and initiates GitLab Runner graceful shutdown.
3. Makes the heartbeat timeout of the ASG terminating hook configurable, with a default of the maximum job timeout + 5 minutes, capped at `7200` (2 hours).
4. Introduces a launching lifecycle hook, allowing the new instance to provision itself and GitLab Runner to provision its set capacity before the current instance is terminated.

## Migrations required

No, except that if the previous default behavior of immediately terminating all Workers + Manager is desired, the `runner_worker_graceful_terminate_timeout_duration` variable should be set to 30 (the minimum allowed).

## Verification

### Graceful terminate

1. Deploy this version of the module.
2. Start a long-running GitLab job.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the job keeps running and has output. Verify from the instance logs that the GitLab Runner service is still running.
5. Once remaining jobs have been completed, observe that the GitLab Runner service is terminated and the instance is put into `Terminating:Proceed` status.

### Zero Downtime deployment

1. Deploy this version of the module.
2. Start multiple long-running GitLab jobs, twice the capacity of the GitLab Runner.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the jobs keep running and have output. Verify from the instance logs that the GitLab Runner service is still running.
5. Verify a new instance gets spun up while the current instance stays `InService`.
6. Verify the new instance is able to provision its set capacity.
7. Verify the new instance starts picking up GitLab jobs from the queue before the current instance gets terminated.
8. Observe that there is zero downtime.
9. Once remaining jobs have been completed, observe that the GitLab Runner service is terminated and the current instance is put into `Terminating:Proceed` status.

Closes #1029

---------

Co-authored-by: Matthias Kay <[email protected]>
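For illustration, here is a minimal sketch of the lifecycle-monitoring idea from point 2 above. The module implements this as a systemd-managed shell script; the Python below is only an approximation, assuming IMDSv2 is enabled and that the GitLab Runner service drains jobs on SIGQUIT.

```python
# Hypothetical sketch of the instance-side lifecycle monitor (the module's real
# implementation is a systemd-managed shell script, not this Python).
import subprocess
import time
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_token() -> str:
    # IMDSv2 session token, valid for 6 hours.
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()


def imds(path: str, token: str) -> str:
    req = urllib.request.Request(f"{IMDS}/{path}", headers={"X-aws-ec2-metadata-token": token})
    return urllib.request.urlopen(req, timeout=2).read().decode()


def main() -> None:
    while True:
        # "InService" until the ASG decides to terminate this instance.
        state = imds("meta-data/autoscaling/target-lifecycle-state", imds_token())
        if state != "InService":
            # Assumption: SIGQUIT triggers GitLab Runner's graceful shutdown,
            # i.e. it stops accepting new jobs and waits for running jobs to finish.
            subprocess.run(["systemctl", "kill", "--signal=SIGQUIT", "gitlab-runner"], check=False)
            # Wait until the unit is inactive, i.e. all jobs have drained; the
            # terminating hook's heartbeat timeout bounds how long this can take.
            while subprocess.run(["systemctl", "is-active", "--quiet", "gitlab-runner"]).returncode == 0:
                time.sleep(10)
            break
        time.sleep(10)


if __name__ == "__main__":
    main()
```

With the EventBridge rule moved to `TERMINATE` (point 1), the Lambda no longer races this drain; it only cleans up whatever the instance left behind.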
Describe the solution you'd like

When the `terminate-agent-hook` runs, workers are terminated and running jobs are interrupted. We would like an option to gracefully terminate runners, so that the running jobs are given a chance to complete.

Describe alternatives you've considered

We previously disabled the creation of `terminate-agent-hook` and used our own hook + lambda to handle graceful termination, but `terminate-agent-hook` was made mandatory, so we can no longer do this.

Suggest a solution

We suggest adding an option to gracefully terminate runners in the `terminate-agent-hook` lambda. We can contribute our graceful termination logic to `terminate-agent-hook` if it works for you. Here is a brief summary of our solution:

a. If the gitlab-runner service stopped successfully, the lambda completes the lifecycle hook.
b. If the gitlab-runner service has not stopped successfully, an error is thrown and the SQS message goes back to the queue to be retried in the next run.
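A rough sketch of that retry flow, assuming the terminating lifecycle-hook notification reaches the Lambda via SQS and that SSM Run Command is used to check the service on the instance; the event shape and names below are illustrative, not the project's actual terminate-agent-hook code.

```python
# Illustrative only: event shape, SSM usage and names are assumptions, not the
# project's actual terminate-agent-hook implementation.
import json

import boto3

ssm = boto3.client("ssm")
autoscaling = boto3.client("autoscaling")


def handler(event, context):
    for record in event["Records"]:  # SQS batch carrying the terminating lifecycle-hook event
        detail = json.loads(record["body"])["detail"]
        instance_id = detail["EC2InstanceId"]

        if not runner_stopped(instance_id):
            # Raising returns the SQS message to the queue, so the check is
            # retried on a later invocation until the runner has drained its jobs.
            raise RuntimeError(f"gitlab-runner still running on {instance_id}")

        # Runner stopped cleanly: let the ASG finish terminating the instance.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=detail["LifecycleHookName"],
            AutoScalingGroupName=detail["AutoScalingGroupName"],
            InstanceId=instance_id,
            LifecycleActionResult="CONTINUE",
        )


def runner_stopped(instance_id: str) -> bool:
    # Ask the instance, via SSM Run Command, whether the service is still active.
    command_id = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["systemctl is-active gitlab-runner || true"]},
    )["Command"]["CommandId"]
    ssm.get_waiter("command_executed").wait(CommandId=command_id, InstanceId=instance_id)
    output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
    return output["StandardOutputContent"].strip() != "active"
```

Whether this check runs in the Lambda (as proposed here) or on the instance itself via a systemd-managed script (as #1117 later does) is the main design difference between the two approaches.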