feat: implement graceful shutdown of GitLab Runner (#1117)
## Description
Based on the discussion in #1067, this PR:
1. Moves the EventBridge rule that triggers the Lambda from `TERMINATING`
to `TERMINATE`. The Lambda now functions as an "after-the-fact" cleanup
instead of being responsible for cleanup _during_ termination.
2. Introduces a shell script, managed by systemd, that monitors the
target lifecycle state of the instance and initiates a graceful shutdown
of GitLab Runner.
3. Makes the heartbeat timeout of the ASG terminating hook configurable,
defaulting to the maximum job timeout + 5 minutes, capped at `7200`
seconds (2 hours).
4. Introduces a launching lifecycle hook, allowing the new instance to
provision itself and GitLab Runner to reach its configured capacity
before the current instance is terminated.
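The monitor from point 2 and the heartbeat default from point 3 can be sketched as follows. This is an illustrative sketch only, not the module's actual script: the `target-lifecycle-state` path is the documented IMDSv2 endpoint, but the unit name, the polling interval, and the exact stop behavior are assumptions.

```shell
#!/bin/sh
# Sketch of a systemd-managed lifecycle monitor (illustrative, not the
# module's real implementation).
set -eu

# Default terminating-hook heartbeat: max job timeout + 5 min, capped at 7200 s.
heartbeat_timeout() {
  timeout=$(( $1 + 300 ))
  if [ "$timeout" -gt 7200 ]; then
    timeout=7200
  fi
  echo "$timeout"
}

# Query the instance's target lifecycle state via IMDSv2.
target_lifecycle_state() {
  token=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
  curl -sf -H "X-aws-ec2-metadata-token: ${token}" \
    "http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state"
}

# Poll until the ASG marks this instance for termination.
wait_for_termination() {
  while [ "$(target_lifecycle_state)" != "Terminated" ]; do
    sleep "${POLL_INTERVAL:-5}"
  done
}

main() {
  wait_for_termination
  # Assumed to shut down gracefully; the actual signal handling depends on
  # the gitlab-runner unit's KillSignal/TimeoutStopSec configuration.
  systemctl stop gitlab-runner.service
}

# A systemd service started at boot would invoke: main "$@"
```

Once `systemctl stop` returns (i.e. all running jobs have drained), the ASG terminating hook can be completed and the instance proceeds to termination.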
## Migrations required
No. However, if the previous default behavior of immediately terminating
all Workers and the Manager is desired, the
`runner_worker_graceful_terminate_timeout_duration` variable should be
set to 30 (the minimum allowed).
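A minimal sketch of pinning that variable in the module call (the surrounding module block and source are illustrative; only the variable name and value come from this PR):

```hcl
module "gitlab_runner" {
  source = "..." # your existing module source and version

  # Restore near-immediate termination (30 s is the minimum allowed).
  runner_worker_graceful_terminate_timeout_duration = 30
}
```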
## Verification
### Graceful terminate
1. Deploy this version of the module.
2. Start a long-running GitLab job.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the job keeps running and produces output. Verify from the
instance logs that the GitLab Runner service is still running.
5. Once the remaining jobs have completed, observe that the GitLab Runner
service is terminated and the instance is put into `Terminating:Proceed`
status.
### Zero Downtime deployment
1. Deploy this version of the module.
2. Start multiple long-running GitLab jobs, twice the capacity of the
GitLab Runner.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the jobs keep running and produce output. Verify from the
instance logs that the GitLab Runner service is still running.
5. Verify a new instance gets spun up while the current instance stays
`InService`.
6. Verify the new instance is able to provision its configured capacity.
7. Verify the new instance starts picking up GitLab jobs from the queue
before the current instance is terminated.
8. Observe that there is zero downtime.
9. Once the remaining jobs have completed, observe that the GitLab Runner
service is terminated and the current instance is put into
`Terminating:Proceed` status.
Closes #1029
---------
Co-authored-by: Matthias Kay <[email protected]>
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_asg_arn"></a> [asg\_arn](#input\_asg\_arn) | The ARN of the Auto Scaling Group to attach to. | `string` | n/a | yes |
| <a name="input_asg_hook_terminating_heartbeat_timeout"></a> [asg\_hook\_terminating\_heartbeat\_timeout](#input\_asg\_hook\_terminating\_heartbeat\_timeout) | Duration the ASG should stay in the Terminating:Wait state. | `number` | `30` | no |
| <a name="input_asg_name"></a> [asg\_name](#input\_asg\_name) | The name of the Auto Scaling Group to attach to. The 'environment' will be prefixed to this. | `string` | n/a | yes |
| <a name="input_cloudwatch_logging_retention_in_days"></a> [cloudwatch\_logging\_retention\_in\_days](#input\_cloudwatch\_logging\_retention\_in\_days) | The number of days to retain logs in CloudWatch. | `number` | `30` | no |
| <a name="input_enable_xray_tracing"></a> [enable\_xray\_tracing](#input\_enable\_xray\_tracing) | Enables X-Ray for debugging and analysis. | `bool` | `false` | no |