Commit d2e2224

tmeijn and kayman-mk authored
feat: implement graceful shutdown of GitLab Runner (#1117)
## Description

Based on the discussion #1067:

1. Move the EventBridge rule that triggers the Lambda from `TERMINATING` to `TERMINATE`. The Lambda now functions as an "after-the-fact" cleanup instead of being responsible for cleanup _during_ termination.
2. Introduce a shell script managed by systemd that monitors the target lifecycle state of the instance and initiates GitLab Runner graceful shutdown.
3. Make the heartbeat timeout of the ASG terminating hook configurable, with a default of the maximum job timeout + 5 minutes, capped at `7200` (2 hours).
4. Introduce a launching lifecycle hook, allowing the new instance to provision itself and GitLab Runner to provision its set capacity before terminating the current instance.

## Migrations required

No, except that if the previous default behavior of immediately terminating all Workers + Manager is desired, the `runner_worker_graceful_terminate_timeout_duration` variable should be set to 30 (the minimum allowed).

## Verification

### Graceful terminate

1. Deploy this version of the module.
2. Start a long running GitLab job.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the job keeps running and has output.
5. Verify from the instance logs that the GitLab Runner service is still running.
6. Once the remaining jobs have completed, observe that the GitLab Runner service is terminated and the instance is put into `Terminating:Proceed` status.

### Zero Downtime deployment

1. Deploy this version of the module.
2. Start multiple long running GitLab jobs, twice the capacity of the GitLab Runner.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the jobs keep running and have output.
5. Verify from the instance logs that the GitLab Runner service is still running.
6. Verify a new instance gets spun up while the current instance stays `InService`.
7. Verify the new instance is able to provision its set capacity.
8. Verify the new instance starts picking up GitLab jobs from the queue before the current instance gets terminated.
9. Observe that there is zero downtime.
10. Once the remaining jobs have completed, observe that the GitLab Runner service is terminated and the current instance is put into `Terminating:Proceed` status.

Closes #1029

---------

Co-authored-by: Matthias Kay <[email protected]>
1 parent d37eb59 commit d2e2224

16 files changed: +231 -40 lines changed

.cspell.json

+2
Original file line number | Diff line number | Diff line change
@@ -6,6 +6,7 @@
66
"amazonec",
77
"anytrue",
88
"amannn",
9+
"autonumber",
910
"awscli",
1011
"boto",
1112
"botocore",
@@ -53,6 +54,7 @@
5354
"tftpl",
5455
"tfvars",
5556
"tmpfs",
57+
"tonumber",
5658
"trivy",
5759
"userns",
5860
"xanzy",

.pre-commit-config.yaml

+1-2
@@ -1,11 +1,10 @@
11
repos:
22
- repo: https://github.com/antonbabenko/pre-commit-terraform
3-
rev: v1.64.1
3+
rev: v1.89.0
44
hooks:
55
- id: terraform_fmt
66
args:
77
- --args=-recursive
8-
- id: terraform_tflint
98
- repo: https://github.com/pre-commit/pre-commit-hooks
109
rev: v4.2.0
1110
hooks:

.pylintrc

+1-1
@@ -1,5 +1,5 @@
11
[MASTER]
2-
init-hook="import sys; sys.path.insert(0, '/usr/local/lib/python3.11/site-packages/')"
2+
init-hook="import sys; sys.path.insert(0, '/usr/local/lib/python3.12/site-packages/')"
33

44
[FORMAT]
55
max-line-length=132

docs/usage.md

+41-16
@@ -54,14 +54,14 @@ module "runner" {
5454
5555
vpc_id = module.vpc.vpc_id
5656
subnet_id = element(module.vpc.private_subnets, 0)
57-
57+
5858
runner_instance = {
59-
name = "docker-default"
59+
name = "docker-default"
6060
}
61-
61+
6262
runner_gitlab = {
6363
url = "https://gitlab.com"
64-
64+
6565
preregistered_runner_token_ssm_parameter_name = "my-gitlab-runner-token-ssm-parameter-name"
6666
}
6767
}
@@ -77,23 +77,23 @@ map. A simple example for this would be to set _region-specific-prefix_ to the A
7777
module "runner" {
7878
# https://registry.terraform.io/modules/cattle-ops/gitlab-runner/aws/
7979
source = "cattle-ops/gitlab-runner/aws"
80-
80+
8181
environment = "multi-region-1"
8282
iam_object_prefix = "<region-specific-prefix>-gitlab-runner-iam"
83-
83+
8484
vpc_id = module.vpc.vpc_id
8585
subnet_id = element(module.vpc.private_subnets, 0)
86-
86+
8787
runner_gitlab = {
8888
url = "https://gitlab.com"
8989
9090
preregistered_runner_token_ssm_parameter_name = "my-gitlab-runner-token-ssm-parameter-name"
9191
}
92-
92+
9393
runner_worker_cache = {
9494
bucket_prefix = "<region-specific-prefix>"
9595
}
96-
96+
9797
runner_worker_docker_machine_instance = {
9898
subnet_ids = module.vpc.private_subnets
9999
}
@@ -208,14 +208,39 @@ module "runner" {
208208
}
209209
```
210210

211-
#### Instance Termination
212-
213-
The Auto Scaling Group may be configured with a [lifecycle hook](https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html)
214-
that executes a provided Lambda function when the runner is terminated to terminate additional instances that were spawned.
215-
216-
The use of the termination lifecycle can be toggled using the `runner_enable_asg_recreation` variable.
211+
#### Graceful termination / Zero Downtime deployment
212+
213+
This module supports zero-downtime deployments by following a structured process:
214+
215+
- The new instance is first set to the `pending` state, allowing it to provision both GitLab Runner and its configured capacity.
216+
This process is allocated a maximum of five minutes.
217+
- Once provisioning is complete, a signal is sent to the current instance, setting it to the `terminating:wait` state.
218+
- This signal triggers the monitor_runner.sh systemd service, which sends a SIGQUIT signal to the GitLab Runner process,
219+
initiating a graceful shutdown.
220+
- The maximum allowed time for the shutdown process is defined by the `runner_terminate_ec2_lifecycle_timeout_duration` variable.
221+
222+
The diagram below illustrates this process.
223+
224+
```mermaid
225+
sequenceDiagram
226+
autonumber
227+
participant ASG as Autoscaling Group
228+
participant CI as Current Instance
229+
participant NI as New Instance
230+
ASG->>NI: Provision New Instance (status: Pending)
231+
Note over NI: Install GitLab Runner <br/>and provision capacity<br/>(5m grace period)
232+
ASG->>NI: Set status to InService
233+
ASG->>CI: Set status to Terminating:Wait
234+
CI->>CI: Graceful terminate:<br/>Stop picking up new jobs,<br/>Finish current jobs<br/>assigned to this Runner
235+
CI->>ASG: Send complete-lifecycle-action
236+
ASG->>CI: Set status to Terminating:Proceed
237+
Note over CI: Instance is terminated:<br/>Cleanup Lambda is triggered
238+
```
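The drain sequence in the diagram above can be sketched in Python. This is only an illustration of the control flow, not the module's `monitor_runner.sh` (a shell script that is not shown in this view). The injected callables and the `"Terminated"` target state are assumptions made for the sketch; in practice the state would typically be read from instance metadata and the completion call sent through the AWS API.

```python
import signal
import time
from typing import Callable


def monitor_and_drain(
    get_target_state: Callable[[], str],
    send_signal: Callable[[int], None],
    runner_is_running: Callable[[], bool],
    complete_lifecycle_action: Callable[[], None],
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Wait until termination is requested, drain the runner, then notify the ASG."""
    # Poll until the Auto Scaling Group wants this instance gone.
    # "Terminated" as the target state is an assumption of this sketch.
    while get_target_state() != "Terminated":
        sleep(poll_seconds)

    # SIGQUIT asks GitLab Runner to stop picking up jobs and finish running ones.
    send_signal(signal.SIGQUIT)

    # Block until the runner process has exited on its own.
    while runner_is_running():
        sleep(poll_seconds)

    # Completing the hook lets the instance move to Terminating:Proceed.
    complete_lifecycle_action()
```

Passing the side effects in as callables keeps the ordering (signal first, completion last) easy to check without touching a real instance.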
217239

218-
When using this feature, a `builds/` directory relative to the root module will persist that contains the packaged Lambda function.
240+
The Auto Scaling Group is configured with a [lifecycle hook](https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html)
241+
that executes a provided Lambda function when the runner is terminated to terminate additional instances that were
242+
provisioned by the Docker Machine executor. A `builds/` directory relative to the root module persists that
243+
contains the packaged Lambda function.
219244

220245
### Access the Runner instance
221246

locals.tf

+4
@@ -91,6 +91,10 @@ locals {
9191
]
9292

9393
docker_machine_adds_name_tag = signum(sum(local.docker_machine_version_test)) <= 0
94+
95+
runner_worker_graceful_terminate_timeout_duration = (var.runner_terminate_ec2_lifecycle_timeout_duration == null
96+
? min(7200, tonumber(coalesce(var.runner_gitlab_registration_config.maximum_timeout, 0)) + 300)
97+
: var.runner_terminate_ec2_lifecycle_timeout_duration)
9498
}
9599

96100
resource "local_file" "config_toml" {
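To make the default in the `locals` expression above concrete, here is a minimal Python sketch of the same arithmetic. The function and constant names are hypothetical; only the `7200` cap, the `300` second margin, and the "explicit value wins" rule come from the diff.

```python
from typing import Optional, Union

MAX_HEARTBEAT_SECONDS = 7200  # cap from the `min(7200, ...)` expression
GRACE_SECONDS = 300           # five minutes added on top of the job timeout


def graceful_terminate_timeout(
    maximum_timeout: Optional[Union[int, str]],
    override: Optional[int] = None,
) -> int:
    """Mirror the Terraform local: an explicit override wins, else a capped default."""
    if override is not None:
        return override
    # Like coalesce(..., 0): a missing or empty job timeout counts as 0.
    job_timeout = int(maximum_timeout) if maximum_timeout not in (None, "") else 0
    return min(MAX_HEARTBEAT_SECONDS, job_timeout + GRACE_SECONDS)
```

For example, a one hour job timeout gives `graceful_terminate_timeout(3600) == 3900`, while an unset job timeout falls back to the bare five minute margin.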

main.tf

+26-10
@@ -80,6 +80,7 @@ locals {
8080
use_fleet = var.runner_worker_docker_machine_fleet.enable
8181
private_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].private_key_pem : ""
8282
use_new_runner_authentication_gitlab_16 = var.runner_gitlab_registration_config.type != ""
83+
user_data_trace_log = var.debug.trace_runner_user_data
8384
})
8485

8586
template_runner_config = templatefile("${path.module}/template/runner-config.tftpl",
@@ -174,10 +175,15 @@ resource "aws_autoscaling_group" "gitlab_runner_instance" {
174175
version = aws_launch_template.gitlab_runner_instance.latest_version
175176
}
176177

178+
instance_maintenance_policy {
179+
max_healthy_percentage = 110
180+
min_healthy_percentage = 100
181+
}
182+
177183
instance_refresh {
178184
strategy = "Rolling"
179185
preferences {
180-
min_healthy_percentage = 0
186+
min_healthy_percentage = 100
181187
}
182188
triggers = ["tag"]
183189
}
@@ -656,21 +662,31 @@ resource "aws_iam_role_policy_attachment" "eip" {
656662
policy_arn = aws_iam_policy.eip[0].arn
657663
}
658664

665+
# We wait for 5 minutes until we set an EC2 instance to status `InService` so it has time to provision itself and its configured capacity.
666+
resource "aws_autoscaling_lifecycle_hook" "wait_for_gitlab_runner" {
667+
name = "${var.environment}-wait-for-gitlab-runner-up"
668+
autoscaling_group_name = aws_autoscaling_group.gitlab_runner_instance.name
669+
default_result = "CONTINUE"
670+
heartbeat_timeout = 300
671+
lifecycle_transition = "autoscaling:EC2_INSTANCE_LAUNCHING"
672+
}
673+
659674
################################################################################
660675
### Lambda function triggered as soon as an agent is terminated.
661676
################################################################################
662677
module "terminate_agent_hook" {
663678
source = "./modules/terminate-agent-hook"
664679

665-
name = var.runner_terminate_ec2_lifecycle_hook_name == null ? "terminate-instances" : var.runner_terminate_ec2_lifecycle_hook_name
666-
environment = var.environment
667-
asg_arn = aws_autoscaling_group.gitlab_runner_instance.arn
668-
asg_name = aws_autoscaling_group.gitlab_runner_instance.name
669-
cloudwatch_logging_retention_in_days = var.runner_cloudwatch.retention_days
670-
name_iam_objects = local.name_iam_objects
671-
name_docker_machine_runners = local.runner_tags_merged["Name"]
672-
role_permissions_boundary = var.iam_permissions_boundary == "" ? null : "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:policy/${var.iam_permissions_boundary}"
673-
kms_key_id = local.kms_key
680+
name = var.runner_terminate_ec2_lifecycle_hook_name == null ? "terminate-instances" : var.runner_terminate_ec2_lifecycle_hook_name
681+
environment = var.environment
682+
asg_arn = aws_autoscaling_group.gitlab_runner_instance.arn
683+
asg_name = aws_autoscaling_group.gitlab_runner_instance.name
684+
cloudwatch_logging_retention_in_days = var.runner_cloudwatch.retention_days
685+
name_iam_objects = local.name_iam_objects
686+
name_docker_machine_runners = local.runner_tags_merged["Name"]
687+
role_permissions_boundary = var.iam_permissions_boundary == "" ? null : "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:policy/${var.iam_permissions_boundary}"
688+
kms_key_id = local.kms_key
689+
asg_hook_terminating_heartbeat_timeout = local.runner_worker_graceful_terminate_timeout_duration
674690

675691
tags = local.tags
676692
}

modules/terminate-agent-hook/README.md

+1
@@ -162,6 +162,7 @@ No modules.
162162
| Name | Description | Type | Default | Required |
163163
|------|-------------|------|---------|:--------:|
164164
| <a name="input_asg_arn"></a> [asg\_arn](#input\_asg\_arn) | The ARN of the Auto Scaling Group to attach to. | `string` | n/a | yes |
165+
| <a name="input_asg_hook_terminating_heartbeat_timeout"></a> [asg\_hook\_terminating\_heartbeat\_timeout](#input\_asg\_hook\_terminating\_heartbeat\_timeout) | Duration the ASG should stay in the Terminating:Wait state. | `number` | `30` | no |
165166
| <a name="input_asg_name"></a> [asg\_name](#input\_asg\_name) | The name of the Auto Scaling Group to attach to. The 'environment' will be prefixed to this. | `string` | n/a | yes |
166167
| <a name="input_cloudwatch_logging_retention_in_days"></a> [cloudwatch\_logging\_retention\_in\_days](#input\_cloudwatch\_logging\_retention\_in\_days) | The number of days to retain logs in CloudWatch. | `number` | `30` | no |
167168
| <a name="input_enable_xray_tracing"></a> [enable\_xray\_tracing](#input\_enable\_xray\_tracing) | Enables X-Ray for debugging and analysis | `bool` | `false` | no |

modules/terminate-agent-hook/cloudwatch.tf

+1-1
@@ -12,7 +12,7 @@ resource "aws_cloudwatch_event_rule" "terminate_instances" {
1212
event_pattern = <<EOF
1313
{
1414
"source": ["aws.autoscaling"],
15-
"detail-type": ["EC2 Instance-terminate Lifecycle Action"],
15+
"detail-type": ["EC2 Instance Terminate Successful", "EC2 Instance Terminate Unsuccessful"],
1616
"detail": {
1717
"AutoScalingGroupName": ["${var.asg_name}"]
1818
}
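The effect of the changed `detail-type` can be illustrated with a small matcher. This is a simplified sketch that only handles exact-value lists and nested objects; real EventBridge patterns support more operators. The ASG name is a hypothetical placeholder for `${var.asg_name}`.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Every pattern key must match (AND); any listed value may match (OR)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested objects are matched recursively.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


NEW_RULE = {
    "source": ["aws.autoscaling"],
    "detail-type": [
        "EC2 Instance Terminate Successful",
        "EC2 Instance Terminate Unsuccessful",
    ],
    "detail": {"AutoScalingGroupName": ["my-runner-asg"]},  # hypothetical name
}
```

With this rule, the earlier `EC2 Instance-terminate Lifecycle Action` event no longer matches, so the Lambda runs only once termination has finished.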

modules/terminate-agent-hook/lambda/lambda_function.py

-3
@@ -234,9 +234,6 @@ def handler(event, context):
234234
"""
235235
event_detail = event['detail']
236236

237-
if event_detail['LifecycleTransition'] != "autoscaling:EC2_INSTANCE_TERMINATING":
238-
sys.exit()
239-
240237
client = boto3.client("ec2", region_name=event['region'])
241238

242239
# make sure that no new instances are created
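A likely reason the guard can simply be dropped: as far as I can tell, the terminate result events that now trigger the Lambda do not carry a `LifecycleTransition` key, so indexing it directly would fail. The sketch below shows the difference on a trimmed, made-up event; the helper name and all sample values are hypothetical, and the event shape is an assumption rather than something this diff confirms.

```python
from typing import Any, Dict, Optional, Tuple


def describe_event(event: Dict[str, Any]) -> Tuple[str, str, Optional[str]]:
    """Return (region, instance id, lifecycle transition or None) for an ASG event."""
    detail = event["detail"]
    # .get() tolerates events that carry no LifecycleTransition key at all.
    return event["region"], detail["EC2InstanceId"], detail.get("LifecycleTransition")


# Illustrative, trimmed event; all values are made up.
SAMPLE_TERMINATE_EVENT = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance Terminate Successful",
    "region": "eu-west-1",
    "detail": {
        "AutoScalingGroupName": "my-runner-asg",
        "EC2InstanceId": "i-0123456789abcdef0",
        "StatusCode": "InProgress",
    },
}
```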

modules/terminate-agent-hook/locals.tf

-3
This file was deleted.

modules/terminate-agent-hook/main.tf

+2-2
@@ -36,7 +36,7 @@ resource "aws_lambda_function" "terminate_runner_instances" {
3636
publish = true
3737
role = aws_iam_role.lambda.arn
3838
runtime = "python3.11"
39-
timeout = local.lambda_timeout
39+
timeout = 30
4040
kms_key_arn = var.kms_key_id
4141

4242
tags = var.tags
@@ -77,6 +77,6 @@ resource "aws_autoscaling_lifecycle_hook" "terminate_instances" {
7777
name = "${var.environment}-${var.name}"
7878
autoscaling_group_name = var.asg_name
7979
default_result = "CONTINUE"
80-
heartbeat_timeout = local.lambda_timeout + 20 # allow some extra time for cold starts
80+
heartbeat_timeout = var.asg_hook_terminating_heartbeat_timeout
8181
lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
8282
}
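With the longer heartbeat timeout, the instance stays in `Terminating:Wait` until the timeout elapses or something completes the lifecycle action. The following Python sketch shows how the hook could be located and the completion request assembled; the helper names are hypothetical, and the dictionaries follow the shape of the boto3 Auto Scaling calls `describe_lifecycle_hooks` and `complete_lifecycle_action`. The instance policy in this commit adds the matching `autoscaling:DescribeLifecycleHooks` and `autoscaling:CompleteLifecycleAction` permissions.

```python
from typing import Any, Dict, List, Optional

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def find_terminating_hook(hooks: List[Dict[str, Any]]) -> Optional[str]:
    """Return the name of the first terminating lifecycle hook, if any."""
    for hook in hooks:
        if hook.get("LifecycleTransition") == TERMINATING:
            return hook["LifecycleHookName"]
    return None


def complete_request(asg_name: str, hook_name: str, instance_id: str) -> Dict[str, str]:
    """Build keyword arguments for a complete-lifecycle-action call."""
    return {
        "AutoScalingGroupName": asg_name,
        "LifecycleHookName": hook_name,
        "InstanceId": instance_id,
        "LifecycleActionResult": "CONTINUE",
    }
```

With boto3 available, the hook list would come from `client.describe_lifecycle_hooks(AutoScalingGroupName=...)["LifecycleHooks"]`, and the request would be sent with `client.complete_lifecycle_action(**request)`.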

modules/terminate-agent-hook/variables.tf

+11
@@ -60,3 +60,14 @@ variable "enable_xray_tracing" {
6060
type = bool
6161
default = false
6262
}
63+
64+
variable "asg_hook_terminating_heartbeat_timeout" {
65+
description = "Duration in seconds the ASG should stay in the Terminating:Wait state."
66+
type = number
67+
default = 30
68+
69+
validation {
70+
condition = var.asg_hook_terminating_heartbeat_timeout >= 30 && var.asg_hook_terminating_heartbeat_timeout <= 7200
71+
error_message = "AWS only supports heartbeat timeout in the range of 30 to 7200."
72+
}
73+
}

policies/instance-docker-machine-policy.json

+3-1
@@ -18,7 +18,9 @@
1818
"ec2:CancelSpotInstanceRequests",
1919
"ec2:DescribeSubnets",
2020
"ec2:AssociateIamInstanceProfile",
21-
"ec2:CreateFleet"
21+
"ec2:CreateFleet",
22+
"autoscaling:CompleteLifecycleAction",
23+
"autoscaling:DescribeLifecycleHooks"
2224
],
2325
"Effect": "Allow",
2426
"Resource": "*"
