feat: add graceful terminate option to terminate-agent-hook #1099

Closed
wants to merge 4 commits into from
1 change: 1 addition & 0 deletions .cspell.json
@@ -70,6 +70,7 @@
"gitter",
"Niek",
"oxsecurity",
"redrive",
"signoff",
"typecheck",
"userdata",
2 changes: 1 addition & 1 deletion .pylintrc
@@ -1,5 +1,5 @@
[MASTER]
-init-hook="import sys; sys.path.insert(0, '/usr/local/lib/python3.11/site-packages/')"
+init-hook="import sys; sys.path.insert(0, '/usr/local/lib/python3.12/site-packages/')"

[FORMAT]
max-line-length=132
5 changes: 4 additions & 1 deletion README.md
@@ -45,7 +45,9 @@ The runner supports 3 main scenarios:

![runners-docker](https://github.com/cattle-ops/terraform-aws-gitlab-runner/raw/main/assets/images/runner-docker.png)

-For detailed concepts and usage please refer to [usage](docs/usage.md).
+For detailed information on usage please refer to [usage](docs/usage.md).
+
+Key concepts for module developers are explained in [concepts](docs/concepts.md).

## Contributors ✨

@@ -205,6 +207,7 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
| <a name="input_runner_worker_docker_services_volumes_tmpfs"></a> [runner\_worker\_docker\_services\_volumes\_tmpfs](#input\_runner\_worker\_docker\_services\_volumes\_tmpfs) | Mount a tmpfs in gitlab service container. https://docs.gitlab.com/runner/executors/docker.html#mounting-a-directory-in-ram | <pre>list(object({<br> volume = string<br> options = string<br> }))</pre> | `[]` | no |
| <a name="input_runner_worker_docker_volumes_tmpfs"></a> [runner\_worker\_docker\_volumes\_tmpfs](#input\_runner\_worker\_docker\_volumes\_tmpfs) | Mount a tmpfs in Executor container. https://docs.gitlab.com/runner/executors/docker.html#mounting-a-directory-in-ram | <pre>list(object({<br> volume = string<br> options = string<br> }))</pre> | `[]` | no |
| <a name="input_runner_worker_gitlab_pipeline"></a> [runner\_worker\_gitlab\_pipeline](#input\_runner\_worker\_gitlab\_pipeline) | post\_build\_script = Script to execute in the pipeline just after the build, but before executing after\_script.<br>pre\_build\_script = Script to execute in the pipeline just before the build.<br>pre\_clone\_script = Script to execute in the pipeline before cloning the Git repository. this can be used to adjust the Git client configuration first, for example. | <pre>object({<br> post_build_script = optional(string, "\"\"")<br> pre_build_script = optional(string, "\"\"")<br> pre_clone_script = optional(string, "\"\"")<br> })</pre> | `{}` | no |
| <a name="input_runner_worker_graceful_terminate"></a> [runner\_worker\_graceful\_terminate](#input\_runner\_worker\_graceful\_terminate) | Gracefully terminate runner instances, giving running jobs a chance to finish.<br><br>enabled = Enables or disables graceful termination.<br>timeout = Time in seconds to wait before aborting graceful termination and force-terminating the runner instances. Set this to the maximum duration of jobs using the runner; jobs that run longer than this value are interrupted.<br>retry\_period = Time in seconds between attempts to stop the gitlab-runner service.<br>job\_timeout = Time in seconds to wait for GitLab jobs to finish when stopping the gitlab-runner service. | <pre>object({<br> enabled = optional(bool, false)<br> timeout = optional(number, 1800)<br> retry_period = optional(number, 300)<br> job_timeout = optional(number, 3600)<br> })</pre> | `{}` | no |
| <a name="input_security_group_prefix"></a> [security\_group\_prefix](#input\_security\_group\_prefix) | Set the name prefix and overwrite the `Name` tag for all security groups. | `string` | `""` | no |
| <a name="input_subnet_id"></a> [subnet\_id](#input\_subnet\_id) | Subnet id used for the Runner and Runner Workers. Must belong to the `vpc_id`. In case the fleet mode is used, multiple subnets for<br>the Runner Workers can be provided with runner\_worker\_docker\_machine\_instance.subnet\_ids. | `string` | n/a | yes |
| <a name="input_suppressed_tags"></a> [suppressed\_tags](#input\_suppressed\_tags) | List of tag keys which are automatically removed and never added as default tag by the module. | `list(string)` | `[]` | no |
Binary file added assets/images/graceful_shutdown.png
3 changes: 3 additions & 0 deletions docs/concepts.md
@@ -0,0 +1,3 @@
# Graceful Termination

![Graceful Termination](../assets/images/graceful_shutdown.png)
5 changes: 5 additions & 0 deletions main.tf
@@ -79,6 +79,7 @@ locals {
use_fleet = var.runner_worker_docker_machine_fleet.enable
private_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].private_key_pem : ""
use_new_runner_authentication_gitlab_16 = var.runner_gitlab_registration_config.type != ""
runner_service_stop_timeout = var.runner_worker_graceful_terminate.job_timeout
})

template_runner_config = templatefile("${path.module}/template/runner-config.tftpl",
@@ -643,6 +644,10 @@ module "terminate_agent_hook" {
name_docker_machine_runners = local.runner_tags_merged["Name"]
role_permissions_boundary = var.iam_permissions_boundary == "" ? null : "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:policy/${var.iam_permissions_boundary}"
kms_key_id = local.kms_key
graceful_terminate_enabled = var.runner_worker_graceful_terminate.enabled
graceful_terminate_timeout = var.runner_worker_graceful_terminate.timeout
sqs_max_receive_count = ceil(var.runner_worker_graceful_terminate.timeout / var.runner_worker_graceful_terminate.retry_period) + 1
sqs_visibility_timeout = var.runner_worker_graceful_terminate.retry_period

tags = local.tags
}
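
The two SQS inputs above appear to be derived so that retries cover the whole graceful-terminate window. With the defaults (`timeout = 1800`, `retry_period = 300`) the receive count is `ceil(1800 / 300) + 1 = 7`, and with the README example (`600` / `60`) it is `11`. The sketch below shows one way the child module might consume them. The `aws_sqs_queue` arguments are real provider arguments, but the resource bodies and queue names are assumptions, because the module's SQS definitions are not part of this excerpt.

```hcl
# Hypothetical sketch, not the module's actual SQS definitions.
resource "aws_sqs_queue" "graceful_terminate_dlq" {
  count = var.graceful_terminate_enabled ? 1 : 0

  name = "${var.environment}-graceful-terminate-dlq" # assumed naming
  tags = var.tags
}

resource "aws_sqs_queue" "graceful_terminate_queue" {
  count = var.graceful_terminate_enabled ? 1 : 0

  name = "${var.environment}-graceful-terminate" # assumed naming

  # A message that was received but not deleted becomes visible again after
  # this many seconds, so each retry_period allows one more attempt.
  visibility_timeout_seconds = var.sqs_visibility_timeout

  # Once a message has been received sqs_max_receive_count times without
  # being deleted, SQS moves it to the dead-letter queue.
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.graceful_terminate_dlq[0].arn
    maxReceiveCount     = var.sqs_max_receive_count
  })

  tags = var.tags
}
```

The `+ 1` in the receive count leaves room for one final delivery after the timeout window has elapsed, presumably so the function can fall back to a forced termination before the message reaches the dead-letter queue.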
24 changes: 24 additions & 0 deletions modules/terminate-agent-hook/README.md
@@ -23,6 +23,10 @@ is, that no tags are added to the spot request by the docker+machine driver and
to our module. The rule is, that parts of the Executor's name become part of the related SSH key which is in turn part
of the spot request.

Graceful termination can optionally be enabled for this module with the `graceful_terminate_enabled` variable.
When enabled, the Lambda function attempts to stop the `gitlab-runner` service on the Runner before terminating
the Runner instances, which gives running jobs a chance to finish.

## Usage

### Default Behavior - Package With the Module
@@ -91,6 +95,13 @@ module "runner" {
expiration_days = 90
}

# optional, if excluded then the default terminate instances behavior will be used
runner_worker_graceful_terminate = {
enabled = true # defaults to false
timeout = 600
retry_period = 60
}

runner_gitlab_registration_config = {
type = "instance" # or "group" or "project"
# group_id = 1234 # for "group"
@@ -141,21 +152,34 @@ No modules.
| [aws_cloudwatch_event_rule.terminate_instances](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_rule) | resource |
| [aws_cloudwatch_event_target.terminate_instances](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_target) | resource |
| [aws_cloudwatch_log_group.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_group) | resource |
| [aws_iam_policy.asg_lifecycle](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
| [aws_iam_policy.graceful_terminate](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
| [aws_iam_policy.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
| [aws_iam_policy.spot_request_housekeeping](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
| [aws_iam_role.asg_lifecycle](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
| [aws_iam_role.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
| [aws_iam_role_policy_attachment.asg_lifecycle](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
| [aws_iam_role_policy_attachment.graceful_terminate](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
| [aws_iam_role_policy_attachment.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
| [aws_iam_role_policy_attachment.spot_request_housekeeping](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
| [aws_lambda_event_source_mapping.graceful_terminate](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_event_source_mapping) | resource |
| [aws_lambda_function.terminate_runner_instances](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function) | resource |
| [aws_lambda_function_event_invoke_config.graceful_terminate](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function_event_invoke_config) | resource |
| [aws_lambda_permission.current_version_triggers](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_permission) | resource |
| [aws_lambda_permission.unqualified_alias_triggers](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_permission) | resource |
| [archive_file.terminate_runner_instances_lambda](https://registry.terraform.io/providers/hashicorp/archive/latest/docs/data-sources/file) | data source |
| [aws_caller_identity.this](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/caller_identity) | data source |
| [aws_iam_policy_document.asg_lifecycle_assume_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/iam_policy_document) | data source |
| [aws_iam_policy_document.assume_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/iam_policy_document) | data source |
| [aws_iam_policy_document.asg_lifecycle](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/iam_policy_document) | data source |
| [aws_iam_policy_document.graceful_terminate](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/iam_policy_document) | data source |
| [aws_iam_policy_document.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/iam_policy_document) | data source |
| [aws_iam_policy_document.spot_request_housekeeping](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/iam_policy_document) | data source |
| [aws_partition.current](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/partition) | data source |
| [aws_region.this](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/region) | data source |
| [aws_sqs_queue.graceful_terminate_dlq](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/sqs_queue) | resource |
| [aws_sqs_queue.graceful_terminate_queue](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/sqs_queue) | resource |
| [aws_ssm_document.stop_gitlab_runner](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/ssm_document) | resource |

## Inputs

6 changes: 5 additions & 1 deletion modules/terminate-agent-hook/cloudwatch.tf
@@ -6,6 +6,8 @@
# ----------------------------------------------------------------------------

resource "aws_cloudwatch_event_rule" "terminate_instances" {
count = var.graceful_terminate_enabled ? 0 : 1

name = "${var.environment}-${var.name}"
description = "Trigger GitLab runner instance lifecycle hook on termination."

@@ -23,7 +25,9 @@ EOF
}

resource "aws_cloudwatch_event_target" "terminate_instances" {
-rule = aws_cloudwatch_event_rule.terminate_instances.name
+count = var.graceful_terminate_enabled ? 0 : 1
+
+rule = aws_cloudwatch_event_rule.terminate_instances[0].name
target_id = "${var.environment}-TriggerTerminateLambda"
arn = aws_lambda_function.terminate_runner_instances.arn
}
145 changes: 145 additions & 0 deletions modules/terminate-agent-hook/iam.tf
@@ -8,6 +8,75 @@ data "aws_region" "this" {}
# Terminate Instances - IAM Resources
# ----------------------------------------------------------------------------

################################################################################
### ASG IAM
################################################################################

data "aws_iam_policy_document" "asg_lifecycle_assume_role" {
count = var.graceful_terminate_enabled ? 1 : 0

statement {
actions = [
"sts:AssumeRole",
]
effect = "Allow"

principals {
identifiers = ["autoscaling.amazonaws.com"]
type = "Service"
}
}
}

resource "aws_iam_role" "asg_lifecycle" {
count = var.graceful_terminate_enabled ? 1 : 0

name = "${var.name_iam_objects}-${var.name}-asg-lifecycle"
description = "Role for the graceful terminate ASG lifecycle hook"
path = "/"
permissions_boundary = var.role_permissions_boundary
assume_role_policy = data.aws_iam_policy_document.asg_lifecycle_assume_role[0].json
force_detach_policies = true
tags = var.tags
}

# This IAM policy is used by the ASG lifecycle hook.
data "aws_iam_policy_document" "asg_lifecycle" {
count = var.graceful_terminate_enabled ? 1 : 0

# Permit the GitLab Runner ASG to send messages to SQS
statement {
sid = "ASGLifecycleSqs"
actions = [
"sqs:SendMessage",
"sqs:GetQueueUrl"
]
resources = [aws_sqs_queue.graceful_terminate_queue[0].arn]
effect = "Allow"
}
}

resource "aws_iam_policy" "asg_lifecycle" {
count = var.graceful_terminate_enabled ? 1 : 0

name = "${var.name_iam_objects}-${var.name}-asg-lifecycle"
path = "/"
policy = data.aws_iam_policy_document.asg_lifecycle[0].json

tags = var.tags
}

resource "aws_iam_role_policy_attachment" "asg_lifecycle" {
count = var.graceful_terminate_enabled ? 1 : 0

role = aws_iam_role.asg_lifecycle[0].name
policy_arn = aws_iam_policy.asg_lifecycle[0].arn
}
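
The role and policy above let Auto Scaling publish lifecycle notifications to the graceful-terminate queue. A minimal sketch of a lifecycle hook that could use them follows. The hook itself is not part of this excerpt, so the resource name, the hook name, and `var.asg_name` are hypothetical; the arguments are those of the provider's `aws_autoscaling_lifecycle_hook` resource.

```hcl
# Hypothetical sketch: the hook name and var.asg_name are assumptions.
resource "aws_autoscaling_lifecycle_hook" "graceful_terminate" {
  count = var.graceful_terminate_enabled ? 1 : 0

  name                   = "${var.environment}-${var.name}-graceful-terminate"
  autoscaling_group_name = var.asg_name # assumed input, not shown in this diff

  # Pause the instance in Terminating:Wait and notify the queue.
  lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"

  # If nobody completes the lifecycle action before the heartbeat expires,
  # let the termination proceed.
  default_result    = "CONTINUE"
  heartbeat_timeout = var.graceful_terminate_timeout

  notification_target_arn = aws_sqs_queue.graceful_terminate_queue[0].arn
  role_arn                = aws_iam_role.asg_lifecycle[0].arn
}
```

Under this sketch, the heartbeat timeout caps how long the instance stays in the wait state, which is why it is tied to the same timeout value that drives the SQS receive count.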

################################################################################
### Lambda IAM
################################################################################

data "aws_iam_policy_document" "assume_role" {
statement {
actions = [
@@ -134,6 +203,65 @@ data "aws_iam_policy_document" "spot_request_housekeeping" {
}
}

data "aws_iam_policy_document" "graceful_terminate" {
count = var.graceful_terminate_enabled ? 1 : 0

# Permit the function to process SQS messages
statement {
sid = "GitLabRunnerGracefulTerminateSQS"
actions = [
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:ReceiveMessage"
]
effect = "Allow"
resources = [
resource.aws_sqs_queue.graceful_terminate_queue[0].arn
]
}

# Permit the function to invoke the SSM document for stopping gitlab-runner
statement {
sid = "GitLabRunnerGracefulTerminateSSMSend"
actions = [
"ssm:SendCommand"
]
effect = "Allow"
resources = [
resource.aws_ssm_document.stop_gitlab_runner[0].arn
]
}

# Permit the function to send SSM commands to the GitLab Runner instance
statement {
sid = "GitLabRunnerGracefulTerminateSSMSendEC2"
actions = [
"ssm:SendCommand"
]
effect = "Allow"
resources = [
"arn:${data.aws_partition.current.partition}:ec2:${data.aws_region.this.name}:${data.aws_caller_identity.this.account_id}:instance/*"
Collaborator (review comment): question: can we restrict the instances with the EC2 instance tag gitlab-runner-parent-id?

Contributor Author (reply): The SSM command is sent to the GitLab Runner instance, not to the workers, and the Runner instance doesn't have the gitlab-runner-parent-id tag.

]
condition {
test = "StringLike"
variable = "ssm:ResourceTag/Name"
values = ["${var.environment}*"]
}
}

# Permit the function to get SSM command invocation details
statement {
sid = "GitLabRunnerGracefulTerminateSSMGet"
actions = [
"ssm:GetCommandInvocation"
]
effect = "Allow"
resources = [
"*"
]
}
}

resource "aws_iam_policy" "lambda" {
name = "${var.name_iam_objects}-${var.name}-lambda"
path = "/"
@@ -159,3 +287,20 @@ resource "aws_iam_role_policy_attachment" "spot_request_housekeeping" {
role = aws_iam_role.lambda.name
policy_arn = aws_iam_policy.spot_request_housekeeping.arn
}

resource "aws_iam_policy" "graceful_terminate" {
count = var.graceful_terminate_enabled ? 1 : 0

name = "${var.name_iam_objects}-${var.name}-graceful-terminate"
path = "/"
policy = data.aws_iam_policy_document.graceful_terminate[0].json

tags = var.tags
}

resource "aws_iam_role_policy_attachment" "graceful_terminate" {
count = var.graceful_terminate_enabled ? 1 : 0

role = aws_iam_role.lambda.name
policy_arn = aws_iam_policy.graceful_terminate[0].arn
}
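
The `sqs:ReceiveMessage`, `sqs:DeleteMessage` and `sqs:GetQueueAttributes` actions granted in the `GitLabRunnerGracefulTerminateSQS` statement are the permissions Lambda needs to poll an SQS queue through an event source mapping. The README lists `aws_lambda_event_source_mapping.graceful_terminate`, but its definition is not in this excerpt, so the following is only a sketch of what it might look like. The `batch_size` value is an assumption.

```hcl
# Hypothetical sketch of the mapping named in the README resource list.
resource "aws_lambda_event_source_mapping" "graceful_terminate" {
  count = var.graceful_terminate_enabled ? 1 : 0

  event_source_arn = aws_sqs_queue.graceful_terminate_queue[0].arn
  function_name    = aws_lambda_function.terminate_runner_instances.arn

  # Assumption: handle one lifecycle notification per invocation so that a
  # failed attempt only returns that single message to the queue.
  batch_size = 1
}
```

With graceful termination enabled, the CloudWatch event rule and target are not created (`count = 0`), so a mapping like this would be what triggers the function instead.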