feat: add graceful terminate option to terminate-agent-hook #1099
Conversation
Hey @long-wan-ep! 👋 Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process. Make sure that this PR clearly explains:
By submitting this PR you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE. The following ChatOps commands are supported:
Simply add a comment with the command in the first line. If you need to pass more information, separate it with a blank line from the command. This message was generated automatically. You are welcome to improve it.
Currently, if graceful terminate is enabled, the runner is highly likely to run into #1062, since graceful terminate delays runner termination during a runner refresh. I think we should tackle that issue first before we implement this feature.
Thanks for your work, @long-wan-ep
I am still trying to understand the termination process. It looks like we are not waiting until all jobs on the workers/executors are finished, but instead killing them at the time of GitLab Runner shutdown.
]
effect = "Allow"
resources = [
  "arn:${data.aws_partition.current.partition}:ec2:${data.aws_region.this.name}:${data.aws_caller_identity.this.account_id}:instance/*"
question: can we restrict the instances with the EC2 instance tag gitlab-runner-parent-id?
The SSM command is sent to the GitLab Runner instance, not the workers, and the Runner instance does not have the gitlab-runner-parent-id tag.
variable "runner_worker_graceful_terminate" { | ||
description = "Gracefully terminate Runner Worker, by waiting a set amount of time for running jobs to finish before termination." | ||
type = object({ | ||
enabled = optional(bool, false) | ||
timeout = optional(number, 1800) | ||
retry_period = optional(number, 300) | ||
job_timeout = optional(number, 3600) | ||
}) | ||
default = {} | ||
} |
suggestion: remove this variable and enable the graceful termination always. Is there any reason why someone should not use this?
Edit: Just amending the docs. It sounds like you massively improved the termination of the instances. So from my perspective there is no reason not to use it by default.
I made it an optional feature because I thought there might be users who don't need/want graceful termination and want to keep the existing behavior, but I agree that it should be an improvement for most users. We could set the enabled default to true if we want to give users the choice, or if you don't think that's necessary, we can remove the enabled field and make it mandatory.
if command_response['Status'] == "Success":
    print(json.dumps({
        "Level": "info",
        "Message": f"gitlab-runner service stopped, SSM command response: {command_response}"
nitpick: This seems to be incorrect as we do not know if the service has been stopped already. It might still be running, waiting for jobs to be finished, right?
The SSM command sent to the runner instance will stop the gitlab-runner service and only return success if the service is inactive.
So the waiter above returns only in case the service has been stopped? This could last up to 1800s (by default), but as far as I understand the waiter, it definitely returns after 3*10 seconds, doesn't it? Is this a problem? Wouldn't we return a failure then?
The SSM command sent to the instance does not wait for the service to finish stopping (up to 1800s); it will:
- attempt to stop gitlab-runner.service (non-blocking)
- wait 5s
- check the status of gitlab-runner.service
- exit 0 (SSM command status = "Success") if the service is stopped and the docker machines are removed, exit 1 (SSM command status = "Failed") otherwise

This if statement checks the status of the SSM command, not the waiter. Also, if the waiter reaches max retries and the SSM command hasn't finished running, it should fail and the function will exit with a failure to be retried via SQS.
Sorry it's a bit confusing, let me know if that doesn't make sense.
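To make that flow concrete, here is a minimal, hypothetical sketch of how the non-blocking stop, the bounded waiter, and the final status check could fit together with boto3. This is not the PR's actual code: the function name, the script contents, and the WaiterConfig values are assumptions for illustration.

import boto3

ssm_client = boto3.client("ssm")

def stop_gitlab_runner_service(instance_id):
    # Hypothetical: ask the service to stop without blocking, then report
    # whether it is already inactive (the real command also checks that the
    # docker machines were removed).
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [
            "systemctl stop --no-block gitlab-runner.service",
            "sleep 5",
            "systemctl is-active --quiet gitlab-runner.service && exit 1 || exit 0",
        ]},
    )
    command_id = response["Command"]["CommandId"]

    # The waiter only polls a limited number of times; it does not wait for the
    # full graceful-terminate timeout. On failure or timeout it raises a
    # WaiterError, which makes the Lambda fail and be retried via SQS.
    waiter = ssm_client.get_waiter("command_executed")
    waiter.wait(
        CommandId=command_id,
        InstanceId=instance_id,
        WaiterConfig={"Delay": 10, "MaxAttempts": 3},
    )

    # The decisive check is the SSM command status, not the waiter itself.
    invocation = ssm_client.get_command_invocation(
        CommandId=command_id, InstanceId=instance_id
    )
    return invocation["Status"] == "Success"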
  # find the executors connected to this agent and terminate them as well
- _terminate_list = ec2_list(client=client, parent=event_detail['EC2InstanceId'])
+ _terminate_list = ec2_list(client=ec2_client, parent=instance_id)
issue: we shouldn't terminate the instances in case of a graceful shutdown. This is done by the GitLab Runner itself. Otherwise the shutdown is no longer graceful.
This section would only run after:
- graceful terminate is successful, OR
- we reach the graceful terminate timeout

In the first case, the SSM command should clean up the workers using docker-machine rm, though if for some reason it fails, the workers could be left running. In the second case, I believe it's possible for the runner instance to be terminated by the lifecycle hook before GitLab Runner can remove the workers, leaving the workers orphaned. I think having this section would act as a backup for these situations.
Yes, sounds reasonable to me. Let's keep it here.
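As a rough illustration of the backup cleanup discussed above, a sketch of listing and terminating leftover workers by their parent tag follows. The function name and filter details are assumptions; the module's real ec2_list helper may differ.

import boto3

ec2_client = boto3.client("ec2")

def terminate_orphaned_workers(parent_instance_id):
    # Hypothetical backup cleanup: find workers tagged with the parent runner's
    # instance id (tag name taken from the review discussion above) and
    # terminate any that are still around.
    reservations = ec2_client.describe_instances(
        Filters=[
            {"Name": "tag:gitlab-runner-parent-id", "Values": [parent_instance_id]},
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]

    if instance_ids:
        # Only reached after graceful terminate succeeded or timed out, so this
        # acts as a safety net rather than the primary shutdown path.
        ec2_client.terminate_instances(InstanceIds=instance_ids)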
@@ -270,7 +434,7 @@ def handler(event, context):
            "Message": "No instances to terminate."
        }))

-   remove_unused_ssh_key_pairs(client=client, executor_name_part=os.environ['NAME_EXECUTOR_INSTANCE'])
+   remove_unused_ssh_key_pairs(client=ec2_client, executor_name_part=os.environ['NAME_EXECUTOR_INSTANCE'])
note: as far as I believe, we can remove the SSH keys immediately. They are copied during startup into local files on the GitLab Runner instance and thus no longer needed.
I have this line after the graceful terminate section to prevent it from running multiple times, since the graceful terminate section of the code is allowed to fail and retry.
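To illustrate the ordering being described, here is a rough control-flow sketch of the handler, reusing the hypothetical helpers from the sketches above plus the module's remove_unused_ssh_key_pairs. It is not the PR's actual handler; only the ordering and the retry behavior it implies are the point.

import os
import boto3

ec2_client = boto3.client("ec2")

def handler(event, context):
    # Sketch of the ordering only; the helpers are placeholders for the real logic.
    instance_id = event["detail"]["EC2InstanceId"]

    # 1. Graceful terminate: raising here makes the invocation fail, so the
    #    message is retried via SQS and nothing below runs on this attempt.
    if not stop_gitlab_runner_service(instance_id):
        raise RuntimeError("graceful terminate not finished yet, retry via SQS")

    # 2. Backup cleanup of any workers the runner could not remove itself.
    terminate_orphaned_workers(instance_id)

    # 3. SSH key-pair cleanup runs last so it is not executed repeatedly across retries.
    remove_unused_ssh_key_pairs(
        client=ec2_client,
        executor_name_part=os.environ["NAME_EXECUTOR_INSTANCE"],
    )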
Since #1062 has been resolved, this is no longer blocked.
@long-wan-ep There is another PR ongoing (#1117). At first view, it seems that their implementation is more robust. Could you please have a look and comment on that?
@kayman-mk The implementation from #1117 is much simpler and works; let's close this MR in favour of the other solution.
Description
Resolves #1029.
Adds optional functionality to the terminate-agent-hook module for graceful termination, which users can enable/configure using new input variables. When graceful terminate is enabled, the lambda will give running jobs a chance to finish before the runner instances are terminated. When graceful terminate is disabled, the original behavior of the terminate-agent-hook is used.

Updated .pylintrc to use /usr/local/lib/python3.12/site-packages/, because pylint was not able to find the boto3 dependencies in /usr/local/lib/python3.11/site-packages/. It seems like MegaLinter is using Python 3.12 now.

Migrations required
No
Verification
With graceful terminate enabled
With graceful terminate disabled