
feat: add graceful terminate option to terminate-agent-hook #1099


Closed
long-wan-ep wants to merge 4 commits from the graceful-terminate branch

Conversation

Contributor

@long-wan-ep commented Mar 14, 2024

Description

Resolves #1029.

Adds optional functionality to the terminate-agent-hook module for graceful termination, which users can enable and configure via new input variables. When graceful terminate is enabled, the lambda gives running jobs a chance to finish before the runner instances are terminated. When graceful terminate is disabled, the original behavior of the terminate-agent-hook is kept.

Updated .pylintrc to use /usr/local/lib/python3.12/site-packages/, because pylint was not able to find the boto3 dependencies in /usr/local/lib/python3.11/site-packages/. It seems MegaLinter is using Python 3.12 now.

Migrations required

No

Verification

With graceful terminate enabled

  1. Manually deployed a runner with:

     runner_worker_graceful_terminate = {
         enabled      = true
         retry_period = 60
         timeout      = 300
     }

  2. Started a GitLab job using the runner
  3. Manually triggered an instance refresh in the runner ASG (see the sketch after the verification steps)
  4. Verified the new graceful termination steps in the lambda run successfully
  5. Verified the existing steps in the lambda run successfully

With graceful terminate disabled

  1. Manually deployed a runner
  2. Started a GitLab job using the runner
  3. Manually triggered an instance refresh in the runner ASG
  4. Verified the lambda runs successfully without graceful termination steps
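
For reference, a minimal sketch of how the "manually triggered an instance refresh" step can be performed, assuming boto3's start_instance_refresh API (the ASG name below is a placeholder):

import boto3

autoscaling_client = boto3.client("autoscaling")

# Kick off an instance refresh so the terminate-agent-hook lambda is invoked
# for the replaced runner instance. Replace the ASG name with your runner's ASG.
response = autoscaling_client.start_instance_refresh(
    AutoScalingGroupName="my-gitlab-runner-asg",
)
print(response["InstanceRefreshId"])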

Contributor

Hey @long-wan-ep! 👋

Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process.

Make sure that this PR clearly explains:

  • the problem being solved
  • the best way a reviewer and you can test your changes

By submitting this PR you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE.

The following ChatOps commands are supported:

  • /help: notifies a maintainer to help you out

Simply add a comment with the command in the first line. If you need to pass more information, separate it with a blank line from the command.

This message was generated automatically. You are welcome to improve it.

@long-wan-ep changed the title from "Graceful terminate" to "feat: Add graceful terminate module" on Mar 14, 2024
@long-wan-ep force-pushed the graceful-terminate branch 2 times, most recently from 4685b74 to e2ca0d1, on March 15, 2024 at 18:42
@long-wan-ep changed the title from "feat: Add graceful terminate module" to "feat: Add graceful terminate option to terminate-agent-hook" on Mar 15, 2024
@long-wan-ep changed the title from "feat: Add graceful terminate option to terminate-agent-hook" to "feat: add graceful terminate option to terminate-agent-hook" on Mar 15, 2024
@long-wan-ep marked this pull request as ready for review on March 15, 2024 at 21:27
@long-wan-ep
Contributor Author

Currently if graceful terminate is enabled, the runner is highly likely to run into #1062, since graceful terminate delays runner termination during a runner refresh. I think we should tackle that issue first before we implement this feature.

@kayman-mk
Collaborator

#1062 will be solved via #1102 soon.

Collaborator

@kayman-mk left a comment

Thanks for your work, @long-wan-ep

I am still trying to understand the termination process. It looks like we are not waiting until all jobs on the workers/executors are finished, but are killing them at the time of the GitLab Runner shutdown.

  ]
  effect = "Allow"
  resources = [
    "arn:${data.aws_partition.current.partition}:ec2:${data.aws_region.this.name}:${data.aws_caller_identity.this.account_id}:instance/*"
Collaborator

question: can we restrict this to instances carrying the EC2 instance tag gitlab-runner-parent-id?

Contributor Author

The SSM command is sent to the GitLab Runner instance, not the workers, and that instance doesn't have the gitlab-runner-parent-id tag.

Comment on lines +753 to +762
variable "runner_worker_graceful_terminate" {
description = "Gracefully terminate Runner Worker, by waiting a set amount of time for running jobs to finish before termination."
type = object({
enabled = optional(bool, false)
timeout = optional(number, 1800)
retry_period = optional(number, 300)
job_timeout = optional(number, 3600)
})
default = {}
}
Collaborator

@kayman-mk commented Mar 22, 2024

suggestion: remove this variable and enable the graceful termination always. Is there any reason why someone should not use this?

Edit: Just amending the docs. It sounds like you massively improved the termination of the instances. So from my perspective there is no reason not to use it by default.

Contributor Author

I made it an optional feature because I thought there might be users who don't need or want graceful termination and prefer to keep the existing behavior, but I agree that it should be an improvement for most users. We could set the enabled default to true if we want to give users the choice, or, if you don't think that's necessary, we can remove the enabled field and make graceful termination mandatory.

if command_response['Status'] == "Success":
    print(json.dumps({
        "Level": "info",
        "Message": f"gitlab-runner service stopped, SSM command response: {command_response}"
Collaborator

nitpick: This seems to be incorrect, as we do not know whether the service has been stopped yet. It might still be running, waiting for jobs to finish, right?

Contributor Author

The SSM command sent to the runner instance will stop the gitlab-runner service and only return success if the service is inactive.

Collaborator

So the waiter above only returns once the service has been stopped? That could take up to 1800s (by default), but as far as I understand the waiter, it definitely returns after 3*10 seconds, doesn't it? Is this a problem? Wouldn't we return a failure then?

Contributor Author

The SSM command sent to the instance does not wait for the service to finish stopping (up to 1800s); it will:

  1. attempt to stop gitlab-runner.service (non-blocking)
  2. wait 5s
  3. check the status of gitlab-runner.service
  4. exit 0 (SSM command status = "Success") if the service is stopped and the docker machines have been removed, exit 1 (SSM command status = "Failed") otherwise

This if statement checks the status of the SSM command, not the waiter. Also, if the waiter reaches max retries and the SSM command hasn't finished running, then it should fail and the function will exit with failure to be retried via SQS.

Sorry it's a bit confusing, let me know if that doesn't make sense.
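
For illustration, here is a minimal sketch of this flow on the Lambda side, assuming boto3's send_command, the command_executed waiter, and get_command_invocation; the exact shell commands, waiter settings, and the docker-machine cleanup check in the actual module may differ:

import json

import boto3
from botocore.exceptions import WaiterError

ssm_client = boto3.client("ssm")

def stop_gitlab_runner_service(instance_id):
    # Ask the runner instance to stop gitlab-runner.service without blocking,
    # wait 5s, then report success only if the service is no longer active.
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [
            "systemctl --no-block stop gitlab-runner.service",
            "sleep 5",
            "systemctl is-active --quiet gitlab-runner.service && exit 1 || exit 0",
        ]},
    )
    command_id = response["Command"]["CommandId"]

    # The waiter polls the command result and gives up after roughly
    # Delay * MaxAttempts seconds; on timeout or a failed command it raises
    # WaiterError, which the lambda surfaces as a failure so the message is
    # retried via SQS.
    try:
        ssm_client.get_waiter("command_executed").wait(
            CommandId=command_id,
            InstanceId=instance_id,
            WaiterConfig={"Delay": 10, "MaxAttempts": 3},
        )
    except WaiterError:
        raise RuntimeError("gitlab-runner service did not stop yet, retry via SQS")

    command_response = ssm_client.get_command_invocation(
        CommandId=command_id, InstanceId=instance_id
    )
    if command_response["Status"] == "Success":
        print(json.dumps({
            "Level": "info",
            "Message": f"gitlab-runner service stopped, SSM command response: {command_response}",
        }))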


  # find the executors connected to this agent and terminate them as well
- _terminate_list = ec2_list(client=client, parent=event_detail['EC2InstanceId'])
+ _terminate_list = ec2_list(client=ec2_client, parent=instance_id)
Collaborator

issue: we shouldn't terminate the instances in case of a graceful shutdown. This is done by the GitLab Runner itself. Otherwise the shutdown is no longer graceful.

Contributor Author

This section would only run after:

  1. graceful terminate is successful, OR
  2. we reach the graceful terminate timeout

In the first case, the SSM command should clean up the workers using docker-machine rm, though if for some reason it fails, the workers could be left running.

In the second case, I believe it's possible for the runner instance to be terminated by the lifecycle hook before GitLab Runner can remove the workers, leaving the workers orphaned.

I think having this section acts as a backup for these situations.
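
For context, a rough sketch of what this backup path amounts to, assuming the workers are tagged with gitlab-runner-parent-id as mentioned above; the helper names and exact filters in the module may differ:

import boto3

def list_worker_instances(ec2_client, parent_instance_id):
    # Illustrative stand-in for ec2_list: find workers tagged with the
    # parent runner's instance id that are still starting or running.
    response = ec2_client.describe_instances(
        Filters=[
            {"Name": "tag:gitlab-runner-parent-id", "Values": [parent_instance_id]},
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]

def terminate_orphaned_workers(ec2_client, parent_instance_id):
    # Backup cleanup: terminate any workers that graceful termination or
    # docker-machine rm left behind.
    instance_ids = list_worker_instances(ec2_client, parent_instance_id)
    if instance_ids:
        ec2_client.terminate_instances(InstanceIds=instance_ids)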

Collaborator

Yes, sounds reasonable to me. Let's keep it here.

@@ -270,7 +434,7 @@ def handler(event, context):
        "Message": "No instances to terminate."
    }))

-   remove_unused_ssh_key_pairs(client=client, executor_name_part=os.environ['NAME_EXECUTOR_INSTANCE'])
+   remove_unused_ssh_key_pairs(client=ec2_client, executor_name_part=os.environ['NAME_EXECUTOR_INSTANCE'])
Collaborator

note: as far as I can tell, we can remove the SSH keys immediately. They are copied into local files on the GitLab Runner instance during startup and thus are no longer needed.

Contributor Author

I placed this line after the graceful terminate section to prevent it from running multiple times, since the graceful terminate section of the code is allowed to fail and retry.

@long-wan-ep
Contributor Author

Since #1062 has been resolved, this is no longer blocked.

@kayman-mk
Collaborator

@long-wan-ep There is another PR in progress (#1117). At first glance their implementation seems more robust. Could you please have a look and comment on it?

@long-wan-ep
Contributor Author

@kayman-mk The implementation from #1117 is much simpler and works; let's close this PR in favour of the other solution.

Development

Successfully merging this pull request may close these issues.

Option to gracefully terminate runner