feat: add graceful terminate option to terminate-agent-hook #1099
Conversation
Hey @long-wan-ep! 👋 Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process. Make sure that this PR clearly explains:
By submitting this PR you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE. The following ChatOps commands are supported:
Simply add a comment with the command in the first line. If you need to pass more information, separate it with a blank line from the command. This message was generated automatically. You are welcome to improve it.
Currently, if graceful terminate is enabled, the runner is highly likely to run into #1062, since graceful terminate delays runner termination during a runner refresh. I think we should tackle that issue first before we implement this feature.
Thanks for your work, @long-wan-ep
I am still trying to understand the termination process. It looks like we are not waiting until all jobs on the workers/executors are finished, but instead killing them at the time of GitLab Runner shutdown.
]
effect = "Allow"
resources = [
  "arn:${data.aws_partition.current.partition}:ec2:${data.aws_region.this.name}:${data.aws_caller_identity.this.account_id}:instance/*"
question: can we restrict the instances with the EC2 instance tag gitlab-runner-parent-id?
The SSM command is sent to the GitLab Runner instance, not the workers, and the Runner instance does not have the gitlab-runner-parent-id tag.
variable "runner_worker_graceful_terminate" { | ||
description = "Gracefully terminate Runner Worker, by waiting a set amount of time for running jobs to finish before termination." | ||
type = object({ | ||
enabled = optional(bool, false) | ||
timeout = optional(number, 1800) | ||
retry_period = optional(number, 300) | ||
job_timeout = optional(number, 3600) | ||
}) | ||
default = {} | ||
} |
suggestion: remove this variable and enable the graceful termination always. Is there any reason why someone should not use this?
Edit: Just amending the docs. It sounds like you massively improved the termination of the instances. So from my perspective there is no reason not to use it by default.
I made it an optional feature because I thought there might be users who don't need/want graceful termination and want to keep the existing behavior, but I agree that it should be an improvement for most users. We could set the enabled default to true if we want to give users the choice, or if you don't think that's necessary, we can remove the enabled field and make it mandatory.
if command_response['Status'] == "Success":
    print(json.dumps({
        "Level": "info",
        "Message": f"gitlab-runner service stopped, SSM command response: {command_response}"
nitpick: This seems to be incorrect as we do not know if the service has been stopped already. It might still be running, waiting for jobs to be finished, right?
The SSM command sent to the runner instance will stop the gitlab-runner service and only return success if the service is inactive.
So the waiter above returns only in case the service has been stopped? This could last up to 1800s (by default), but as far as I understand the waiter, it definitely returns after 3*10 seconds, doesn't it? Is this a problem? Wouldn't we return a failure then?
The SSM command sent to the instance does not wait for the service to finish stopping (up to 1800s); it will:
- attempt to stop gitlab-runner.service (non-blocking)
- wait 5s
- check the status of gitlab-runner.service
- exit 0 (SSM command status = "Success") if the service is stopped and the docker machines are removed, exit 1 (SSM command status = "Failed") otherwise

This if statement checks the status of the SSM command, not the waiter. Also, if the waiter reaches max retries and the SSM command hasn't finished running, it should fail and the function will exit with a failure to be retried via SQS.
Sorry it's a bit confusing, let me know if that doesn't make sense.
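To make that flow concrete, here is a minimal, hypothetical sketch of how the non-blocking stop, the bounded waiter, and the final status check could fit together with boto3. This is not the PR's actual code: the function name, the script contents, and the WaiterConfig values are assumptions for illustration.

import boto3

ssm_client = boto3.client("ssm")

def stop_gitlab_runner_service(instance_id):
    # Hypothetical: ask the service to stop without blocking, then report
    # whether it is already inactive (the real command also checks that the
    # docker machines were removed).
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [
            "systemctl stop --no-block gitlab-runner.service",
            "sleep 5",
            "systemctl is-active --quiet gitlab-runner.service && exit 1 || exit 0",
        ]},
    )
    command_id = response["Command"]["CommandId"]

    # The waiter only polls a limited number of times; it does not wait for the
    # full graceful-terminate timeout. On failure or timeout it raises a
    # WaiterError, which makes the Lambda fail and be retried via SQS.
    waiter = ssm_client.get_waiter("command_executed")
    waiter.wait(
        CommandId=command_id,
        InstanceId=instance_id,
        WaiterConfig={"Delay": 10, "MaxAttempts": 3},
    )

    # The decisive check is the SSM command status, not the waiter itself.
    invocation = ssm_client.get_command_invocation(
        CommandId=command_id, InstanceId=instance_id
    )
    return invocation["Status"] == "Success"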
  # find the executors connected to this agent and terminate them as well
- _terminate_list = ec2_list(client=client, parent=event_detail['EC2InstanceId'])
+ _terminate_list = ec2_list(client=ec2_client, parent=instance_id)
issue: we shouldn't terminate the instances in case of a graceful shutdown. This is done by the GitLab Runner itself. Otherwise the shutdown is no longer graceful.
This section would only run after:
- graceful terminate is successful, OR
- we reach the graceful terminate timeout

In the first case, the SSM command should clean up the workers using docker-machine rm, though if for some reason it fails, the workers could be left running. In the second case, I believe it's possible for the runner instance to be terminated by the lifecycle hook before GitLab Runner can remove the workers, leaving the workers orphaned. I think having this section would act as a backup for these situations.
Yes, sounds reasonable to me. Let's keep it here.
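As a rough illustration of the backup cleanup discussed above, a sketch of listing and terminating leftover workers by their parent tag follows. The function name and filter details are assumptions; the module's real ec2_list helper may differ.

import boto3

ec2_client = boto3.client("ec2")

def terminate_orphaned_workers(parent_instance_id):
    # Hypothetical backup cleanup: find workers tagged with the parent runner's
    # instance id (tag name taken from the review discussion above) and
    # terminate any that are still around.
    reservations = ec2_client.describe_instances(
        Filters=[
            {"Name": "tag:gitlab-runner-parent-id", "Values": [parent_instance_id]},
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]

    if instance_ids:
        # Only reached after graceful terminate succeeded or timed out, so this
        # acts as a safety net rather than the primary shutdown path.
        ec2_client.terminate_instances(InstanceIds=instance_ids)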
@@ -270,7 +434,7 @@ def handler(event, context):
            "Message": "No instances to terminate."
        }))

-   remove_unused_ssh_key_pairs(client=client, executor_name_part=os.environ['NAME_EXECUTOR_INSTANCE'])
+   remove_unused_ssh_key_pairs(client=ec2_client, executor_name_part=os.environ['NAME_EXECUTOR_INSTANCE'])
note: as far as I believe, we can remove the SSH keys immediately. They are copied during startup into local files on the GitLab Runner instance and thus no longer needed.
I have this line after the graceful terminate section to prevent it from running multiple times, since the graceful terminate section of the code is allowed to fail and retry.
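To illustrate the ordering being described, here is a rough control-flow sketch of the handler, reusing the hypothetical helpers from the sketches above plus the module's remove_unused_ssh_key_pairs. It is not the PR's actual handler; only the ordering and the retry behavior it implies are the point.

import os
import boto3

ec2_client = boto3.client("ec2")

def handler(event, context):
    # Sketch of the ordering only; the helpers are placeholders for the real logic.
    instance_id = event["detail"]["EC2InstanceId"]

    # 1. Graceful terminate: raising here makes the invocation fail, so the
    #    message is retried via SQS and nothing below runs on this attempt.
    if not stop_gitlab_runner_service(instance_id):
        raise RuntimeError("graceful terminate not finished yet, retry via SQS")

    # 2. Backup cleanup of any workers the runner could not remove itself.
    terminate_orphaned_workers(instance_id)

    # 3. SSH key-pair cleanup runs last so it is not executed repeatedly across retries.
    remove_unused_ssh_key_pairs(
        client=ec2_client,
        executor_name_part=os.environ["NAME_EXECUTOR_INSTANCE"],
    )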
Since #1062 has been resolved, this is no longer blocked.
@long-wan-ep There is another PR ongoing (#1117). At first view, it seems that their implementation is more robust. Could you please have a look and comment on that?
@kayman-mk The implementation from #1117 is much simpler and works; let's close this MR in favour of the other solution.
Description
Resolves #1029.
Adds optional functionality to the terminate-agent-hook module for graceful termination, which users can enable/configure using new input variables. When graceful terminate is enabled, the lambda will give running jobs a chance to finish before the runner instances are terminated. When graceful terminate is disabled, the original behavior of the terminate-agent-hook is used.

Updated .pylintrc to use /usr/local/lib/python3.12/site-packages/, because pylint was not able to find the boto3 dependencies in /usr/local/lib/python3.11/site-packages/. It seems like MegaLinter is using Python 3.12 now.

Migrations required
No
Verification
With graceful terminate enabled
With graceful terminate disabled