Old runners not cleaned up after ASG update #214
Comments
@martinhanzik I think you are right; personally I do not use the
Maybe ASG events like https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html could be used: have a process read the termination notification and perform a clean shutdown of all the docker-machine instances.
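The lifecycle-hook idea could look roughly like this in Terraform. A minimal sketch only: the hook name, the ASG resource reference, and the timeout are assumptions, not something this module ships.

```hcl
# Hypothetical sketch: fire a hook when the agent instance is about to be
# terminated, giving an external process time to clean up the
# docker-machine instances it managed.
resource "aws_autoscaling_lifecycle_hook" "runner_terminating" {
  name                   = "gitlab-runner-terminating"       # assumed name
  autoscaling_group_name = aws_autoscaling_group.runner.name # assumed ASG resource
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 300        # seconds available for cleanup
  default_result         = "CONTINUE" # proceed with termination afterwards
}
```

The instance stays in `Terminating:Wait` until the heartbeat expires or the lifecycle action is completed, which is the window a cleanup process would use.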
@npalm Did updating the version of Terraform close this issue?
The problem is that the docker machines are managed by the agent running in the ASG. Simply terminating this instance by replacing the ASG leaves those instances unregistered, and they will never be removed. Did you notice what happens to active builds? This means there will be no events to listen for, but a lifecycle hook plus a Lambda could potentially clean up the docker-machine leftovers. Still, the problem remains that the machines could be busy processing CI workloads.
I had this problem some days ago while updating the CI infrastructure. My Terraform job suddenly died when the ASG was recreated. The lock on the state file was never released, so I assume the build was cancelled.
Let me detail @martinhanzik's and @npalm's solution:
- Add an aws_autoscaling_lifecycle_hook (event: autoscaling:EC2_INSTANCE_TERMINATING). The event metadata is forwarded to an SQS queue.
- Trigger a Lambda function which finds the associated runners (via tags, which have to be present on the autoscaling group as well) and kills them.

Without the runner agent the runners shouldn't be able to proceed. Maybe they finish the build, but they are not able to upload anything to the GitLab instance, are they? Should we do this?
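Wiring the pieces above together could look something like this. All resource names, the ASG reference, the IAM role, and the Lambda function are assumptions for illustration; only the resource types and the event flow (terminate event → SQS → Lambda) come from the proposal itself.

```hcl
# Sketch of the proposed pipeline: the terminating hook publishes the
# event metadata to SQS, and the queue drives a cleanup Lambda that
# finds runner instances by tag and terminates them.
resource "aws_sqs_queue" "asg_events" {
  name = "gitlab-runner-asg-events" # assumed name
}

resource "aws_autoscaling_lifecycle_hook" "terminating" {
  name                    = "runner-agent-terminating"        # assumed name
  autoscaling_group_name  = aws_autoscaling_group.runner.name # assumed ASG resource
  lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
  notification_target_arn = aws_sqs_queue.asg_events.arn
  role_arn                = aws_iam_role.asg_notify.arn # assumed role that may publish to SQS
  default_result          = "CONTINUE"
}

resource "aws_lambda_event_source_mapping" "cleanup" {
  event_source_arn = aws_sqs_queue.asg_events.arn
  function_name    = aws_lambda_function.cleanup_runners.arn # assumed cleanup Lambda
}
```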
@npalm Just tried it and killed the runner agent while a Java build was running. The jobs on the runner are immediately cancelled, and GitLab shows the job as "runner system failure". The docker container on the runner disappeared, so all jobs on the runner are cancelled. Thus the only problem left is to remove the runner instances as described above.
Yepz, the runners are not managed by Terraform; the default examples contain some scripting to cancel spot requests and instances during destroy. I do not have a good idea for a fix.
We are also facing this issue (though we haven't put any time into fixing it yet; everything else is working very well). An additional piece of information: the bundled script won't work for us because we have a multi-account strategy, and the box the script runs from is not in the same account as the target nodes. The script doesn't take a profile or a role to assume as an argument, so it seems to be intended for same-account use only.
@DavidGamba Sorry, I didn't get it 100%. Which machines are running in different accounts? We also have a multi-account strategy, so the CI system runs in a different account than the product it builds. But the complete CI system runs in one account, so it should be possible to kill the runners when the runner agent is killed.
@kayman-mk The machine we run our terraform plan/apply from is in a different account from the one the runners run in.
Only starting with scaling GitLab in AWS/Terraform, but wouldn't it be possible to add a version tag to the docker-machine runners and delete the instances by tag? Similar to:
@strowi This is not possible, as instance_state is read-only. But using tags to identify those machines is a good idea; it just needs an additional script, Lambda, or something else.
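If I read the module right, its `docker_machine_options` variable is forwarded to the amazonec2 docker-machine driver, so a version tag could be attached to every spawned runner that way. A sketch under that assumption; the variable name, module source, and tag key are taken from the public examples or invented for illustration:

```hcl
# Hypothetical: tag every docker-machine instance so an external cleanup
# script or Lambda can find (and terminate) leftovers by tag. The
# amazonec2 driver takes tags as comma-separated key,value pairs.
module "runner" {
  source = "npalm/gitlab-runner/aws" # assumed module source
  # ...

  docker_machine_options = [
    "amazonec2-tags=gitlab-runner-version,${var.runner_version}", # assumed variable
  ]
}
```

The terminate-by-tag part still needs the additional script or Lambda mentioned above; the tag only makes the orphans findable.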
@kayman-mk Ah right, thx! I misread the aws_instance docs about instance_state.
Hello guys, I encountered the bug as well, even when using the example script https://github.com/npalm/terraform-aws-gitlab-runner/blob/develop/bin/cancel-spot-instances.sh. I saw it during destroy, but it can occur while updating the ASG too: there is a small chance that, before your main agent runner gets removed/replaced, it spawns a spot instance. Since the cancel-spot script is called before that, this is hard to handle.

My current fix would be to add a depends_on for the ASG destruction on the null resource example https://github.com/npalm/terraform-aws-gitlab-runner/blob/8b15241b34c87d7b21c7a14fddf7d0b84f750f1b/examples/runner-default/main.tf#L106, but Terraform does not support referring to a module resource using depends_on, unfortunately.

@npalm I don't know if there is another way to make the cancel-spot script trigger only after deletion of the main instance or its ASG, but it would solve many issues while waiting for a clean solution. Regards
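Newer Terraform versions (0.13+) do allow depends_on on a module call, which could give the ordering asked for above: if the module depends on the null resource, Terraform destroys the module (and its ASG) first and the null resource last, so a destroy-time provisioner on it runs after the agent is gone. A hedged sketch; the script path and module source are assumptions:

```hcl
# Sketch (Terraform >= 0.13): reverse the destroy order so the cleanup
# script runs AFTER the ASG is destroyed, catching spot instances the
# agent spawned at the last moment.
resource "null_resource" "cancel_spot" {
  provisioner "local-exec" {
    when    = destroy
    command = "bin/cancel-spot-instances.sh" # assumed script path
  }
}

module "runner" {
  source = "npalm/gitlab-runner/aws" # assumed module source
  # ...

  # Created after, and therefore destroyed before, null_resource.cancel_spot.
  depends_on = [null_resource.cancel_spot]
}
```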
I am trying to implement a fully reproducible CI with N runners where each runner is recreated on a daily basis. |
@npalm What do you think about implementing a The only downside I can see with this approach is that jobs inevitably fail once the runner shuts down, although this is already happening. The next step could be figuring out a way to stop the runner from grabbing new jobs, wait for current jobs to finish/fail/whatever, and then turn it off.
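One hedged way to get that drain window with plain AWS building blocks (not something the module provides, as far as I can tell) is a terminating lifecycle hook with a generous heartbeat, during which an external script tells the runner to stop accepting jobs and waits for the running ones:

```hcl
# Sketch: hold the terminating instance in Terminating:Wait for up to an
# hour so a drain script can stop the runner from taking new jobs, wait
# for current jobs, then complete the lifecycle action early.
resource "aws_autoscaling_lifecycle_hook" "drain" {
  name                   = "gitlab-runner-drain"             # assumed name
  autoscaling_group_name = aws_autoscaling_group.runner.name # assumed ASG resource
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 3600       # drain window in seconds
  default_result         = "CONTINUE" # terminate anyway if the drain never reports back
}
```

The drain script itself (what signal to send the runner, how to complete the lifecycle action) is left open here; this only buys the time.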
# cancel-spot-instance.sh enhancement

I recently used the script to mitigate issues from cattle-ops#214, so I'm proposing some updates that could help:
- Remove the associated spot keypairs
- Ensure errors are returned, using pipefail
- Fix the null condition that causes a failure if no spot instance is running (unwanted)
If it can help, I pushed #359, which is the result of multiple tests. It does not fully solve the issue, but note that you have to think about removing the associated keypairs created for each spot instance.
My apologies for the repeated messages on this thread - I didn't realize that every time I rebased and pushed my branch to my fork with a commit message linking to this issue, it would show up here. :) I've submitted #392, which uses a Lambda function to terminate 'orphaned' instances when the instance in the ASG is terminated/refreshed. I've been testing this for a bit with a couple of different runner deployments with success, and I welcome any feedback. Like @dsalaza4, I'd like my GitLab runner deployments to be ephemeral and refreshed regularly, and this change, along with the auto-refresh setting (#385), seems to be doing that. However, the caveat still exists that a runner instance can be terminated while it's actively running a job.
This ☝️ feature (#392) was merged and included in the 4.40.0 release: https://github.com/npalm/terraform-aws-gitlab-runner/releases/tag/4.40.0 |
@martinhanzik This should be fixed now and the issue can be closed, can't it?
I'm unable to say so myself, as I'm not using the module at this time, but I will ask my colleagues if it's working OK for them. |
Seems to work fine for them, thanks for the new feature! Closing. |
If any change triggers a user_data update and an ASG recreation, the docker-machine instances are not terminated; only the manager is.
Configuration: