
Old runners not cleaned up after ASG update #214


Closed
martinhanzik opened this issue Apr 14, 2020 · 23 comments
Labels
bug 🐛 Something isn't working · help wanted Extra attention is needed

Comments

@martinhanzik

martinhanzik commented Apr 14, 2020

If any change triggers a user_data update and therefore an ASG recreation, the docker-machine instances are not terminated; only the manager is.

Configuration:

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.14.0"

  aws_region  = var.default_region
  environment = "runner"

  vpc_id                   = data.aws_vpc.base-vpc.id
  subnet_ids_gitlab_runner = [data.aws_subnet.base-vpc-private-eu-west-1a.id, data.aws_subnet.base-vpc-private-eu-west-1b.id]
  subnet_id_runners        = data.aws_subnet.base-vpc-private-eu-west-1a.id

  runners_name       = "runner"
  runners_gitlab_url = "https://gitlab.com"
  runners_request_spot_instance = false

  enable_runner_ssm_access = true

  runners_idle_count = 1

  runners_off_peak_periods = "[\"* * 0-7,19-23 * * mon-fri *\", \"* * * * * sat,sun *\"]"
  runners_off_peak_idle_count = 1
  runners_off_peak_idle_time  = 60

  cache_shared = true

  cache_bucket = {
    create = false
    policy = aws_iam_policy.runner-cache.arn
    bucket = aws_s3_bucket.runner-cache.arn
  }

  overrides = {
    name_sg = ""
    name_runner_agent_instance  = "runner-manager"
    name_docker_machine_runners = "runner"
  }

  gitlab_runner_registration_config = {
    registration_token = var.gitlab_runner_registration_token
    tag_list           = "docker"
    description        = "Runner"
    locked_to_project  = "false"
    run_untagged       = "false"
    maximum_timeout    = "3600"
  }

  ami_filter = {
    name = ["amzn2-ami-hvm-2.0.20200304.0-x86_64-ebs"]
  }

  runner_ami_filter = {
    name = ["ubuntu/images/hvm-ssd/ubuntu-eoan-19.10-amd64-server-20200325"]
  }

  userdata_pre_install     = "sudo yum install amazon-ecr-credential-helper -y"
}
@npalm
Collaborator

npalm commented Apr 14, 2020

@martinhanzik I think you are right. Personally I do not use enable_forced_updates, but since this stops the scaling group and terminates the instances in it, only the agent is terminated. The docker-machine instances will not be cleaned up. Any suggestion for a fix?

npalm added the bug 🐛 and help wanted labels Apr 14, 2020
@martinhanzik
Author

Maybe ASG lifecycle hooks (https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html) could be used: have a process listen for the termination notification and perform a clean shutdown of all the docker-machine instances.
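
A minimal Terraform sketch of that idea, assuming the agent's ASG name is known (names below are illustrative, not actual module resources):

resource "aws_autoscaling_lifecycle_hook" "runner_agent_terminating" {
  name                   = "gitlab-runner-agent-terminating"      # illustrative
  autoscaling_group_name = "runner-agent-asg"                     # illustrative
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result         = "CONTINUE"
  heartbeat_timeout      = 300

  # A separate process (Lambda, script, ...) would consume the termination
  # notification and shut down the docker-machine instances of this agent.
}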

@patramsey

@npalm did updating the version of Terraform close this issue?

npalm reopened this May 27, 2020
@npalm
Collaborator

npalm commented May 27, 2020

The problem is that the docker machines are managed by the agent running in the ASG. Simply terminating that instance by replacing the ASG leaves those machines unregistered, and they will never be removed. Did you notice what happens to active builds?

This means there are no events to listen for, but a lifecycle hook plus a Lambda could potentially clean up the docker-machine leftovers. Still, the problem remains that the machines could be busy processing CI workloads.

@kayman-mk
Collaborator

I had this problem some days ago while updating the CI infrastructure. My Terraform job suddenly died when the ASG was recreated. The lock on the state file was never released, so I assume the build was cancelled.

@kayman-mk
Collaborator

kayman-mk commented Aug 13, 2020

Let me detail @martinhanzik's and @npalm's solution:

Add an aws_autoscaling_lifecycle_hook (event: autoscaling:EC2_INSTANCE_TERMINATING). The event metadata is forwarded to an SQS queue, which triggers a Lambda function that finds the associated runners (via tags, which have to be present on the autoscaling group as well) and kills them.

Without the runner agent the runners shouldn't be able to proceed. Maybe they finish the build, but they are not able to upload anything to the GitLab instance, are they?

Should we do this?
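
A rough Terraform sketch of that wiring, assuming a cleanup Lambda function already exists (queue, role and function names are illustrative):

resource "aws_sqs_queue" "asg_termination_events" {
  name = "gitlab-runner-termination-events"
}

resource "aws_autoscaling_lifecycle_hook" "agent_terminating" {
  name                    = "gitlab-runner-agent-terminating"
  autoscaling_group_name  = "runner-agent-asg"                     # illustrative
  lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result          = "CONTINUE"
  heartbeat_timeout       = 300
  notification_target_arn = aws_sqs_queue.asg_termination_events.arn
  role_arn                = aws_iam_role.asg_notification.arn      # role allowing the ASG to publish to the queue (not shown)
}

# The Lambda consumes the event, finds the docker-machine instances by tag
# and terminates them.
resource "aws_lambda_event_source_mapping" "runner_cleanup" {
  event_source_arn = aws_sqs_queue.asg_termination_events.arn
  function_name    = aws_lambda_function.runner_cleanup.arn        # cleanup function (not shown)
}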

@kayman-mk
Collaborator

Still, the problem remains that the machines could be busy processing CI workloads

@npalm I just tried it and killed the runner agent while a Java build was running. The jobs on the runner are immediately cancelled and GitLab shows the job as "runner system failure". The Docker container on the runner disappeared. So all jobs on the runner are cancelled.

Thus the only problem is to remove the runner instances as described above.

@npalm
Collaborator

npalm commented Aug 13, 2020

Yep, the runners are not managed by Terraform; the default examples contain some scripting to cancel spot requests and instances during destroy. I do not have a good idea for a fix.

@DavidGamba
Contributor

We are also facing this issue (though we haven't put any time into fixing it yet; everything else is working very well).

The additional piece of information is that the bundled script won't work for us because we have a multi-account strategy and the box the script runs from is not in the same account as the target nodes. The script doesn't take a profile or a role to assume as an argument, so it seems to be intended only for same-account use.

@kayman-mk
Collaborator

kayman-mk commented Aug 13, 2020

@DavidGamba Sorry, I didn't get it 100%. Which machines are running in different accounts?

We also have a multi-account strategy, so the CI system runs in a different account than the product it builds. But the complete CI system runs in one account, so it should be possible to kill the runners when the runner agent is killed.

@DavidGamba
Contributor

@kayman-mk the machine we run our terraform plan/apply from is in a different account than the one the runners run in.
So without a profile or a role to assume we can't kill the runners when the ASG gets rebuilt.
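
If the cleanup were driven from Terraform or a Lambda instead of the bundled shell script, cross-account access could be configured with an assume_role block; a minimal sketch (account id, region and role name are illustrative):

provider "aws" {
  alias  = "ci"
  region = "eu-west-1"                                               # illustrative

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/ci-runner-cleanup"    # illustrative
  }
}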

@strowi

strowi commented Sep 29, 2020

I'm only starting with scaling GitLab runners in AWS/Terraform, but wouldn't it be possible to add a version tag to the docker-machine runners and delete the instances by tag?

Similar to:

date "aws_instance" "previous" {

  filter {
    tag   = "asg"
    values = ["asg-name"]
  }

resource "aws_instance" "previous_runners" {
  ...
  instance_state = "Terminated"
}

@kayman-mk
Collaborator

kayman-mk commented Sep 29, 2020

@strowi This is not possible, as instance_state is read-only. But using tags to identify those machines is a good idea; it just needs an additional script, Lambda, or something else.

@strowi

strowi commented Oct 1, 2020

@kayman-mk ah right, thanks! I misread the aws_instance docs regarding instance_state.

@florian0410
Contributor

florian0410 commented Mar 30, 2021

Hello guys,

I also encountered the bug, even when using the example script https://github.com/npalm/terraform-aws-gitlab-runner/blob/develop/bin/cancel-spot-instances.sh

I saw it during destroy, but it can occur while updating the ASG too: there is a small chance that, before your main agent runner gets removed/replaced, it spawns a spot instance. Since the cancel-spot.sh script is called before that, it is hard to handle. My current fix would be to add a depends_on between the ASG destruction and the null resource example https://github.com/npalm/terraform-aws-gitlab-runner/blob/8b15241b34c87d7b21c7a14fddf7d0b84f750f1b/examples/runner-default/main.tf#L106

But Terraform unfortunately does not support referring to a resource inside a module in depends_on.
In the destroy case, we get some unwanted behaviour, like the CloudWatch log group being created again and not fully removed since an instance still exists.

@npalm I don't know if there is another way to make your cancel-spot.sh script trigger only after the deletion of the main instance or its ASG, but it would solve many issues while waiting for a clean solution.

Regards
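
A hedged sketch of one possible ordering workaround, assuming Terraform 0.13 or later (where a module block accepts depends_on); making the module depend on the null_resource means the module, and with it the agent ASG, is destroyed first, so the destroy-time provisioner only runs afterwards. Resource names are illustrative:

resource "null_resource" "cancel_spot_instances" {
  provisioner "local-exec" {
    when    = destroy
    command = "bin/cancel-spot-instances.sh"   # the example script from the repository
  }
}

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.14.0"
  # ... configuration as in the issue description ...

  # Destroyed before null_resource.cancel_spot_instances, so the destroy
  # provisioner above only runs once the agent ASG is gone.
  depends_on = [null_resource.cancel_spot_instances]
}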

@dsalaza4

I am trying to implement a fully reproducible CI setup with N runners where each runner is recreated on a daily basis. This is currently the biggest obstacle to achieving complete reproducibility, as every time I recreate a runner, many workers are left orphaned.

@dsalaza4

@npalm What do you think about implementing a uuid tag for a runner and all of its associated workers, so we can easily identify all workers belonging to a runner? This would allow us to run a cleanup job every time a runner is either replaced or destroyed.

The only downside I can see with this approach is that jobs inevitably fail once the runner shuts down, although this is already happening.

The next step could be figuring out a way to stop the runner from grabbing new jobs, waiting for the current jobs to finish or fail, and then turning it off.
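
A hedged Terraform sketch of the tagging idea; the tag name is illustrative, and propagating the uuid to the agent and its workers would need support in the module, which is only indicated here:

resource "random_uuid" "runner_generation" {}

# The uuid would have to be applied as a tag (e.g. "gitlab-runner-generation")
# to the agent and to every docker-machine worker it spawns; that part is not
# shown and would require a module change.

# A cleanup job could then find all workers of a replaced runner by tag:
data "aws_instances" "workers_of_generation" {
  filter {
    name   = "tag:gitlab-runner-generation"
    values = [random_uuid.runner_generation.result]
  }
  instance_state_names = ["pending", "running"]
}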

florian0410 added a commit to florian0410/terraform-aws-gitlab-runner that referenced this issue Aug 26, 2021
# cancel-spot-instance.sh enhancement

I recently used the script to mitigate issues from cattle-ops#214, so I'm proposing some updates that could help.

- Remove the associated spot key pairs
- Ensure errors are returned using pipefail
- Fix the null condition that causes a failure when no spot instance is running (unwanted)
@florian0410
Contributor

If it can help,

I pushed #359, which is the result of multiple tests.

It does not fully solve the issue, but note that you have to think about removing the associated key pairs created for each spot instance.

npalm pushed a commit that referenced this issue Aug 28, 2021
# cancel-spot-instance.sh enhancement

I recently used the script to mitigate issues from #214, so I'm proposing some updates that could help.

- Remove the associated spot key pairs
- Ensure errors are returned using pipefail
- Fix the null condition that causes a failure when no spot instance is running (unwanted)
@joshbeard
Contributor

My apologies for the repeated messages on this thread; I didn't realize that every time I rebased and pushed my branch to my fork with a commit message linking to this issue, it would show up here. :)

I've submitted #392, which uses a Lambda function to terminate 'orphaned' instances when the instance in the ASG is terminated or refreshed. I've been testing this for a bit with a couple of different runner deployments, with success, and welcome any feedback. Like @dsalaza4, I'd like my GitLab runner deployments to be ephemeral and refreshed regularly, and this change, along with the auto-refresh setting (#385), seems to do that. However, the caveat still exists that a runner instance can be terminated while it's actively running a job.
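
For reference, the auto-refresh mentioned above corresponds to the ASG instance refresh feature; a minimal sketch of what that looks like on a plain aws_autoscaling_group (values are illustrative, not the module's actual resources):

resource "aws_autoscaling_group" "runner_agent" {
  name                = "runner-agent-asg"            # illustrative
  min_size            = 1
  max_size            = 1
  vpc_zone_identifier = ["subnet-0123456789abcdef0"]  # illustrative

  launch_template {
    id      = aws_launch_template.agent.id            # agent launch template (not shown)
    version = "$Latest"
  }

  # Replace the agent instance automatically whenever the launch template changes.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 0
    }
  }
}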

@joshbeard
Contributor

This ☝️ feature (#392) was merged and included in the 4.40.0 release: https://github.com/npalm/terraform-aws-gitlab-runner/releases/tag/4.40.0

@kayman-mk
Collaborator

@martinhanzik This should be fixed now, so the issue can be closed, right?

@martinhanzik
Author

I'm unable to say so myself, as I'm not using the module at this time, but I will ask my colleagues if it's working OK for them.

@martinhanzik
Author

Seems to work fine for them, thanks for the new feature! Closing.
