
Old runners not cleaned up after ASG update #214


Closed
martinhanzik opened this issue Apr 14, 2020 · 23 comments
Labels
bug 🐛 Something isn't working · help wanted Extra attention is needed

Comments

@martinhanzik

martinhanzik commented Apr 14, 2020

If any change triggers a user_data update and therefore an ASG recreation, the docker-machine instances are not terminated; only the manager is.

Configuration:

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.14.0"

  aws_region  = var.default_region
  environment = "runner"

  vpc_id                   = data.aws_vpc.base-vpc.id
  subnet_ids_gitlab_runner = [data.aws_subnet.base-vpc-private-eu-west-1a.id, data.aws_subnet.base-vpc-private-eu-west-1b.id]
  subnet_id_runners        = data.aws_subnet.base-vpc-private-eu-west-1a.id

  runners_name       = "runner"
  runners_gitlab_url = "https://gitlab.com"
  runners_request_spot_instance = false

  enable_runner_ssm_access = true

  runners_idle_count = 1

  runners_off_peak_periods = "[\"* * 0-7,19-23 * * mon-fri *\", \"* * * * * sat,sun *\"]"
  runners_off_peak_idle_count = 1
  runners_off_peak_idle_time  = 60

  cache_shared = true

  cache_bucket = {
    create = false
    policy = aws_iam_policy.runner-cache.arn
    bucket = aws_s3_bucket.runner-cache.arn
  }

  overrides = {
    name_sg = ""
    name_runner_agent_instance  = "runner-manager"
    name_docker_machine_runners = "runner"
  }

  gitlab_runner_registration_config = {
    registration_token = var.gitlab_runner_registration_token
    tag_list           = "docker"
    description        = "Runner"
    locked_to_project  = "false"
    run_untagged       = "false"
    maximum_timeout    = "3600"
  }

  ami_filter = {
    name = ["amzn2-ami-hvm-2.0.20200304.0-x86_64-ebs"]
  }

  runner_ami_filter = {
    name = ["ubuntu/images/hvm-ssd/ubuntu-eoan-19.10-amd64-server-20200325"]
  }

  userdata_pre_install     = "sudo yum install amazon-ecr-credential-helper -y"
}
@npalm
Collaborator

npalm commented Apr 14, 2020

@martinhanzik I think you are right. Personally I do not use enable_forced_updates, but since this stops the scaling group and terminates the instances in it, only the agent is terminated. The docker-machine instances will not be cleaned up. Any suggestion for a fix?

npalm added the bug 🐛 and help wanted labels Apr 14, 2020
@martinhanzik
Author

Maybe ASG lifecycle hooks (https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html) could be used: have a process listen for the termination notification and perform a clean shutdown of all the docker-machine instances.
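
A minimal Terraform sketch of that idea, assuming the agent's ASG name is known (names below are illustrative, not actual module resources):

resource "aws_autoscaling_lifecycle_hook" "runner_agent_terminating" {
  name                   = "gitlab-runner-agent-terminating"      # illustrative
  autoscaling_group_name = "runner-agent-asg"                     # illustrative
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result         = "CONTINUE"
  heartbeat_timeout      = 300

  # A separate process (Lambda, script, ...) would consume the termination
  # notification and shut down the docker-machine instances of this agent.
}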

@patramsey

@npalm did updating the version of Terraform close this issue?

npalm reopened this May 27, 2020
@npalm
Collaborator

npalm commented May 27, 2020

The problem is that the docker machines are managed by the agent running in the ASG. Simply terminating that instance by replacing the ASG leaves those machines unregistered, and they will never be removed. Did you notice what happens to active builds?

This means there are no events to listen for, but a lifecycle hook plus a Lambda could potentially clean up the docker-machine leftovers. Still, the problem remains that the machines could be busy processing CI workloads.

@kayman-mk
Collaborator

I had this problem some days ago while updating the CI infrastructure. My Terraform job suddenly died when the ASG was recreated. The lock on the state file was never released, so I assume the build was cancelled.

@kayman-mk
Collaborator

kayman-mk commented Aug 13, 2020

Let me detail @martinhanzik's and @npalm's solution:

Add an aws_autoscaling_lifecycle_hook (event: autoscaling:EC2_INSTANCE_TERMINATING). The event metadata is forwarded to an SQS queue, which triggers a Lambda function that finds the associated runners (via tags, which have to be present on the autoscaling group as well) and kills them.

Without the runner agent the runners shouldn't be able to proceed. Maybe they finish the build, but they are not able to upload anything to the GitLab instance, are they?

Should we do this?
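
A rough Terraform sketch of that wiring, assuming a cleanup Lambda function already exists (queue, role and function names are illustrative):

resource "aws_sqs_queue" "asg_termination_events" {
  name = "gitlab-runner-termination-events"
}

resource "aws_autoscaling_lifecycle_hook" "agent_terminating" {
  name                    = "gitlab-runner-agent-terminating"
  autoscaling_group_name  = "runner-agent-asg"                     # illustrative
  lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result          = "CONTINUE"
  heartbeat_timeout       = 300
  notification_target_arn = aws_sqs_queue.asg_termination_events.arn
  role_arn                = aws_iam_role.asg_notification.arn      # role allowing the ASG to publish to the queue (not shown)
}

# The Lambda consumes the event, finds the docker-machine instances by tag
# and terminates them.
resource "aws_lambda_event_source_mapping" "runner_cleanup" {
  event_source_arn = aws_sqs_queue.asg_termination_events.arn
  function_name    = aws_lambda_function.runner_cleanup.arn        # cleanup function (not shown)
}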

@kayman-mk
Collaborator

Still, the problem remains that the machines could be busy processing CI workloads

@npalm I just tried it and killed the runner agent while a Java build was running. The jobs on the runner are immediately cancelled and GitLab shows the job as "runner system failure". The Docker container on the runner disappeared. So all jobs on the runner are cancelled.

Thus the only problem is to remove the runner instances as described above.

@npalm
Collaborator

npalm commented Aug 13, 2020

Yep, the runners are not managed by Terraform; the default examples contain some scripting to cancel spot requests and instances during destroy. I do not have a good idea for a fix.

@DavidGamba
Contributor

We are also facing this issue (though we haven't put any time into fixing it yet; everything else is working very well).

The additional piece of information is that the bundled script won't work for us because we have a multi-account strategy and the box the script runs from is not in the same account as the target nodes. The script doesn't take a profile or a role to assume as an argument, so it seems to be intended only for same-account use.

@kayman-mk
Collaborator

kayman-mk commented Aug 13, 2020

@DavidGamba Sorry, I didn't get it 100%. Which machines are running in different accounts?

We also have a multi-account strategy, so the CI system runs in a different account than the product it builds. But the complete CI system runs in one account, so it should be possible to kill the runners when the runner agent is killed.

@DavidGamba
Contributor

@kayman-mk the machine we run our terraform plan/apply from is in a different account than the one the runners run in.
So without a profile or a role to assume we can't kill the runners when the ASG gets rebuilt.
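
If the cleanup were driven from Terraform or a Lambda instead of the bundled shell script, cross-account access could be configured with an assume_role block; a minimal sketch (account id, region and role name are illustrative):

provider "aws" {
  alias  = "ci"
  region = "eu-west-1"                                               # illustrative

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/ci-runner-cleanup"    # illustrative
  }
}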

@strowi

strowi commented Sep 29, 2020

I'm only starting with scaling GitLab runners in AWS/Terraform, but wouldn't it be possible to add a version tag to the docker-machine runners and delete the instances by tag?

Similar to:

date "aws_instance" "previous" {

  filter {
    tag   = "asg"
    values = ["asg-name"]
  }

resource "aws_instance" "previous_runners" {
  ...
  instance_state = "Terminated"
}

@kayman-mk
Collaborator

kayman-mk commented Sep 29, 2020

@strowi This is not possible, as instance_state is read-only. But using tags to identify those machines is a good idea; it just needs an additional script, Lambda, or something else.

@strowi

strowi commented Oct 1, 2020

@kayman-mk ah right, thanks! I misread the aws_instance docs regarding instance_state.

@florian0410
Contributor

florian0410 commented Mar 30, 2021

Hello guys,

I also encountered the bug, even when using the example script https://github.com/npalm/terraform-aws-gitlab-runner/blob/develop/bin/cancel-spot-instances.sh

I saw it during destroy, but it can occur while updating the ASG too: there is a small chance that, before your main agent runner gets removed/replaced, it spawns a spot instance. Since the cancel-spot.sh script is called before that, it is hard to handle. My current fix would be to add a depends_on between the ASG destruction and the null resource example https://github.com/npalm/terraform-aws-gitlab-runner/blob/8b15241b34c87d7b21c7a14fddf7d0b84f750f1b/examples/runner-default/main.tf#L106

But Terraform unfortunately does not support referring to a resource inside a module in depends_on.
In the destroy case, we get some unwanted behaviour, like the CloudWatch log group being created again and not fully removed since an instance still exists.

@npalm I don't know if there is another way to make your cancel-spot.sh script trigger only after the deletion of the main instance or its ASG, but it would solve many issues while waiting for a clean solution.

Regards
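
A hedged sketch of one possible ordering workaround, assuming Terraform 0.13 or later (where a module block accepts depends_on); making the module depend on the null_resource means the module, and with it the agent ASG, is destroyed first, so the destroy-time provisioner only runs afterwards. Resource names are illustrative:

resource "null_resource" "cancel_spot_instances" {
  provisioner "local-exec" {
    when    = destroy
    command = "bin/cancel-spot-instances.sh"   # the example script from the repository
  }
}

module "runner" {
  source  = "npalm/gitlab-runner/aws"
  version = "4.14.0"
  # ... configuration as in the issue description ...

  # Destroyed before null_resource.cancel_spot_instances, so the destroy
  # provisioner above only runs once the agent ASG is gone.
  depends_on = [null_resource.cancel_spot_instances]
}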

@dsalaza4

I am trying to implement a fully reproducible CI setup with N runners where each runner is recreated on a daily basis. This is currently the biggest obstacle to achieving complete reproducibility, as every time I recreate a runner, many workers are left orphaned.

@dsalaza4

@npalm What do you think about implementing a uuid tag for a runner and all of its associated workers, so we can easily identify all workers belonging to a runner? This would allow us to run a cleanup job every time a runner is either replaced or destroyed.

The only downside I can see with this approach is that jobs inevitably fail once the runner shuts down, although this is already happening.

The next step could be figuring out a way to stop the runner from grabbing new jobs, waiting for the current jobs to finish or fail, and then turning it off.
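
A hedged Terraform sketch of the tagging idea; the tag name is illustrative, and propagating the uuid to the agent and its workers would need support in the module, which is only indicated here:

resource "random_uuid" "runner_generation" {}

# The uuid would have to be applied as a tag (e.g. "gitlab-runner-generation")
# to the agent and to every docker-machine worker it spawns; that part is not
# shown and would require a module change.

# A cleanup job could then find all workers of a replaced runner by tag:
data "aws_instances" "workers_of_generation" {
  filter {
    name   = "tag:gitlab-runner-generation"
    values = [random_uuid.runner_generation.result]
  }
  instance_state_names = ["pending", "running"]
}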

florian0410 added a commit to florian0410/terraform-aws-gitlab-runner that referenced this issue Aug 26, 2021
# cancel-spot-instance.sh enhancement

I recently used the script to mitigate issues from cattle-ops#214, so I'm proposing some updates that could help.

- Remove the associated spot key pairs
- Ensure errors are returned using pipefail
- Fix the null condition that causes a failure when no spot instance is running (unwanted)
@florian0410
Contributor

If it can help,

I pushed #359, which is the result of multiple tests.

It does not fully solve the issue, but note that you have to think about removing the associated key pairs created for each spot instance.

npalm pushed a commit that referenced this issue Aug 28, 2021
# cancel-spot-instance.sh enhancement

I recently used the script to mitigate issues from #214, so I'm proposing some updates that could help.

- Remove the associated spot key pairs
- Ensure errors are returned using pipefail
- Fix the null condition that causes a failure when no spot instance is running (unwanted)
@joshbeard
Contributor

My apologies for the repeated messages on this thread; I didn't realize that every time I rebased and pushed my branch to my fork with a commit message linking to this issue, it would show up here. :)

I've submitted #392, which uses a Lambda function to terminate 'orphaned' instances when the instance in the ASG is terminated or refreshed. I've been testing this for a bit with a couple of different runner deployments, with success, and welcome any feedback. Like @dsalaza4, I'd like my GitLab runner deployments to be ephemeral and refreshed regularly, and this change, along with the auto-refresh setting (#385), seems to do that. However, the caveat still exists that a runner instance can be terminated while it's actively running a job.
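
For reference, the auto-refresh mentioned above corresponds to the ASG instance refresh feature; a minimal sketch of what that looks like on a plain aws_autoscaling_group (values are illustrative, not the module's actual resources):

resource "aws_autoscaling_group" "runner_agent" {
  name                = "runner-agent-asg"            # illustrative
  min_size            = 1
  max_size            = 1
  vpc_zone_identifier = ["subnet-0123456789abcdef0"]  # illustrative

  launch_template {
    id      = aws_launch_template.agent.id            # agent launch template (not shown)
    version = "$Latest"
  }

  # Replace the agent instance automatically whenever the launch template changes.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 0
    }
  }
}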

@joshbeard
Contributor

This ☝️ feature (#392) was merged and included in the 4.40.0 release: https://github.com/npalm/terraform-aws-gitlab-runner/releases/tag/4.40.0

@kayman-mk
Collaborator

@martinhanzik This should be fixed now, so the issue can be closed, right?

@martinhanzik
Author

I'm unable to say so myself, as I'm not using the module at this time, but I will ask my colleagues if it's working OK for them.

@martinhanzik
Author

Seems to work fine for them, thanks for the new feature! Closing.
