Runners getting stuck: RPC: Failed to dial the plugin server in 10s #1269

Open

skippaydevs opened this issue Mar 20, 2025 · 0 comments
Describe the bug

We are experiencing periodic failures with our GitLab Runner setup on AWS using the cattle-ops/gitlab-runner/aws Terraform module.

Configuration

module "general_runner_aarch64_spot_fleet" {
  source  = "cattle-ops/gitlab-runner/aws"
  version = "~> 9.1"

  environment = local.runners["aarch64"].name

  vpc_id    = module.vpc.vpc_id
  subnet_id = module.vpc.private_subnets[0]

  runner_worker_cache = {
    shared        = true
    bucket        = module.general_runner_cache.bucket
    create        = false
    policy        = module.general_runner_cache.policy_arn
    random_suffix = false
  }

  runner_worker_docker_machine_ami_filter = {
    name = ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-arm64-server-*"]
  }

  runner_worker_docker_options = {
    privileged = true
    volumes    = ["/cache", "/certs/client"]
  }

  runner_instance = {
    type       = "t3a.micro"
    ssm_access = true
    name       = "docker-machine"
  }

  runner_role = {
    role_profile_name = local.runners["aarch64"].name
    policy_arns       = [aws_iam_policy.gitlab_cache_ecr_access.arn]
  }

  runner_manager = {
    maximum_concurrent_jobs = 100
  }

  runner_install = {
    amazon_ecr_credential_helper = true
  }

  runner_gitlab = {
    runner_version                                = var.gitlab_runner_version
    url                                           = var.gitlab_instance_url
    preregistered_runner_token_ssm_parameter_name = ...
  }

  runner_worker = {
    environment_variables = [for x in local.runner_environment_vars : "${x.name}=${x.value}"]
  }

  runner_worker_docker_machine_instance = {
    root_size   = 40
    volume_type = "gp3"
    types       = ["c8g.2xlarge", "c8g.4xlarge", "r8g.2xlarge"]
    subnet_ids  = module.vpc.private_subnets
  }

  runner_worker_docker_machine_instance_spot = {
    max_price = "on-demand-price"
  }

  runner_networking = {
    allow_incoming_ping_security_group_ids = [data.aws_security_group.default.id]
  }

  runner_worker_docker_machine_role = {
    policy_arns = [aws_iam_policy.gitlab_cache_ecr_access.arn]
    tag_list    = join(",", concat(var.gitlab_runner_tags, ["aarch64", "general"]))
  }

  runner_worker_docker_machine_fleet = {
    enable = true
  }
}
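
For context, the identifiers referenced above (local.runners, local.runner_environment_vars, aws_iam_policy.gitlab_cache_ecr_access, ...) are defined elsewhere in our configuration. A simplified sketch with placeholder values rather than our real ones:

# Illustrative only: the real definitions live elsewhere in our configuration.
locals {
  runners = {
    aarch64 = { name = "gitlab-runner-aarch64" }
  }

  # Rendered into runner_worker.environment_variables as NAME=value pairs.
  runner_environment_vars = [
    { name = "DOCKER_DRIVER", value = "overlay2" },
  ]
}

# Extra permissions attached to the manager and the workers (cache + ECR pulls).
resource "aws_iam_policy" "gitlab_cache_ecr_access" {
  name = "gitlab-runner-cache-ecr-access"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ecr:GetAuthorizationToken", "ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"]
        Resource = "*"
      }
    ]
  })
}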

Issue

Approximately every two weeks, we observe a large number of the following error messages in the logs:

Error attempting to get plugin server address for RPC: Failed to dial the plugin server in 10s

When this error occurs:

  • The number of EC2 instances spikes to the configured maximum.
  • Instances appear to be stuck: they are not processing any jobs and are also not being automatically removed.
  • All jobs in GitLab become stuck with the error:
    ERROR: Preparation failed: exit status 1
    
  • Manually terminating the affected EC2 instances sometimes restores normal operation, but the problem keeps recurring for a period of time, so we have to keep removing stuck instances manually until it subsides.

To Reproduce

The issue is intermittent and occurs roughly every two weeks. We have not yet found a reliable way to reproduce it, but every occurrence coincides exactly with the RPC error shown above.
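
One way to at least detect recurrences quickly is to alert on the error string in the runner manager's CloudWatch log group. A rough sketch follows; the log group name is an assumption (we use the environment name here as a placeholder), so it would need to point at whatever log group the module actually creates:

# Rough sketch: raise an alarm when the RPC error appears in the runner's logs.
# The log_group_name is a placeholder; point it at the module's actual log group.
resource "aws_cloudwatch_log_metric_filter" "plugin_rpc_failure" {
  name           = "gitlab-runner-plugin-rpc-failure"
  log_group_name = local.runners["aarch64"].name
  pattern        = "\"Failed to dial the plugin server\""

  metric_transformation {
    name      = "PluginRpcFailures"
    namespace = "GitLabRunner"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "plugin_rpc_failure" {
  alarm_name          = "gitlab-runner-plugin-rpc-failure"
  namespace           = "GitLabRunner"
  metric_name         = "PluginRpcFailures"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
}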

Expected behavior

EC2 instances should not get stuck randomly. Runners should clean up properly and continue processing jobs without requiring manual intervention.
