Runners getting stuck: RPC: Failed to dial the plugin server in 10s #1269

Open

skippaydevs opened this issue Mar 20, 2025 · 0 comments
Describe the bug

We are experiencing periodic failures with our GitLab Runner setup on AWS using the cattle-ops/gitlab-runner/aws Terraform module.

Configuration

module "general_runner_aarch64_spot_fleet" {
  source  = "cattle-ops/gitlab-runner/aws"
  version = "~> 9.1"

  environment = local.runners["aarch64"].name

  vpc_id    = module.vpc.vpc_id
  subnet_id = module.vpc.private_subnets[0]

  runner_worker_cache = {
    shared        = true
    bucket        = module.general_runner_cache.bucket
    create        = false
    policy        = module.general_runner_cache.policy_arn
    random_suffix = false
  }

  runner_worker_docker_machine_ami_filter = {
    name = ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-arm64-server-*"]
  }

  runner_worker_docker_options = {
    privileged = true
    volumes    = ["/cache", "/certs/client"]
  }

  runner_instance = {
    type       = "t3a.micro"
    ssm_access = true
    name       = "docker-machine"
  }

  runner_role = {
    role_profile_name = local.runners["aarch64"].name
    policy_arns       = [aws_iam_policy.gitlab_cache_ecr_access.arn]
  }

  runner_manager = {
    maximum_concurrent_jobs = 100
  }

  runner_install = {
    amazon_ecr_credential_helper = true
  }

  runner_gitlab = {
    runner_version                                = var.gitlab_runner_version
    url                                           = var.gitlab_instance_url
    preregistered_runner_token_ssm_parameter_name = ...
  }

  runner_worker = {
    environment_variables = [for x in local.runner_environment_vars : "${x.name}=${x.value}"]
  }

  runner_worker_docker_machine_instance = {
    root_size   = 40
    volume_type = "gp3"
    types       = ["c8g.2xlarge", "c8g.4xlarge", "r8g.2xlarge"]
    subnet_ids  = module.vpc.private_subnets
  }

  runner_worker_docker_machine_instance_spot = {
    max_price = "on-demand-price"
  }

  runner_networking = {
    allow_incoming_ping_security_group_ids = [data.aws_security_group.default.id]
  }

  runner_worker_docker_machine_role = {
    policy_arns = [aws_iam_policy.gitlab_cache_ecr_access.arn]
    tag_list    = join(",", concat(var.gitlab_runner_tags, ["aarch64", "general"]))
  }

  runner_worker_docker_machine_fleet = {
    enable = true
  }
}
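
For context, the identifiers referenced above (local.runners, local.runner_environment_vars, aws_iam_policy.gitlab_cache_ecr_access, ...) are defined elsewhere in our configuration. A simplified sketch with placeholder values rather than our real ones:

# Illustrative only: the real definitions live elsewhere in our configuration.
locals {
  runners = {
    aarch64 = { name = "gitlab-runner-aarch64" }
  }

  # Rendered into runner_worker.environment_variables as NAME=value pairs.
  runner_environment_vars = [
    { name = "DOCKER_DRIVER", value = "overlay2" },
  ]
}

# Extra permissions attached to the manager and the workers (cache + ECR pulls).
resource "aws_iam_policy" "gitlab_cache_ecr_access" {
  name = "gitlab-runner-cache-ecr-access"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ecr:GetAuthorizationToken", "ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"]
        Resource = "*"
      }
    ]
  })
}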

Issue

Approximately every two weeks, we observe a large number of the following error messages in the logs:

Error attempting to get plugin server address for RPC: Failed to dial the plugin server in 10s

When this error occurs:

  • The number of EC2 instances spikes to the configured maximum.
  • Instances appear to be stuck: they are not processing any jobs and are also not being automatically removed.
  • All jobs in GitLab become stuck with the error:
    ERROR: Preparation failed: exit status 1
    
  • Manually terminating the affected EC2 instances sometimes restores normal operation, but the problem keeps recurring for a period of time, so we have to keep removing stuck instances manually until it subsides.

To Reproduce

The issue is intermittent and occurs roughly every two weeks. We have not yet found a reliable way to reproduce it, but every occurrence coincides exactly with the RPC error shown above.
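
One way to at least detect recurrences quickly is to alert on the error string in the runner manager's CloudWatch log group. A rough sketch follows; the log group name is an assumption (we use the environment name here as a placeholder), so it would need to point at whatever log group the module actually creates:

# Rough sketch: raise an alarm when the RPC error appears in the runner's logs.
# The log_group_name is a placeholder; point it at the module's actual log group.
resource "aws_cloudwatch_log_metric_filter" "plugin_rpc_failure" {
  name           = "gitlab-runner-plugin-rpc-failure"
  log_group_name = local.runners["aarch64"].name
  pattern        = "\"Failed to dial the plugin server\""

  metric_transformation {
    name      = "PluginRpcFailures"
    namespace = "GitLabRunner"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "plugin_rpc_failure" {
  alarm_name          = "gitlab-runner-plugin-rpc-failure"
  namespace           = "GitLabRunner"
  metric_name         = "PluginRpcFailures"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
}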

Expected behavior

EC2 instances should not get stuck randomly. Runners should clean up properly and continue processing jobs without requiring manual intervention.
