This repository was archived by the owner on Jan 16, 2025. It is now read-only.

Commit 6cde62c

feat: mark orphan runners before removing them (#4001)
## Problem

Orphan runners are deleted right after detection. This can clash with self-terminating (ephemeral) runners. Typically the runner waits a few seconds before executing a self-termination.

## Solution

In this solution we first mark a runner as orphan, but do not delete it. In the next cycle of the scale-down function, all runners marked as orphan are terminated first.

## Improvements

- Improved logging: the main flow is logged only once at info. All other logs moved to debug
- Scale-down write permissions limited to the environment

## Todo

- [x] Update docs
- [x] Test default runner deployment
- [x] Test multi-runner deployment

## Example of log

- Two instances
- One made orphan by removing the runner from GitHub
- In the log
  - Idle runner got removed
  - Orphan gets marked as orphan
  - Next cycle orphan terminated.

<img width="1283" alt="image" src="https://github.com/user-attachments/assets/c7cb5372-f32c-4fc4-81bc-8aacec2a483f">
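The mark-then-terminate flow described in the commit message can be sketched as a pure decision function. This is a minimal illustration, not the code from this commit: the `Instance` shape, `planOrphanHandling`, and `minimumBootTimeMs` are hypothetical names, and it assumes registered runners can be matched to instances by instance ID.

```typescript
// Hypothetical shape for illustration; the real RunnerList has more fields.
interface Instance {
  instanceId: string;
  launchTime: Date;
  orphan: boolean; // true when the instance carries the tag ghr:orphan=true
}

interface ScaleDownPlan {
  terminate: string[]; // marked orphan in an earlier cycle
  markOrphan: string[]; // newly detected: tag now, revisit in the next cycle
}

// Assumption: registeredInGitHub holds the instance IDs of runners known to GitHub.
function planOrphanHandling(
  instances: Instance[],
  registeredInGitHub: Set<string>,
  now: Date,
  minimumBootTimeMs: number,
): ScaleDownPlan {
  const result: ScaleDownPlan = { terminate: [], markOrphan: [] };
  for (const instance of instances) {
    if (instance.orphan) {
      // Already marked in a previous cycle, so it is terminated first.
      result.terminate.push(instance.instanceId);
      continue;
    }
    const ageMs = now.getTime() - instance.launchTime.getTime();
    if (!registeredInGitHub.has(instance.instanceId) && ageMs > minimumBootTimeMs) {
      // Only mark; a self-terminating runner gets one more cycle to finish.
      result.markOrphan.push(instance.instanceId);
    }
  }
  return result;
}

const launchedEarly = new Date('2024-01-01T00:00:00Z');
const examplePlan = planOrphanHandling(
  [
    { instanceId: 'i-marked', launchTime: launchedEarly, orphan: true },
    { instanceId: 'i-unregistered', launchTime: launchedEarly, orphan: false },
    { instanceId: 'i-booting', launchTime: new Date('2024-01-01T00:09:00Z'), orphan: false },
    { instanceId: 'i-healthy', launchTime: launchedEarly, orphan: false },
  ],
  new Set(['i-healthy']),
  new Date('2024-01-01T00:10:00Z'),
  5 * 60 * 1000,
);
console.log(examplePlan);
```

In this sketch `i-booting` is left alone because it is younger than the assumed boot time, which mirrors the "minimal boot time" wording in the docs change.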
1 parent 3dbd40c commit 6cde62c

10 files changed: +204 -47 lines changed

docs/index.md

+2 -2
@@ -46,7 +46,7 @@ The "Scale Up Runner" Lambda actively monitors the SQS queue, processing incomin
 
 The Lambda first requests a JIT configuration or registration token from GitHub, which is needed later by the runner to register itself. This avoids the case that the EC2 instance, which later in the process will install the agent, needs administration permissions to register the runner. Next, the EC2 spot instance is created via the launch template. The launch template defines the specifications of the required instance and contains a [`user_data`](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html) script. This script will install the required software and configure it. The configuration for the runner is shared via EC2 tags and the parameter store (SSM), from which the user data script will fetch it and delete it once it has been retrieved. Once the user data script is finished, the action runner should be online, and the workflow will start in seconds.
 
-The current method for scaling down runners employs a straightforward approach: at predefined intervals, the Lambda conducts a thorough examination of each runner (instance) to assess its activity. If a runner is found to be idle, it is deregistered from GitHub, and the associated AWS instance is terminated. For ephemeral runners the the instance is terminated immediately after the workflow is finished. To avoid orphaned runners the scale down lambda is active in this cae as well.
+The current method for scaling down runners employs a straightforward approach: at predefined intervals, the Lambda conducts a thorough examination of each runner (instance) to assess its activity. If a runner is found to be idle, it is deregistered from GitHub, and the associated AWS instance is terminated. For ephemeral runners the the instance is terminated immediately after the workflow is finished. Instances not registered in GitHub as a runner after a minimal boot time will be marked orphan and removed in a next cycle. To avoid orphaned runners the scale down lambda is active in this cae as well.
 
 ### Pool
 
@@ -68,7 +68,7 @@ The AMI cleaner is a lambda that will clean up AMIs that are older than a config
 
 > This feature is Beta, changes will not trigger a major release as long in beta.
 
-The Instance Termination Watcher is creating log and optional metrics for termination of instances. Currently only spot termination warnings are watched. See [configuration](configuration/) for more details.
+The Instance Termination Watcher is creating log and optional metrics for termination of instances. Currently only spot termination warnings are watched. See [configuration](configuration/) for more details.
 
 ### Security
 

lambdas/functions/control-plane/jest.config.ts

+4 -4
@@ -6,10 +6,10 @@ const config: Config = {
   ...defaultConfig,
   coverageThreshold: {
     global: {
-      statements: 97.79,
-      branches: 96.13,
-      functions: 95.4,
-      lines: 98.06,
+      statements: 98.01,
+      branches: 97.28,
+      functions: 95.6,
+      lines: 97.94,
     },
   },
 };

lambdas/functions/control-plane/package.json

+1 -1
@@ -8,7 +8,7 @@
     "test": "NODE_ENV=test nx test",
     "test:watch": "NODE_ENV=test nx test --watch",
     "lint": "yarn eslint src",
-    "watch": "ts-node-dev --respawn --exit-child --files src/local.ts",
+    "watch": "ts-node-dev --respawn --exit-child --files src/local-down.ts",
     "build": "ncc build src/lambda.ts -o dist",
     "dist": "yarn build && cd dist && zip ../runners.zip index.js",
     "format": "prettier --write \"**/*.ts\"",

lambdas/functions/control-plane/src/aws/runners.d.ts

+2
@@ -9,6 +9,7 @@ export interface RunnerList {
   type?: string;
   repo?: string;
   org?: string;
+  orphan?: boolean;
 }
 
 export interface RunnerInfo {
@@ -22,6 +23,7 @@ export interface ListRunnerFilters {
   runnerType?: RunnerType;
   runnerOwner?: string;
   environment?: string;
+  orphan?: boolean;
   statuses?: string[];
 }
 

lambdas/functions/control-plane/src/aws/runners.test.ts

+56 -1
@@ -3,6 +3,7 @@ import {
   CreateFleetCommandInput,
   CreateFleetInstance,
   CreateFleetResult,
+  CreateTagsCommand,
   DefaultTargetCapacityType,
   DescribeInstancesCommand,
   DescribeInstancesResult,
@@ -16,7 +17,7 @@ import { mockClient } from 'aws-sdk-client-mock';
 import 'aws-sdk-client-mock-jest';
 
 import ScaleError from './../scale-runners/ScaleError';
-import { createRunner, listEC2Runners, terminateRunner } from './runners';
+import { createRunner, listEC2Runners, tag, terminateRunner } from './runners';
 import { RunnerInfo, RunnerInputParameters, RunnerType } from './runners.d';
 
 process.env.AWS_REGION = 'eu-east-1';
@@ -67,6 +68,23 @@ describe('list instances', () => {
       launchTime: new Date('2020-10-10T14:48:00.000+09:00'),
       type: 'Org',
       owner: 'CoderToCat',
+      orphan: false,
+    });
+  });
+
+  it('check orphan tag.', async () => {
+    const instances: DescribeInstancesResult = mockRunningInstances;
+    instances.Reservations![0].Instances![0].Tags!.push({ Key: 'ghr:orphan', Value: 'true' });
+    mockEC2Client.on(DescribeInstancesCommand).resolves(instances);
+
+    const resp = await listEC2Runners();
+    expect(resp.length).toBe(1);
+    expect(resp).toContainEqual({
+      instanceId: instances.Reservations![0].Instances![0].InstanceId!,
+      launchTime: instances.Reservations![0].Instances![0].LaunchTime!,
+      type: 'Org',
+      owner: 'CoderToCat',
+      orphan: true,
     });
   });
 
@@ -114,6 +132,23 @@ describe('list instances', () => {
     });
   });
 
+  it('filters instances on environment and orphan', async () => {
+    mockRunningInstances.Reservations![0].Instances![0].Tags!.push({
+      Key: 'ghr:orphan',
+      Value: 'true',
+    });
+    mockEC2Client.on(DescribeInstancesCommand).resolves(mockRunningInstances);
+    await listEC2Runners({ environment: ENVIRONMENT, orphan: true });
+    expect(mockEC2Client).toHaveReceivedCommandWith(DescribeInstancesCommand, {
+      Filters: [
+        { Name: 'instance-state-name', Values: ['running', 'pending'] },
+        { Name: 'tag:ghr:environment', Values: [ENVIRONMENT] },
+        { Name: 'tag:ghr:orphan', Values: ['true'] },
+        { Name: 'tag:ghr:Application', Values: ['github-action-runner'] },
+      ],
+    });
+  });
+
   it('No instances, undefined reservations list.', async () => {
     const noInstances: DescribeInstancesResult = {
       Reservations: undefined,
@@ -182,6 +217,26 @@ describe('terminate runner', () => {
   });
 });
 
+describe('tag runner', () => {
+  beforeEach(() => {
+    jest.clearAllMocks();
+  });
+  it('adding extra tag', async () => {
+    mockEC2Client.on(CreateTagsCommand).resolves({});
+    const runner: RunnerInfo = {
+      instanceId: 'instance-2',
+      owner: 'owner-2',
+      type: 'Repo',
+    };
+    await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'truer' }]);
+
+    expect(mockEC2Client).toHaveReceivedCommandWith(CreateTagsCommand, {
+      Resources: [runner.instanceId],
+      Tags: [{ Key: 'ghr:orphan', Value: 'truer' }],
+    });
+  });
+});
+
 describe('create runner', () => {
   const defaultRunnerConfig: RunnerConfig = {
     allocationStrategy: SpotAllocationStrategy.CAPACITY_OPTIMIZED,

lambdas/functions/control-plane/src/aws/runners.ts

+14 -2
@@ -1,10 +1,12 @@
 import {
   CreateFleetCommand,
   CreateFleetResult,
+  CreateTagsCommand,
   DescribeInstancesCommand,
   DescribeInstancesResult,
   EC2Client,
   FleetLaunchTemplateOverridesRequest,
+  Tag,
   TerminateInstancesCommand,
   _InstanceType,
 } from '@aws-sdk/client-ec2';
@@ -46,6 +48,9 @@ function constructFilters(filters?: Runners.ListRunnerFilters): Ec2Filter[][] {
       ec2FiltersBase.push({ Name: `tag:ghr:Type`, Values: [filters.runnerType] });
       ec2FiltersBase.push({ Name: `tag:ghr:Owner`, Values: [filters.runnerOwner] });
     }
+    if (filters.orphan) {
+      ec2FiltersBase.push({ Name: 'tag:ghr:orphan', Values: ['true'] });
+    }
   }
 
   for (const key of ['tag:ghr:Application']) {
@@ -85,6 +90,7 @@ function getRunnerInfo(runningInstances: DescribeInstancesResult) {
             type: i.Tags?.find((e) => e.Key === 'ghr:Type')?.Value as string,
             repo: i.Tags?.find((e) => e.Key === 'ghr:Repo')?.Value as string,
             org: i.Tags?.find((e) => e.Key === 'ghr:Org')?.Value as string,
+            orphan: i.Tags?.find((e) => e.Key === 'ghr:orphan')?.Value === 'true',
           });
         }
       }
@@ -94,10 +100,16 @@ function getRunnerInfo(runningInstances: DescribeInstancesResult) {
 }
 
 export async function terminateRunner(instanceId: string): Promise<void> {
-  logger.info(`Runner '${instanceId}' will be terminated.`);
+  logger.debug(`Runner '${instanceId}' will be terminated.`);
   const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
   await ec2.send(new TerminateInstancesCommand({ InstanceIds: [instanceId] }));
-  logger.info(`Runner ${instanceId} has been terminated.`);
+  logger.debug(`Runner ${instanceId} has been terminated.`);
+}
+
+export async function tag(instanceId: string, tags: Tag[]): Promise<void> {
+  logger.debug(`Tagging '${instanceId}'`, { tags });
+  const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
+  await ec2.send(new CreateTagsCommand({ Resources: [instanceId], Tags: tags }));
 }
 
 function generateFleetOverrides(
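
The two orphan-related changes in `runners.ts` (the extra `tag:ghr:orphan` filter and the strict `=== 'true'` comparison when reading tags) can be illustrated in isolation. The sketch below is a simplified stand-in, not the repository code: `buildFilters`, `isOrphan`, and the local `Ec2Filter`/`Ec2Tag` interfaces are hypothetical, and the real `constructFilters` returns a nested `Ec2Filter[][]` and handles more filter fields.

```typescript
// Minimal stand-ins for the EC2 SDK's Filter and Tag shapes (assumed for illustration).
interface Ec2Filter {
  Name: string;
  Values: string[];
}

interface Ec2Tag {
  Key?: string;
  Value?: string;
}

interface ListFilters {
  environment?: string;
  orphan?: boolean;
}

// Builds a flat filter list in the same order the filter test above expects.
function buildFilters(filters?: ListFilters): Ec2Filter[] {
  const result: Ec2Filter[] = [{ Name: 'instance-state-name', Values: ['running', 'pending'] }];
  if (filters?.environment !== undefined) {
    result.push({ Name: 'tag:ghr:environment', Values: [filters.environment] });
  }
  if (filters?.orphan) {
    // Only orphan: true adds a filter; orphan: false does not exclude orphans.
    result.push({ Name: 'tag:ghr:orphan', Values: ['true'] });
  }
  result.push({ Name: 'tag:ghr:Application', Values: ['github-action-runner'] });
  return result;
}

// Strict comparison: only the exact string 'true' marks an instance as orphan.
function isOrphan(tags?: Ec2Tag[]): boolean {
  return tags?.find((t) => t.Key === 'ghr:orphan')?.Value === 'true';
}

console.log(buildFilters({ environment: 'dev', orphan: true }).map((f) => f.Name));
console.log(isOrphan([{ Key: 'ghr:orphan', Value: 'true' }]));
```

Under this comparison an instance tagged with any other value, such as the arbitrary `'truer'` used in the `tag` test, would not be reported as orphan, and an instance without the tag reads as `orphan: false`.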
