
Commit 9607ca6

npalm, philips-labs-pr|bot, mpas, stuartp44 authored
feat: support AWS EventBridge (#4188)
## Description

This PR introduces the AWS EventBridge. The EventBridge can be enabled with the option `webhook_mode`, which can be set to either `direct` or `eventbridge`. In `direct` mode the old way of handling events still applies. When the mode is set to `eventbridge`, events are published on the AWS EventBridge, which is not limited to the `workflow_job` event with status `queued`. Via a target rule, events relevant for scaling are sent to the dispatcher lambda, which distributes them to an SQS queue for scaling.

## Todo

- [x] Refactor lambda and add EventBridge
- [x] Refactor webhook module (TF) to support EventBridge
- [x] Test example default
- [x] Test example multi runner
- [x] Adjust docs
- [x] Reduce permissions on webhook and dispatcher lambda for eventbridge mode
- [x] Add configuration for allowed events on the EventBridge
- [ ] Add support for CMK (encryption) to EventBridge #4192

## Migration directions

The change is backwards compatible but will recreate resources managed by the internal module webhook. The only resource containing data is the CloudWatch LogGroup. To retain the log group you can run a terraform state move, or add a `moved` block to your deployment.

### Migrating to this version

With module defaults, or when eventbridge is not enabled:

```hcl
# log group
moved {
  from = module.<runner-module-name>.module.webhook.aws_cloudwatch_log_group.webhook
  to   = module.<runner-module-name>.module.webhook.module.direct[0].aws_cloudwatch_log_group.webhook
}

# lambda
moved {
  from = module.<runner-module-name>.module.webhook.aws_lambda_function.webhook
  to   = module.<runner-module-name>.module.webhook.module.direct[0].aws_lambda_function.webhook
}
```

Or with `webhook_mode = eventbridge`:

```hcl
# log group
moved {
  from = module.<runner-module-name>.module.webhook.aws_cloudwatch_log_group.webhook
  to   = module.<runner-module-name>.module.webhook.module.eventbridge[0].aws_cloudwatch_log_group.webhook
}

# lambda
moved {
  from = module.<runner-module-name>.module.webhook.aws_lambda_function.webhook
  to   = module.<runner-module-name>.module.webhook.module.eventbridge[0].aws_lambda_function.webhook
}
```

### When switching between direct and eventbridge

When enabling mode `eventbridge`:

```hcl
# log group
moved {
  from = module.runners.module.webhook.module.direct[0].aws_cloudwatch_log_group.webhook
  to   = module.runners.module.webhook.module.eventbridge[0].aws_cloudwatch_log_group.webhook
}

# lambda
moved {
  from = module.runners.module.webhook.module.direct[0].aws_lambda_function.webhook
  to   = module.runners.module.webhook.module.eventbridge[0].aws_lambda_function.webhook
}
```

Or vice versa when moving from `eventbridge` back to `direct`.

---------

Co-authored-by: philips-labs-pr|bot <philips-labs-pr[bot]@users.noreply.github.com>
Co-authored-by: Marco Pas <[email protected]>
Co-authored-by: Stuart Pearson <[email protected]>
1 parent 556f00b commit 9607ca6


60 files changed, +2540 -482 lines changed

Diff for: .terraform.lock.hcl

+21-1

Diff for: README.md

+2-1
Original file line numberDiff line numberDiff line change
@@ -156,7 +156,8 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
156156
| <a name="input_enable_ssm_on_runners"></a> [enable\_ssm\_on\_runners](#input\_enable\_ssm\_on\_runners) | Enable to allow access to the runner instances for debugging purposes via SSM. Note that this adds additional permissions to the runner instances. | `bool` | `false` | no |
157157
| <a name="input_enable_user_data_debug_logging_runner"></a> [enable\_user\_data\_debug\_logging\_runner](#input\_enable\_user\_data\_debug\_logging\_runner) | Option to enable debug logging for user-data, this logs all secrets as well. | `bool` | `false` | no |
158158
| <a name="input_enable_userdata"></a> [enable\_userdata](#input\_enable\_userdata) | Should the userdata script be enabled for the runner. Set this to false if you are using your own prebuilt AMI. | `bool` | `true` | no |
159-
| <a name="input_enable_workflow_job_events_queue"></a> [enable\_workflow\_job\_events\_queue](#input\_enable\_workflow\_job\_events\_queue) | Enabling this experimental feature will create a secondory sqs queue to which a copy of the workflow\_job event will be delivered. | `bool` | `false` | no |
159+
| <a name="input_enable_workflow_job_events_queue"></a> [enable\_workflow\_job\_events\_queue](#input\_enable\_workflow\_job\_events\_queue) | Enabling this experimental feature will create a secondary SQS queue to which a copy of the workflow\_job event will be delivered. | `bool` | `false` | no |
160+
| <a name="input_eventbridge"></a> [eventbridge](#input\_eventbridge) | Enable the use of EventBridge by the module. When this feature is enabled, the webhook puts events on the EventBridge instead of dispatching them directly to the queues for scaling.<br/><br/> `enable`: Enable the EventBridge feature.<br/> `accept_events`: List of events that are allowed to be put on the EventBridge. By default all events are accepted; an empty list is interpreted as all events. | <pre>object({<br/> enable = optional(bool, false)<br/> accept_events = optional(list(string), null)<br/> })</pre> | `{}` | no |
160161
| <a name="input_ghes_ssl_verify"></a> [ghes\_ssl\_verify](#input\_ghes\_ssl\_verify) | GitHub Enterprise SSL verification. Set to 'false' when custom certificate (chains) is used for GitHub Enterprise Server (insecure). | `bool` | `true` | no |
161162
| <a name="input_ghes_url"></a> [ghes\_url](#input\_ghes\_url) | GitHub Enterprise Server URL. Example: https://github.internal.co - DO NOT SET IF USING PUBLIC GITHUB | `string` | `null` | no |
162163
| <a name="input_github_app"></a> [github\_app](#input\_github\_app) | GitHub app parameters, see your github app. Ensure the key is the base64-encoded `.pem` file (the output of `base64 app.private-key.pem`, not the content of `private-key.pem`). | <pre>object({<br/> key_base64 = string<br/> id = string<br/> webhook_secret = string<br/> })</pre> | n/a | yes |

Diff for: docs/assets/aws-architecture.dark.png

103 KB

Diff for: docs/assets/aws-architecture.light.png

98.5 KB

Diff for: docs/configuration.md

+76-1
@@ -6,7 +6,7 @@ To be able to support a number of use-cases, the module has quite a lot of confi
66

77
- Org vs Repo level. You can configure the module to connect the runners in GitHub on an org level and share the runners in your org, or set the runners on repo level and the module will install the runner to the repo. There can be multiple repos but runners are not shared between repos.
88
- Multi-Runner module. This modules allows you to create multiple runner configurations with a single webhook and single GitHub App to simplify deployment of different types of runners. Check the detailed module [documentation](modules/public/multi-runner.md) for more information or checkout the [multi-runner example](examples/multi-runner.md).
9-
- Workflow job event. You can configure the webhook in GitHub to send workflow job events to the webhook. Workflow job events were introduced by GitHub in September 2021 and are designed to support scalable runners. We advise using the workflow job event when possible.
9+
- Webhook mode. The module can be deployed in `direct` mode or `EventBridge` (Experimental) mode. The `direct` mode is the default and distributes events directly to SQS for the scale-up lambda. The `EventBridge` mode publishes the events to an event bus; a rule then directs the received events to a dispatch lambda, which sends them to the SQS queue. The `EventBridge` mode is disabled by default. It is useful when you want more control over the events and potentially want to filter them, for example to build a data lake, build metrics, or act on `workflow_job` job started events.
1010
- Linux vs Windows. You can configure the OS types linux and win. Linux will be used by default.
1111
- Re-use vs Ephemeral. By default runners are re-used, until detected idle. Once idle they will be removed from the pool. To improve security we are introducing ephemeral runners. Those runners are only used for one job. Ephemeral runners only work in combination with the workflow job event. For ephemeral runners the lambda requests a JIT (just in time) configuration via the GitHub API to register the runner. [JIT configuration](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-just-in-time-runners) is limited to ephemeral runners (and currently not supported by GHES). For non-ephemeral runners, a registration token is always requested. In both cases the configuration is made available to the instance via the same SSM parameter. To disable JIT configuration for ephemeral runners set `enable_jit_config` to `false`. We also suggest using a pre-build AMI to improve the start time of jobs for ephemeral runners.
1212
- Job retry (**Beta**). By default the scale-up lambda will discard the message when it is handled. Meaning in the ephemeral use-case an instance is created. The created runner will ask GitHub for a job, no guarantee it will run the job for which it was scaling. Result could be that with small system hick-up the job is keeping waiting for a runner. Enable a pool (org runners) is one option to avoid this problem. Another option is to enable the job retry function. Which will retry the job after a delay for a configured number of times.
@@ -259,8 +259,83 @@ Below an example of the the log messages created.
259259
}
260260
```
261261

262+
### EventBridge
263+
264+
This module can be deployed using the `EventBridge` mode (Experimental). The `EventBridge` mode publishes events to an event bus. Within the event bus, a target rule sends events to the dispatch lambda. The `EventBridge` mode is disabled by default.
265+
266+
Example to use the EventBridge:
267+
268+
```hcl
269+
270+
module "runners" {
271+
source = "philips-labs/github-runners/aws"
272+
273+
...
274+
eventbridge = {
275+
enable = true
276+
}
277+
...
278+
}
279+
280+
locals {
281+
event_bus_name = module.runners.webhook.eventbridge.event_bus.name
282+
}
283+
284+
resource "aws_cloudwatch_event_rule" "example" {
285+
name = "${local.prefix}-github-events-all"
286+
description = "Capture all GitHub events"
287+
event_bus_name = local.event_bus_name
288+
event_pattern = <<EOF
289+
{
290+
"source": [{
291+
"prefix": "github"
292+
}]
293+
}
294+
EOF
295+
}
296+
297+
resource "aws_cloudwatch_event_target" "main" {
298+
rule = aws_cloudwatch_event_rule.example.name
299+
arn = <arn of target>
300+
event_bus_name = local.event_bus_name
301+
role_arn = aws_iam_role.event_rule_firehose_role.arn
302+
}
303+
304+
data "aws_iam_policy_document" "event_rule_firehose_role" {
305+
statement {
306+
actions = ["sts:AssumeRole"]
307+
308+
principals {
309+
type = "Service"
310+
identifiers = ["events.amazonaws.com"]
311+
}
312+
}
313+
}
314+
315+
resource "aws_iam_role" "event_rule_firehose_role" {
316+
name = "${local.prefix}-eventbridge-github-rule"
317+
assume_role_policy = data.aws_iam_policy_document.event_rule_firehose_role.json
318+
}
319+
320+
data "aws_iam_policy_document" "firehose_stream" {
321+
statement {
322+
INSERT_YOUR_POLICY_HERE_TO_ACCESS_THE_TARGET
323+
}
324+
}
325+
326+
resource "aws_iam_role_policy" "event_rule_firehose_role" {
327+
name = "target-event-rule-firehose"
328+
role = aws_iam_role.event_rule_firehose_role.name
329+
policy = data.aws_iam_policy_document.firehose_stream.json
330+
}
331+
```
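The example above captures every event whose source starts with `github`. A narrower rule could match only started jobs, one of the use cases mentioned for the `EventBridge` mode. The following is a minimal sketch, not part of this change: it assumes the webhook publishes the GitHub event name as `detail-type` and the webhook payload as `detail`, which this diff does not confirm. The resource name is hypothetical, and `local.prefix` and `local.event_bus_name` are assumed to be defined as in the example above.

```hcl
# Hypothetical rule: match only workflow_job events for jobs that have started.
resource "aws_cloudwatch_event_rule" "workflow_job_started" {
  name           = "${local.prefix}-github-workflow-job-started"
  description    = "Capture workflow_job events for jobs that started"
  event_bus_name = local.event_bus_name

  event_pattern = jsonencode({
    # Same source filter as the example above.
    source = [{ prefix = "github" }]
    # Assumption: the GitHub event name is published as the detail-type.
    "detail-type" = ["workflow_job"]
    # Assumption: the GitHub payload is published as the event detail.
    detail = {
      action = ["in_progress"]
    }
  })
}
```

A target for this rule can be attached in the same way as `aws_cloudwatch_event_target.main` in the example above.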
332+
262333
### Queue to publish workflow job events
263334

335+
!!! warning "Deprecated"
336+
337+
This feature will be removed now that the EventBridge is being introduced. The same functionality can be implemented by adding a rule to the EventBridge that forwards `workflow_job` events to the SQS queue.
338+
264339
This queue is an experimental feature to allow you to receive a copy of the `workflow_job` events sent by the GitHub App. This can be used to calculate a matrix or monitor the system.
265340

266341
To enable the feature set `enable_workflow_job_events_queue = true`. Be aware though, this feature is experimental!
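The deprecation note above states that the same functionality can be implemented with an EventBridge rule that forwards `workflow_job` events to an SQS queue. A minimal sketch of that idea follows. It is not part of this change; all resource names and the queue name are hypothetical, the event bus output is taken from the EventBridge example in this document, and the `detail-type` value is an assumption about how the webhook publishes events.

```hcl
locals {
  # Output name as used in the EventBridge example in this document.
  event_bus_name = module.runners.webhook.eventbridge.event_bus.name
}

# Hypothetical queue that receives a copy of the workflow_job events.
resource "aws_sqs_queue" "workflow_job_events" {
  name = "github-workflow-job-events"
}

resource "aws_cloudwatch_event_rule" "workflow_job" {
  name           = "github-workflow-job-to-sqs"
  description    = "Forward GitHub workflow_job events to SQS"
  event_bus_name = local.event_bus_name

  event_pattern = jsonencode({
    source = [{ prefix = "github" }]
    # Assumption: the GitHub event name is published as the detail-type.
    "detail-type" = ["workflow_job"]
  })
}

resource "aws_cloudwatch_event_target" "workflow_job_sqs" {
  rule           = aws_cloudwatch_event_rule.workflow_job.name
  event_bus_name = local.event_bus_name
  arn            = aws_sqs_queue.workflow_job_events.arn
}

# SQS targets are authorized through the queue policy rather than an IAM role.
data "aws_iam_policy_document" "allow_eventbridge" {
  statement {
    actions   = ["sqs:SendMessage"]
    resources = [aws_sqs_queue.workflow_job_events.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }

    # Only the rule defined above may send messages to the queue.
    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [aws_cloudwatch_event_rule.workflow_job.arn]
    }
  }
}

resource "aws_sqs_queue_policy" "workflow_job_events" {
  queue_url = aws_sqs_queue.workflow_job_events.id
  policy    = data.aws_iam_policy_document.allow_eventbridge.json
}
```

Consumers of the queue would then read the EventBridge event envelope, so the GitHub payload is expected under the `detail` key rather than at the top level of the message.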

Diff for: docs/index.md

+1-1
@@ -31,7 +31,7 @@ The diagram below shows the architecture of the module, groups are indicating th
3131

3232
### Webhook
3333

34-
The moment a GitHub action workflow requiring a `self-hosted` runner is triggered, GitHub will try to find a runner which can execute the workload. See [additional notes](additional_notes.md) for how the selection is made. This module reacts to GitHub's [`workflow_job` event](https://docs.github.com/en/free-pro-team@latest/developers/webhooks-and-events/webhook-events-and-payloads#workflow_job) for the triggered workflow and creates a new runner if necessary.
34+
The moment a GitHub action workflow requiring a `self-hosted` runner is triggered, GitHub will try to find a runner which can execute the workload. See [additional notes](additional_notes.md) for how the selection is made. The module can be deployed in two modes. In the first mode, `direct`, the module accepts the [`workflow_job` event](https://docs.github.com/en/free-pro-team@latest/developers/webhooks-and-events/webhook-events-and-payloads#workflow_job) and dispatches it to an SQS queue on which the scale-up function acts. The second mode, `eventbridge`, funnels events via the AWS EventBridge. The EventBridge makes it possible to act on events other than the `workflow_job` event with status `queued`. In addition, the EventBridge supports replay functionality. Future extensions that act on events or create a data lake will rely on the EventBridge.
3535

3636
For receiving the `workflow_job` event by the webhook (lambda), a webhook needs to be created in GitHub. The same app as for API calls can be used to create the webhook. Or a dedicated webhook can be defined.
3737

Diff for: examples/default/main.tf

+14-1
@@ -97,8 +97,21 @@ module "runners" {
9797
# prefix GitHub runners with the environment name
9898
runner_name_prefix = "${local.environment}_"
9999

100+
# webhook supports two modes, either direct or via the eventbridge, uncomment to enable eventbridge
101+
# eventbridge = {
102+
# enable = true
103+
# # adjust the accepted events to only allow specific events, like workflow_job
104+
# # accept_events = ["workflow_job"]
105+
# }
106+
100107
# Enable debug logging for the lambda functions
101-
log_level = "info"
108+
# log_level = "debug"
109+
110+
# tracing_config = {
111+
# mode = "Active"
112+
# capture_error = true
113+
# capture_http_requests = true
114+
# }
102115

103116
enable_ami_housekeeper = true
104117
ami_housekeeper_cleanup_config = {

Diff for: examples/multi-runner/main.tf

+8
@@ -77,6 +77,14 @@ module "runners" {
7777
id = var.github_app.id
7878
webhook_secret = random_id.random.hex
7979
}
80+
81+
# Deploy webhook using the EventBridge
82+
eventbridge = {
83+
enable = true
84+
# adjust the accepted events to only allow specific events, like workflow_job
85+
accept_events = ["workflow_job"]
86+
}
87+
8088
# enable this section for tracing
8189
# tracing_config = {
8290
# mode = "Active"

Diff for: lambdas/functions/ami-housekeeper/package.json

+1-1
@@ -23,7 +23,7 @@
2323
"@typescript-eslint/eslint-plugin": "^8.9.0",
2424
"@typescript-eslint/parser": "^8.11.0",
2525
"@vercel/ncc": "^0.38.1",
26-
"aws-sdk-client-mock": "^4.0.2",
26+
"aws-sdk-client-mock": "^4.1.0",
2727
"aws-sdk-client-mock-jest": "^4.1.0",
2828
"eslint": "^8.57.0",
2929
"eslint-plugin-prettier": "5.2.1",

Diff for: lambdas/functions/control-plane/package.json

+1-1
@@ -23,7 +23,7 @@
2323
"@typescript-eslint/eslint-plugin": "^8.9.0",
2424
"@typescript-eslint/parser": "^8.11.0",
2525
"@vercel/ncc": "^0.38.1",
26-
"aws-sdk-client-mock": "^4.0.2",
26+
"aws-sdk-client-mock": "^4.1.0",
2727
"aws-sdk-client-mock-jest": "^4.1.0",
2828
"eslint": "^8.57.0",
2929
"eslint-plugin-prettier": "5.2.1",

Diff for: lambdas/functions/gh-agent-syncer/package.json

+1-1
@@ -24,7 +24,7 @@
2424
"@typescript-eslint/eslint-plugin": "^8.9.0",
2525
"@typescript-eslint/parser": "^8.11.0",
2626
"@vercel/ncc": "^0.38.1",
27-
"aws-sdk-client-mock": "^4.0.2",
27+
"aws-sdk-client-mock": "^4.1.0",
2828
"aws-sdk-client-mock-jest": "^4.1.0",
2929
"eslint": "^8.57.0",
3030
"eslint-plugin-prettier": "5.2.1",

Diff for: lambdas/functions/termination-watcher/package.json

+1-1
@@ -21,7 +21,7 @@
2121
"@typescript-eslint/eslint-plugin": "^8.9.0",
2222
"@typescript-eslint/parser": "^8.11.0",
2323
"@vercel/ncc": "^0.38.1",
24-
"aws-sdk-client-mock": "^4.0.2",
24+
"aws-sdk-client-mock": "^4.1.0",
2525
"aws-sdk-client-mock-jest": "^4.1.0",
2626
"eslint": "^8.57.0",
2727
"eslint-plugin-prettier": "5.2.1",

Diff for: lambdas/functions/webhook/jest.config.ts

+2-2
@@ -6,10 +6,10 @@ const config: Config = {
66
...defaultConfig,
77
coverageThreshold: {
88
global: {
9-
statements: 99.2,
9+
statements: 99.58,
1010
branches: 100,
1111
functions: 100,
12-
lines: 99.25,
12+
lines: 99.57,
1313
},
1414
},
1515
};

Diff for: lambdas/functions/webhook/package.json

+1
@@ -16,6 +16,7 @@
1616
"all": "yarn build && yarn format && yarn lint && yarn test"
1717
},
1818
"devDependencies": {
19+
"@aws-sdk/client-eventbridge": "^3.670.0",
1920
"@trivago/prettier-plugin-sort-imports": "^4.3.0",
2021
"@types/aws-lambda": "^8.10.145",
2122
"@types/express": "^4.17.21",
