Commit d483d25

hyeong01 Heeyoung Jung and Heeyoung Jung authored
Add/lifecycle heartbeat (#1116)
* add lifecycle heartbeat
* Lifecycle heartbeat unit test
* Refactor heartbeat logging statements
* Heartbeat e2e test
* Remove error handling for using heartbeat and imds together
* add e2e test for lifecycle heartbeat
* Add check heartbeat timeout and compare to heartbeat interval
* Add error handling for using heartbeat and imds together
* fix config error message
* update error message for heartbeat config
* Fix heartbeat flag explanation
* Update readme for new heartbeat feature
* Fix readme for heartbeat section
* Update readme on the concurrency of heartbeat
* fix: stop heartbeat when target is invalid
* Added heartbeat test for handling invalid lifecycle action
* incorporated unsupoorted error types for unit testing
* fix unit-test: reset heartbeatCallCount each test
* use helper function to reduce repetitive code in heartbeat unit test
* Update readme. Moved heartbeat under Queue Processor
* Fix config.go for better readability and check until < interval
* Update heartbeat to have better logging
* Update unit test to cover whole process of heartbeat start and closure
* Update heartbeat e2e test. Auto-value calculations for future modification
* Add inline comment for heartbeatUntil default behavior
* Fixed e2e variables to have double quotes
* fix readme for heartbeat
* Added new flags in config test
* Fixed typo in heartbeat e2e test

---------

Co-authored-by: Heeyoung Jung <[email protected]>
1 parent 7d52be6 commit d483d25

9 files changed

+604
-18
lines changed

Diff for: README.md

+71 -2
@@ -81,6 +81,75 @@ When using the ASG Lifecycle Hooks, ASG first sends the lifecycle action notific
 #### Queue Processor with Instance State Change Events
 When using the EC2 Console or EC2 API to terminate the instance, a state-change notification is sent and the instance termination is started. EC2 does not wait for a "continue" signal before beginning to terminate the instance. When you terminate an EC2 instance, it should trigger a graceful operating system shutdown which will send a SIGTERM to the kubelet, which will in-turn start shutting down pods by propagating that SIGTERM to the containers on the node. If the containers do not shut down by the kubelet's `podTerminationGracePeriod (k8s default is 30s)`, then it will send a SIGKILL to forcefully terminate the containers. Setting the `podTerminationGracePeriod` to a max of 90sec (probably a bit less than that) will delay the termination of pods, which helps in graceful shutdown.

+#### Issuing Lifecycle Heartbeats
+
+You can set NTH to send heartbeats to ASG in Queue Processor mode. This allows for a much longer grace period for termination (up to 48 hours) than the maximum heartbeat timeout of two hours. The feature is useful when pods require a long time to drain or when you need a shorter heartbeat timeout with a longer grace period.
+
+##### How it works
+
+- When NTH receives an ASG lifecycle termination event, it starts sending heartbeats to ASG to renew the heartbeat timeout associated with the ASG's termination lifecycle hook.
+- The heartbeat timeout acts as a timer that starts when the termination event begins.
+- Until the timeout reaches zero, the termination process is held at the `Terminating:Wait` stage.
+- By issuing heartbeats, the graceful termination duration can be extended up to 48 hours, limited by the global timeout.
+
+##### How to use
+
+- Configure a termination lifecycle hook on ASG (required). Set the heartbeat timeout value to be longer than the `Heartbeat Interval`. Each heartbeat signal resets this timeout, extending the duration that an instance remains in the `Terminating:Wait` state. Without this lifecycle hook, the instance will terminate immediately when a termination event occurs.
+- Configure `Heartbeat Interval` (required) and `Heartbeat Until` (optional). NTH operates normally without heartbeats if neither value is set. If only the interval is specified, `Heartbeat Until` defaults to 172800 seconds (48 hours) and heartbeats will be sent. `Heartbeat Until` must be provided with a valid `Heartbeat Interval`, otherwise NTH will fail to start. Any invalid values (wrong type or out of range) will also prevent NTH from starting.
+
+##### Configurations
+###### `Heartbeat Interval` (Required)
+- Time period between consecutive heartbeat signals (in seconds)
+- Specifying this value enables heartbeats
+- Range: 30 to 3600 seconds (30 seconds to 1 hour)
+- Flag for custom resource definition by *.yaml / helm: `heartbeatInterval`
+- CLI flag: `heartbeat-interval`
+- Default value: none (heartbeats are disabled unless this is set)
+
+###### `Heartbeat Until` (Optional)
+- Duration over which heartbeat signals are sent (in seconds)
+- Must be provided with a valid `Heartbeat Interval`
+- Range: 60 to 172800 seconds (1 minute to 48 hours)
+- Flag for custom resource definition by *.yaml / helm: `heartbeatUntil`
+- CLI flag: `heartbeat-until`
+- Default value: 172800 (48 hours)
+
+###### Example Case
+
+- `Heartbeat Interval`: 1000 seconds
+- `Heartbeat Until`: 4500 seconds
+- `Heartbeat Timeout`: 3000 seconds
+
+| Time (s) | Event | Heartbeat Timeout (HT) | Heartbeat Until (HU) | Action |
+|----------|-------------|------------------|----------------------|--------|
+| 0 | Start | 3000 | 4500 | Termination Event Received |
+| 1000 | HB1 Issued | 2000 -> 3000 | 3500 | Send Heartbeat |
+| 2000 | HB2 Issued | 2000 -> 3000 | 2500 | Send Heartbeat |
+| 3000 | HB3 Issued | 2000 -> 3000 | 1500 | Send Heartbeat |
+| 4000 | HB4 Issued | 2000 -> 3000 | 500 | Send Heartbeat |
+| 4500 | HB Expires | 2500 | 0 | Stop Heartbeats |
+| 7000 | Termination | - | - | Instance Terminates |
+
+Note: The instance can terminate earlier if its pods finish draining and are ready for termination.
+
+##### Example Helm Command
+
+```sh
+helm upgrade --install aws-node-termination-handler \
+  --namespace kube-system \
+  --set enableSqsTerminationDraining=true \
+  --set heartbeatInterval=1000 \
+  --set heartbeatUntil=4500 \
+  # other inputs...
+```
+
+##### Important Notes
+
+- Be aware of the global timeout. Instances cannot remain in a wait state indefinitely. The global timeout is 48 hours or 100 times the heartbeat timeout, whichever is smaller. This is the maximum amount of time that you can keep an instance in the `Terminating:Wait` state.
+- Lifecycle heartbeats are only supported in Queue Processor mode. Setting `enableSqsTerminationDraining=false` and specifying heartbeat flags is prevented in Helm. Directly editing deployment settings to bypass this will cause NTH to fail.
+- The heartbeat interval should be sufficiently shorter than the heartbeat timeout. There's a time gap between instance startup and NTH initialization. Setting the interval just slightly smaller than or equal to the timeout causes the heartbeat timeout to expire before the first heartbeat is issued. Provide adequate buffer time for NTH to complete initialization.
+- Issuing heartbeats is part of the termination process. The maximum number of instances whose termination NTH can handle concurrently is limited by the number of workers. This implies that heartbeats can be issued simultaneously for at most the number of instances specified by the `workers` flag.
+
 ### Which one should I use?
 | Feature | IMDS Processor | Queue Processor |
 | :-------------------------------------------: | :------------: | :-------------: |
@@ -91,6 +160,7 @@ When using the EC2 Console or EC2 API to terminate the instance, a state-change
 | ASG Termination Lifecycle State Change |||
 | AZ Rebalance Recommendation |||
 | Instance State Change Events |||
+| Issue Lifecycle Heartbeats |||

 ### Kubernetes Compatibility

@@ -626,5 +696,4 @@ In IMDS mode, metrics can be collected as follows:
 Contributions are welcome! Please read our [guidelines](https://github.com/aws/aws-node-termination-handler/blob/main/CONTRIBUTING.md) and our [Code of Conduct](https://github.com/aws/aws-node-termination-handler/blob/main/CODE_OF_CONDUCT.md)

 ## License
-This project is licensed under the Apache-2.0 License.
-
+This project is licensed under the Apache-2.0 License.
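
The arithmetic behind the README's "Example Case" table and its global-timeout note can be checked with a short simulation. This is a sketch under stated assumptions, not NTH code: `event`, `globalTimeout`, and `simulate` are hypothetical names, the first heartbeat is assumed to be sent one interval after the termination event (as in the table), and pods are assumed never to finish draining early.

```go
package main

import "fmt"

// event is one row of the simulated timeline (a hypothetical type for this sketch).
type event struct {
	At     int    // seconds since the termination event was received
	Action string // what happens at that moment
}

// globalTimeout returns the cap on time spent in Terminating:Wait as the README
// describes it: 48 hours or 100 times the heartbeat timeout, whichever is smaller.
func globalTimeout(heartbeatTimeout int) int {
	const fortyEightHours = 172800
	if limit := 100 * heartbeatTimeout; limit < fortyEightHours {
		return limit
	}
	return fortyEightHours
}

// simulate lists each heartbeat and the time at which the lifecycle hook would
// expire. All arguments are in seconds.
func simulate(interval, until, timeout int) []event {
	if interval <= 0 || timeout <= 0 {
		return nil
	}
	var events []event
	lastReset := 0 // time of the most recent heartbeat-timeout reset
	for t := interval; t <= until; t += interval {
		if t-lastReset >= timeout {
			break // the hook expired before this heartbeat could be sent
		}
		events = append(events, event{At: t, Action: "Send Heartbeat"})
		lastReset = t
	}
	expiry := lastReset + timeout
	if limit := globalTimeout(timeout); expiry > limit {
		expiry = limit
	}
	return append(events, event{At: expiry, Action: "Instance Terminates"})
}

func main() {
	// Values from the README example: interval 1000, until 4500, timeout 3000.
	for _, e := range simulate(1000, 4500, 3000) {
		fmt.Printf("%5d  %s\n", e.At, e.Action)
	}
}
```

With the README's values, the last heartbeat falls at 4000 seconds and the hook expires 3000 seconds later, which is where the table's 7000-second termination row comes from.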

Diff for: config/helm/aws-node-termination-handler/templates/deployment.yaml

+4
@@ -168,6 +168,10 @@ spec:
   value: {{ .Values.deleteSqsMsgIfNodeNotFound | quote }}
 - name: WORKERS
   value: {{ .Values.workers | quote }}
+- name: HEARTBEAT_INTERVAL
+  value: {{ .Values.heartbeatInterval | quote }}
+- name: HEARTBEAT_UNTIL
+  value: {{ .Values.heartbeatUntil | quote }}
 {{- with .Values.extraEnv }}
 {{- toYaml . | nindent 12 }}
 {{- end }}

Diff for: pkg/config/config.go

+35 -1
@@ -112,6 +112,9 @@ const (
 	queueURLConfigKey = "QUEUE_URL"
 	completeLifecycleActionDelaySecondsKey = "COMPLETE_LIFECYCLE_ACTION_DELAY_SECONDS"
 	deleteSqsMsgIfNodeNotFoundKey = "DELETE_SQS_MSG_IF_NODE_NOT_FOUND"
+	// heartbeat
+	heartbeatIntervalKey = "HEARTBEAT_INTERVAL"
+	heartbeatUntilKey = "HEARTBEAT_UNTIL"
 )

 // Config arguments set via CLI, environment variables, or defaults
@@ -166,6 +169,8 @@ type Config struct {
 	CompleteLifecycleActionDelaySeconds int
 	DeleteSqsMsgIfNodeNotFound bool
 	UseAPIServerCacheToListPods bool
+	HeartbeatInterval int
+	HeartbeatUntil int
 }

 // ParseCliArgs parses cli arguments and uses environment variables as fallback values
@@ -230,6 +235,8 @@ func ParseCliArgs() (config Config, err error) {
 	flag.IntVar(&config.CompleteLifecycleActionDelaySeconds, "complete-lifecycle-action-delay-seconds", getIntEnv(completeLifecycleActionDelaySecondsKey, -1), "Delay completing the Autoscaling lifecycle action after a node has been drained.")
 	flag.BoolVar(&config.DeleteSqsMsgIfNodeNotFound, "delete-sqs-msg-if-node-not-found", getBoolEnv(deleteSqsMsgIfNodeNotFoundKey, false), "If true, delete SQS Messages from the SQS Queue if the targeted node(s) are not found.")
 	flag.BoolVar(&config.UseAPIServerCacheToListPods, "use-apiserver-cache", getBoolEnv(useAPIServerCache, false), "If true, leverage the k8s apiserver's index on pod's spec.nodeName to list pods on a node, instead of doing an etcd quorum read.")
+	flag.IntVar(&config.HeartbeatInterval, "heartbeat-interval", getIntEnv(heartbeatIntervalKey, -1), "The time period in seconds between consecutive heartbeat signals. Valid range: 30-3600 seconds (30 seconds to 1 hour).")
+	flag.IntVar(&config.HeartbeatUntil, "heartbeat-until", getIntEnv(heartbeatUntilKey, -1), "The duration in seconds over which heartbeat signals are sent. Valid range: 60-172800 seconds (1 minute to 48 hours).")
 	flag.Parse()

 	if isConfigProvided("pod-termination-grace-period", podTerminationGracePeriodConfigKey) && isConfigProvided("grace-period", gracePeriodConfigKey) {
@@ -274,6 +281,27 @@ func ParseCliArgs() (config Config, err error) {
 		panic("You must provide a node-name to the CLI or NODE_NAME environment variable.")
 	}

+	// heartbeat value boundary and compatibility check
+	if !config.EnableSQSTerminationDraining && (config.HeartbeatInterval != -1 || config.HeartbeatUntil != -1) {
+		return config, fmt.Errorf("currently using IMDS mode. Heartbeat is only supported for Queue Processor mode")
+	}
+	if config.HeartbeatInterval != -1 && (config.HeartbeatInterval < 30 || config.HeartbeatInterval > 3600) {
+		return config, fmt.Errorf("invalid heartbeat-interval passed: %d Should be between 30 and 3600 seconds", config.HeartbeatInterval)
+	}
+	if config.HeartbeatUntil != -1 && (config.HeartbeatUntil < 60 || config.HeartbeatUntil > 172800) {
+		return config, fmt.Errorf("invalid heartbeat-until passed: %d Should be between 60 and 172800 seconds", config.HeartbeatUntil)
+	}
+	if config.HeartbeatInterval == -1 && config.HeartbeatUntil != -1 {
+		return config, fmt.Errorf("invalid heartbeat configuration: heartbeat-interval is required when heartbeat-until is set")
+	}
+	if config.HeartbeatInterval != -1 && config.HeartbeatUntil == -1 {
+		config.HeartbeatUntil = 172800
+		log.Info().Msgf("Since heartbeat-until is not set, defaulting to %d seconds", config.HeartbeatUntil)
+	}
+	if config.HeartbeatInterval != -1 && config.HeartbeatUntil != -1 && config.HeartbeatInterval > config.HeartbeatUntil {
+		return config, fmt.Errorf("invalid heartbeat configuration: heartbeat-interval should be less than or equal to heartbeat-until")
+	}
+
 	// client-go expects these to be set in env vars
 	os.Setenv(kubernetesServiceHostConfigKey, config.KubernetesServiceHost)
 	os.Setenv(kubernetesServicePortConfigKey, config.KubernetesServicePort)
@@ -332,6 +360,8 @@ func (c Config) PrintJsonConfigArgs() {
 		Str("ManagedTag", c.ManagedTag).
 		Bool("use_provider_id", c.UseProviderId).
 		Bool("use_apiserver_cache", c.UseAPIServerCacheToListPods).
+		Int("heartbeat_interval", c.HeartbeatInterval).
+		Int("heartbeat_until", c.HeartbeatUntil).
 		Msg("aws-node-termination-handler arguments")
 }

@@ -383,7 +413,9 @@ func (c Config) PrintHumanConfigArgs() {
 			"\tmanaged-tag: %s,\n"+
 			"\tuse-provider-id: %t,\n"+
 			"\taws-endpoint: %s,\n"+
-			"\tuse-apiserver-cache: %t,\n",
+			"\tuse-apiserver-cache: %t,\n"+
+			"\theartbeat-interval: %d,\n"+
+			"\theartbeat-until: %d\n",
 		c.DryRun,
 		c.NodeName,
 		c.PodName,
@@ -424,6 +456,8 @@ func (c Config) PrintHumanConfigArgs() {
 		c.UseProviderId,
 		c.AWSEndpoint,
 		c.UseAPIServerCacheToListPods,
+		c.HeartbeatInterval,
+		c.HeartbeatUntil,
 	)
 }
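
The boundary and compatibility checks added to `ParseCliArgs` can be exercised in isolation. The sketch below restates them as a pure function in the same order as the diff; `checkHeartbeat` and the constant names are hypothetical, the error strings are shortened, and the real code works on the `Config` struct and logs the default rather than returning it.

```go
package main

import "fmt"

const (
	unset       = -1     // sentinel used by the flags when no value is given
	minInterval = 30     // seconds
	maxInterval = 3600   // seconds
	minUntil    = 60     // seconds
	maxUntil    = 172800 // seconds (48 hours), also the default for "until"
)

// checkHeartbeat mirrors the order of the checks in the diff and returns the
// effective heartbeat-until value, or an error if the combination is invalid.
func checkHeartbeat(queueMode bool, interval, until int) (int, error) {
	if !queueMode && (interval != unset || until != unset) {
		return until, fmt.Errorf("heartbeat is only supported in Queue Processor mode")
	}
	if interval != unset && (interval < minInterval || interval > maxInterval) {
		return until, fmt.Errorf("invalid heartbeat-interval: %d", interval)
	}
	if until != unset && (until < minUntil || until > maxUntil) {
		return until, fmt.Errorf("invalid heartbeat-until: %d", until)
	}
	if interval == unset && until != unset {
		return until, fmt.Errorf("heartbeat-interval is required when heartbeat-until is set")
	}
	if interval != unset && until == unset {
		until = maxUntil // default applied when only the interval is given
	}
	if interval != unset && interval > until {
		return until, fmt.Errorf("heartbeat-interval must not exceed heartbeat-until")
	}
	return until, nil
}

func main() {
	fmt.Println(checkHeartbeat(true, 1000, unset))
	fmt.Println(checkHeartbeat(true, unset, 4500))
	fmt.Println(checkHeartbeat(false, 1000, 4500))
}
```

Keeping the checks in a pure function like this would make table-driven tests straightforward, though that is a design suggestion rather than something the commit does.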

Diff for: pkg/config/config_test.go

+19 -4
@@ -37,7 +37,7 @@ func TestParseCliArgsEnvSuccess(t *testing.T) {
 	t.Setenv("ENABLE_SCHEDULED_EVENT_DRAINING", "true")
 	t.Setenv("ENABLE_SPOT_INTERRUPTION_DRAINING", "false")
 	t.Setenv("ENABLE_ASG_LIFECYCLE_DRAINING", "false")
-	t.Setenv("ENABLE_SQS_TERMINATION_DRAINING", "false")
+	t.Setenv("ENABLE_SQS_TERMINATION_DRAINING", "true")
 	t.Setenv("ENABLE_REBALANCE_MONITORING", "true")
 	t.Setenv("ENABLE_REBALANCE_DRAINING", "true")
 	t.Setenv("GRACE_PERIOD", "12345")
@@ -54,6 +54,8 @@ func TestParseCliArgsEnvSuccess(t *testing.T) {
 	t.Setenv("METADATA_TRIES", "100")
 	t.Setenv("CORDON_ONLY", "false")
 	t.Setenv("USE_APISERVER_CACHE", "true")
+	t.Setenv("HEARTBEAT_INTERVAL", "30")
+	t.Setenv("HEARTBEAT_UNTIL", "60")
 	nthConfig, err := config.ParseCliArgs()
 	h.Ok(t, err)

@@ -64,7 +66,7 @@ func TestParseCliArgsEnvSuccess(t *testing.T) {
 	h.Equals(t, true, nthConfig.EnableScheduledEventDraining)
 	h.Equals(t, false, nthConfig.EnableSpotInterruptionDraining)
 	h.Equals(t, false, nthConfig.EnableASGLifecycleDraining)
-	h.Equals(t, false, nthConfig.EnableSQSTerminationDraining)
+	h.Equals(t, true, nthConfig.EnableSQSTerminationDraining)
 	h.Equals(t, true, nthConfig.EnableRebalanceMonitoring)
 	h.Equals(t, true, nthConfig.EnableRebalanceDraining)
 	h.Equals(t, false, nthConfig.IgnoreDaemonSets)
@@ -80,6 +82,8 @@ func TestParseCliArgsEnvSuccess(t *testing.T) {
 	h.Equals(t, 100, nthConfig.MetadataTries)
 	h.Equals(t, false, nthConfig.CordonOnly)
 	h.Equals(t, true, nthConfig.UseAPIServerCacheToListPods)
+	h.Equals(t, 30, nthConfig.HeartbeatInterval)
+	h.Equals(t, 60, nthConfig.HeartbeatUntil)

 	// Check that env vars were set
 	value, ok := os.LookupEnv("KUBERNETES_SERVICE_HOST")
@@ -101,7 +105,7 @@ func TestParseCliArgsSuccess(t *testing.T) {
 		"--enable-scheduled-event-draining=true",
 		"--enable-spot-interruption-draining=false",
 		"--enable-asg-lifecycle-draining=false",
-		"--enable-sqs-termination-draining=false",
+		"--enable-sqs-termination-draining=true",
 		"--enable-rebalance-monitoring=true",
 		"--enable-rebalance-draining=true",
 		"--ignore-daemon-sets=false",
@@ -117,6 +121,8 @@ func TestParseCliArgsSuccess(t *testing.T) {
 		"--metadata-tries=100",
 		"--cordon-only=false",
 		"--use-apiserver-cache=true",
+		"--heartbeat-interval=30",
+		"--heartbeat-until=60",
 	}
 	nthConfig, err := config.ParseCliArgs()
 	h.Ok(t, err)
@@ -128,7 +134,7 @@ func TestParseCliArgsSuccess(t *testing.T) {
 	h.Equals(t, true, nthConfig.EnableScheduledEventDraining)
 	h.Equals(t, false, nthConfig.EnableSpotInterruptionDraining)
 	h.Equals(t, false, nthConfig.EnableASGLifecycleDraining)
-	h.Equals(t, false, nthConfig.EnableSQSTerminationDraining)
+	h.Equals(t, true, nthConfig.EnableSQSTerminationDraining)
 	h.Equals(t, true, nthConfig.EnableRebalanceMonitoring)
 	h.Equals(t, true, nthConfig.EnableRebalanceDraining)
 	h.Equals(t, false, nthConfig.IgnoreDaemonSets)
@@ -145,6 +151,8 @@ func TestParseCliArgsSuccess(t *testing.T) {
 	h.Equals(t, false, nthConfig.CordonOnly)
 	h.Equals(t, false, nthConfig.EnablePrometheus)
 	h.Equals(t, true, nthConfig.UseAPIServerCacheToListPods)
+	h.Equals(t, 30, nthConfig.HeartbeatInterval)
+	h.Equals(t, 60, nthConfig.HeartbeatUntil)

 	// Check that env vars were set
 	value, ok := os.LookupEnv("KUBERNETES_SERVICE_HOST")
@@ -176,6 +184,9 @@ func TestParseCliArgsOverrides(t *testing.T) {
 	t.Setenv("WEBHOOK_TEMPLATE", "no")
 	t.Setenv("METADATA_TRIES", "100")
 	t.Setenv("CORDON_ONLY", "true")
+	t.Setenv("HEARTBEAT_INTERVAL", "3601")
+	t.Setenv("HEARTBEAT_UNTIL", "172801")
+
 	os.Args = []string{
 		"cmd",
 		"--use-provider-id=false",
@@ -201,6 +212,8 @@ func TestParseCliArgsOverrides(t *testing.T) {
 		"--cordon-only=false",
 		"--enable-prometheus-server=true",
 		"--prometheus-server-port=2112",
+		"--heartbeat-interval=3600",
+		"--heartbeat-until=172800",
 	}
 	nthConfig, err := config.ParseCliArgs()
 	h.Ok(t, err)
@@ -229,6 +242,8 @@ func TestParseCliArgsOverrides(t *testing.T) {
 	h.Equals(t, false, nthConfig.CordonOnly)
 	h.Equals(t, true, nthConfig.EnablePrometheus)
 	h.Equals(t, 2112, nthConfig.PrometheusPort)
+	h.Equals(t, 3600, nthConfig.HeartbeatInterval)
+	h.Equals(t, 172800, nthConfig.HeartbeatUntil)

 	// Check that env vars were set
 	value, ok := os.LookupEnv("KUBERNETES_SERVICE_HOST")
