Commit 80b13b5

[docs] Update documentation to cover some recent improvements
1 parent 3dfe309 commit 80b13b5

7 files changed, +117 -11 lines changed

Diff for: README.md

+1
@@ -17,6 +17,7 @@ Check our [quick-start](https://nightlies.apache.org/flink/flink-kubernetes-oper
1717
- Upgrade, suspend and delete deployments
1818
- Full logging and metrics integration
1919
- Flexible deployments and native integration with Kubernetes tooling
20+
- Flink Job Autoscaler
2021

2122
For the complete feature-set please refer to our [documentation](https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/).
2223

Diff for: docs/content/docs/concepts/overview.md

+7-3
@@ -36,7 +36,7 @@ Flink Kubernetes Operator aims to capture the responsibilities of a human operat
3636
- Stateful and stateless application upgrades
3737
- Triggering and managing savepoints
3838
- Handling errors, rolling-back broken upgrades
39-
- Multiple Flink version support: v1.13, v1.14, v1.15, v1.16
39+
- Multiple Flink version support: v1.13, v1.14, v1.15, v1.16, v1.17
4040
- [Deployment Modes]({{< ref "docs/custom-resource/overview#application-deployments" >}}):
4141
- Application cluster
4242
- Session cluster
@@ -52,6 +52,10 @@ Flink Kubernetes Operator aims to capture the responsibilities of a human operat
5252
- POD augmentation via [Pod Templates]({{< ref "docs/custom-resource/pod-template" >}})
5353
- Native Kubernetes POD definitions
5454
- Layering (Base/JobManager/TaskManager overrides)
55+
- [Job Autoscaler]({{< ref "docs/custom-resource/autoscaler" >}})
56+
- Collect lag and utilization metrics
57+
- Scale job vertices to the ideal parallelism
58+
- Scale up and down as the load changes
5559
### Operations
5660
- Operator [Metrics]({{< ref "docs/operations/metrics-logging#metrics" >}})
5761
- Utilizes the well-established [Flink Metric System](https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics)
@@ -101,5 +105,5 @@ drwxr-xr-x 2 9999 9999 60 May 11 15:11 b6fb2a9c-d1cd-4e65-a9a1-e825c4b47543
101105
```
102106

103107
### AuditUtils can log sensitive information present in the custom resources
104-
As reported in [FLINK-30306](https://issues.apache.org/jira/browse/FLINK-30306) when Flink custom resources change the operator logs the change, which could include sensitive information. We suggest ingesting secrets to Flink containers during runtime to mitigate this.
105-
Also note that anyone who has access to the custom resources already had access to the potentially sensitive information in question, but folks who only have access to the logs could also see them now. We are planning to introduce redaction rules to AuditUtils to improve this in a later release.
108+
As reported in [FLINK-30306](https://issues.apache.org/jira/browse/FLINK-30306) when Flink custom resources change the operator logs the change, which could include sensitive information. We suggest ingesting secrets to Flink containers during runtime to mitigate this.
109+
Also note that anyone who has access to the custom resources already had access to the potentially sensitive information in question, but folks who only have access to the logs could also see them now. We are planning to introduce redaction rules to AuditUtils to improve this in a later release.

Diff for: docs/content/docs/custom-resource/job-management.md

+4-1
@@ -98,7 +98,7 @@ The `upgradeMode` setting controls both the stop and restore mechanisms as detai
9898
The three upgrade modes are intended to support different scenarios:
9999

100100
1. **stateless**: Stateless application upgrades from empty state
101-
2. **last-state**: Quick upgrades in any application state (even for failing jobs), does not require a healthy job as it always uses the latest checkpoint information. Manual recovery may be necessary if HA metadata is lost.
101+
2. **last-state**: Quick upgrades in any application state (even for failing jobs). It does not require a healthy job because it always uses the latest checkpoint information. Manual recovery may be necessary if HA metadata is lost. To limit how far back the job may fall when restoring from the latest checkpoint, configure `kubernetes.operator.job.upgrade.last-state.max.allowed.checkpoint.age`. If the checkpoint is older than the configured value, a savepoint is taken instead for healthy jobs.
102102
3. **savepoint**: Use savepoint for upgrade, providing maximal safety and possibility to serve as backup/fork point. The savepoint will be created during the upgrade process. Note that the Flink job needs to be running to allow the savepoint to get created. If the job is in an unhealthy state, the last checkpoint will be used (unless `kubernetes.operator.job.upgrade.last-state-fallback.enabled` is set to `false`). If the last checkpoint is not available, the job upgrade will fail.
103103

104104
During stateful upgrades there are always cases which might require user intervention to preserve the consistency of the application. Please see the [manual Recovery section](#manual-recovery) for details.
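
As a minimal sketch of the checkpoint age limit described for `last-state` upgrades, the fragment below sets the option on a single resource. It assumes the option can be supplied through the resource's `flinkConfiguration`; it may instead belong in the operator's default configuration. The resource name and the 30 minute value are hypothetical, and other required fields are omitted.

```yaml
# Sketch only: resource name, value and placement of the option are assumptions.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-deployment   # hypothetical name
spec:
  flinkConfiguration:
    # Checkpoints older than this are not used for last-state upgrades of healthy jobs;
    # a savepoint is taken instead.
    kubernetes.operator.job.upgrade.last-state.max.allowed.checkpoint.age: "30 min"
  job:
    upgradeMode: last-state
```

A shorter limit reduces how much reprocessing an upgrade can cause, at the cost of taking savepoints more often.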
@@ -214,6 +214,9 @@ Savepoint cleanup happens lazily and only when the application is running.
214214
It is therefore very likely that savepoints live beyond the max age configuration.
215215
{{< /hint >}}
216216
217+
To disable savepoint cleanup by the operator you can set `kubernetes.operator.savepoint.cleanup.enabled: false`.
218+
When savepoint cleanup is disabled the operator will still collect and populate the savepoint history but not perform any dispose operations.
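
A minimal sketch of disabling cleanup for one resource might look like the following. The key and value come from the text above; setting it through `flinkConfiguration` is an assumption, and the resource name is hypothetical.

```yaml
# Sketch only: placement under flinkConfiguration is an assumption.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-deployment   # hypothetical name
spec:
  flinkConfiguration:
    # Savepoint history is still recorded, but no savepoints are disposed.
    kubernetes.operator.savepoint.cleanup.enabled: "false"
```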
219+
217220
## Recovery of missing job deployments
218221
219222
When HA is enabled, the operator can recover the Flink cluster deployments in cases when it was accidentally deleted

Diff for: docs/content/docs/custom-resource/overview.md

+4-5
@@ -87,7 +87,7 @@ Most deployments will define at least the following fields:
8787
- `image` : Docker image used to run Flink job and task manager processes
8888
- `flinkVersion` : Flink version used in the image (`v1_13`, `v1_14`, `v1_15`, `v1_16` ...)
8989
- `serviceAccount` : Kubernetes service account used by the Flink pods
90-
- `taskManager, jobManager` : Job and Task manager pod resource specs (cpu, memory, etc.)
90+
- `taskManager, jobManager` : Job and Task manager pod resource specs (cpu, memory, ephemeralStorage)
9191
- `flinkConfiguration` : Map of Flink configuration overrides such as HA and checkpointing configs
9292
- `job` : Job Spec for Application deployments
9393

@@ -158,7 +158,7 @@ For standard Operator use running your own Flink Jobs Native mode is recommended
158158

159159
Standalone cluster deployment simply uses Kubernetes as an orchestration platform that the Flink cluster is running on. Flink is unaware that it is running on Kubernetes and therefore all Kubernetes resources need to be managed externally, by the Kubernetes Operator.
160160

161-
In Standalone mode the Flink cluster doesn't have access to the Kubernetes cluster so this can increase security. If unknown or external code is being ran on the Flink cluster then Standalone mode adds another layer of security.
161+
In Standalone mode the Flink cluster doesn't have access to the Kubernetes cluster so this can increase security. If unknown or external code is being run on the Flink cluster then Standalone mode adds another layer of security.
162162

163163
The deployment mode can be set using the `mode` field in the deployment spec.
164164

@@ -169,7 +169,7 @@ kind: FlinkDeployment
169169
spec:
170170
...
171171
mode: standalone
172-
172+
173173
174174
```
175175

@@ -212,12 +212,11 @@ COPY flink-hadoop-fs-1.15-SNAPSHOT.jar $FLINK_PLUGINS_DIR/hadoop-fs/
212212

213213
### Limitations
214214

215-
- The LastState UpgradeMode have not been supported.
215+
- Last-state upgradeMode is currently not supported for FlinkSessionJobs
216216

217217
## Further information
218218

219219
- [Job Management and Stateful upgrades]({{< ref "docs/custom-resource/job-management" >}})
220220
- [Deployment customization and pod templates]({{< ref "docs/custom-resource/pod-template" >}})
221221
- [Full Reference]({{< ref "docs/custom-resource/reference" >}})
222222
- [Examples](https://github.com/apache/flink-kubernetes-operator/tree/main/examples)
223-

Diff for: docs/content/docs/custom-resource/pod-template.md

+30
@@ -104,3 +104,33 @@ spec:
104104
When using the operator with Flink native Kubernetes integration, please refer to [pod template field precedence](
105105
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#fields-overwritten-by-flink).
106106
{{< /hint >}}
107+
108+
## Array Merging Behaviour
109+
110+
When layering pod templates (for example, defining both a top-level and a JobManager-specific `podTemplate`), the corresponding YAML documents are merged together.
111+
112+
The default behaviour of the pod template mechanism is to merge arrays by merging the objects in the respective array positions.
113+
This requires that containers in the podTemplates are defined in the same order; otherwise the results may be undefined.
114+
115+
Default behaviour (merge by position):
116+
117+
```
118+
arr1: [{name: a, p1: v1}, {name: b, p1: v1}]
119+
arr2: [{name: a, p2: v2}, {name: c, p2: v2}]
120+
121+
merged: [{name: a, p1: v1, p2: v2}, {name: c, p1: v1, p2: v2}]
122+
```
123+
124+
The operator supports an alternative array merging mechanism that can be enabled by the `kubernetes.operator.pod-template.merge-arrays-by-name` flag.
125+
When true, instead of the default positional merging, object array elements that have a `name` property defined will be merged by their name and the resulting array will be a union of the two input arrays.
126+
127+
Merge by name:
128+
129+
```
130+
arr1: [{name: a, p1: v1}, {name: b, p1: v1}]
131+
arr2: [{name: a, p2: v2}, {name: c, p2: v2}]
132+
133+
merged: [{name: a, p1: v1, p2: v2}, {name: b, p1: v1}, {name: c, p2: v2}]
134+
```
135+
136+
Merging by name can be very convenient when merging container specs or when the base and override templates are not defined together.
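
To make the container case concrete, here is a trimmed sketch of a layered template with merge-by-name enabled. The sidecar names, images and environment variables are hypothetical, other required fields are omitted, and setting the flag through `flinkConfiguration` is an assumption (it may instead belong in the operator's default configuration).

```yaml
# Sketch only: names, images and placement of the flag are assumptions.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: merge-by-name-example          # hypothetical name
spec:
  flinkConfiguration:
    kubernetes.operator.pod-template.merge-arrays-by-name: "true"
  podTemplate:                         # base template
    spec:
      containers:
        - name: flink-main-container
          env:
            - name: BASE_SETTING       # hypothetical variable
              value: "1"
        - name: log-sidecar            # hypothetical sidecar
          image: example/log-sidecar:latest
  jobManager:
    podTemplate:                       # JobManager override, listed in a different order
      spec:
        containers:
          - name: metrics-sidecar      # hypothetical sidecar
            image: example/metrics-sidecar:latest
          - name: flink-main-container
            env:
              - name: JM_SETTING       # hypothetical variable
                value: "2"
```

Under the merge-by-name rule described above, the two `flink-main-container` entries would be combined and both sidecars kept in the JobManager pod. With the default positional merge, `metrics-sidecar` would instead be merged into whatever sits at index 0 of the base array, which is why that mode requires a consistent container order.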

Diff for: docs/content/docs/development/roadmap.md

+2-2
@@ -31,6 +31,6 @@ It's not a comprehensive list and might be slightly outdated at any given time.
3131

3232
## What’s Next?
3333

34-
- Standalone deployment mode support [FLIP-225](https://cwiki.apache.org/confluence/display/FLINK/FLIP-225%3A+Implement+standalone+mode+support+in+the+kubernetes+operator)
35-
- Improved scaling and autoscaling support
3634
- Improved rollback mechanism and stability conditions
35+
- Autoscaler hardening and improvements
36+
- Support for in-place job rescaling with Flink 1.18

Diff for: docs/content/docs/operations/health.md

+69
@@ -0,0 +1,69 @@
1+
---
2+
title: "Operator Health Monitoring"
3+
weight: 3
4+
type: docs
5+
aliases:
6+
- /operations/health.html
7+
---
8+
<!--
9+
Licensed to the Apache Software Foundation (ASF) under one
10+
or more contributor license agreements. See the NOTICE file
11+
distributed with this work for additional information
12+
regarding copyright ownership. The ASF licenses this file
13+
to you under the Apache License, Version 2.0 (the
14+
"License"); you may not use this file except in compliance
15+
with the License. You may obtain a copy of the License at
16+
17+
http://www.apache.org/licenses/LICENSE-2.0
18+
19+
Unless required by applicable law or agreed to in writing,
20+
software distributed under the License is distributed on an
21+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
22+
KIND, either express or implied. See the License for the
23+
specific language governing permissions and limitations
24+
under the License.
25+
-->
26+
27+
# Operator Health Monitoring
28+
29+
## Health Probe
30+
31+
The Flink Kubernetes Operator provides a built-in health endpoint that serves as the information source for Kubernetes liveness and startup probes.
32+
33+
The liveness and startup probes are enabled by default in the Helm chart:
34+
35+
```
36+
operatorHealth:
37+
port: 8085
38+
livenessProbe:
39+
periodSeconds: 10
40+
initialDelaySeconds: 30
41+
startupProbe:
42+
failureThreshold: 30
43+
periodSeconds: 10
44+
```
45+
46+
The health endpoint catches startup and informer errors that are exposed by the JOSDK framework. By default, if one of the watched namespaces becomes inaccessible, the health endpoint reports an error and the operator restarts.
47+
48+
In some cases it is desirable to keep the operator running even if some namespaces are inaccessible. To allow the operator to start even if some namespaces cannot be watched, you can disable the `kubernetes.operator.startup.stop-on-informer-error` flag.
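
One possible way to disable the flag is through the operator's default configuration in the Helm values. This is a sketch that assumes the chart exposes a `defaultConfiguration` section with a `flink-conf.yaml` entry; check the chart's own `values.yaml` before relying on it.

```yaml
# Sketch only: the defaultConfiguration layout is an assumption about the Helm chart.
defaultConfiguration:
  create: true
  append: true
  flink-conf.yaml: |+
    # Keep the operator running even if some watched namespaces cannot be accessed.
    kubernetes.operator.startup.stop-on-informer-error: false
```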
49+
50+
## Canary Resources
51+
52+
The canary resource feature allows users to deploy special dummy resources (canaries) into selected namespaces. The operator health probe then monitors that these resources are reconciled in a timely manner. This allows the health probe to catch slowdowns and other general reconciliation issues that are not otherwise covered.
53+
54+
Canary deployments are identified by a special label: `"flink.apache.org/canary": "true"`. These resources do not need to define a spec, and they will not start any pods or consume other cluster resources. They exist purely to assert the operator reconciliation functionality.
55+
56+
Canary FlinkDeployment:
57+
58+
```
59+
apiVersion: flink.apache.org/v1beta1
60+
kind: FlinkDeployment
61+
metadata:
62+
name: canary
63+
labels:
64+
"flink.apache.org/canary": "true"
65+
```
66+
67+
The default timeout for reconciling the canary resources is 1 minute, controlled by `kubernetes.operator.health.canary.resource.timeout`. If the operator cannot reconcile the canaries within this time limit, it is marked unhealthy and automatically restarted.
68+
69+
Canaries can be deployed into multiple namespaces.
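
As an illustration of a multi-namespace setup, the same canary definition could be applied once per watched namespace. The namespace names below are hypothetical.

```yaml
# Sketch only: namespace names are hypothetical.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: canary
  namespace: team-a
  labels:
    "flink.apache.org/canary": "true"
---
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: canary
  namespace: team-b
  labels:
    "flink.apache.org/canary": "true"
```

Placing one canary in each namespace means the health probe exercises reconciliation in every namespace the operator is expected to serve.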
