keps/sig-instrumentation/647-apiserver-tracing/README.md (+29 -5)
@@ -71,7 +71,24 @@ Along with metrics and logs, traces are a useful form of telemetry to aid with d
We will wrap the API Server's http server and http clients with [otelhttp](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/master/instrumentation/net/http/otelhttp) to get spans for incoming and outgoing http requests. This generates spans for all sampled incoming requests and propagates context with all client requests. For incoming requests, this would go below [WithRequestInfo](https://github.com/kubernetes/kubernetes/blob/9eb097c4b07ea59c674a69e19c1519f0d10f2fa8/staging/src/k8s.io/apiserver/pkg/server/config.go#L676) in the filter stack, as it must be after authentication and authorization, before the panic filter, and is closest in function to the WithRequestInfo filter.
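For illustration only, a minimal sketch of what wrapping an http client with otelhttp can look like; the package path is the real otelhttp contrib module, but the target URL and program structure are placeholders, not taken from the KEP:

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// Use W3C TraceContext headers; the global propagator is a no-op unless set.
	otel.SetTextMapPropagator(propagation.TraceContext{})

	// Wrapping the transport injects the span context from each request's
	// context into the outgoing request headers.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	// Requests must be created with a context for propagation to work.
	req, err := http.NewRequestWithContext(context.Background(), http.MethodGet,
		"https://example.com/healthz", nil) // placeholder URL
	if err != nil {
		return
	}
	if resp, err := client.Do(req); err == nil {
		resp.Body.Close()
	}
}
```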
- Note that some clients of the API Server, such as webhooks, may make reentrant calls to the API Server. To gain the full benefit of tracing, such clients should propagate context with requests back to the API Server.
+ Note that some clients of the API Server, such as webhooks, may make reentrant calls to the API Server. To gain the full benefit of tracing, such clients should propagate context with requests back to the API Server. One way to do this is to wrap the webhook's http server using otelhttp and use the request's context when making requests to the API Server.
+
+ **Webhook Example**
+
+ Wrapping the http server, which ensures context is propagated from http headers to the request's context:
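A minimal sketch of such a server wrapper, assuming the otelhttp contrib package; the handler path, port, and function names are illustrative, not the KEP's own example:

```go
package main

import (
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// handleAdmission is a stand-in for the webhook's admission logic.
func handleAdmission(w http.ResponseWriter, r *http.Request) {
	// r.Context() carries the span context extracted from the incoming headers.
	// Pass it along when calling back into the API Server so the reentrant
	// request joins the original trace.
	_ = r.Context()
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Extract W3C TraceContext headers into each request's context.
	otel.SetTextMapPropagator(propagation.TraceContext{})

	mux := http.NewServeMux()
	mux.HandleFunc("/validate", handleAdmission)

	// otelhttp.NewHandler wraps the whole mux, so every incoming request gets
	// its propagated context (and a span, once an SDK is registered).
	// A real admission webhook would serve TLS instead of plain HTTP.
	log.Fatal(http.ListenAndServe(":8443", otelhttp.NewHandler(mux, "admission-webhook")))
}
```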
+ Note: Even though the admission controller uses the otelhttp handler wrapper, that does _not_ mean it will emit spans. OpenTelemetry has a concept of an SDK, which manages the exporting of telemetry. If no SDK is registered, the NoOp SDK is used, which only propagates context and does not export spans. In the webhook case, in which no SDK is registered, the reentrant API call would appear to be a direct child of the original API call. If the webhook registers an SDK and exports spans, there would be an additional span from the webhook between the original and reentrant API Server calls.
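For reference, a hedged sketch of what registering an SDK in the webhook could look like, using the OpenTelemetry Go SDK and an OTLP gRPC exporter; the endpoint and service naming are placeholders, and the KEP does not prescribe this exact setup:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// Export spans over OTLP/gRPC to a collector; "localhost:4317" is the
	// conventional OTLP port and is only a placeholder here.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	// Registering a real TracerProvider replaces the default no-op provider,
	// so the otelhttp handler starts emitting its own spans.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	tp, err := initTracing(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(context.Background())
	// ... start the otelhttp-wrapped webhook server here ...
}
```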
+ Note: OpenTelemetry has a concept of ["Baggage"](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/baggage/api.md#baggage-api), which is akin to annotations for propagated context. If there is any additional metadata we would like to attach and propagate along with a request, we can do that using Baggage.
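For illustration only (the KEP does not define any baggage keys), setting and reading a baggage entry with the OpenTelemetry Go API might look like the following; note that propagating it over HTTP additionally requires registering a Baggage propagator:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/baggage"
)

func main() {
	ctx := context.Background()

	// Attach a key/value pair to the context; "request.source" is a made-up
	// key used only for this sketch.
	member, err := baggage.NewMember("request.source", "admission-webhook")
	if err != nil {
		return
	}
	bag, err := baggage.New(member)
	if err != nil {
		return
	}
	ctx = baggage.ContextWithBaggage(ctx, bag)

	// Anywhere downstream, the value can be read back from the context.
	fmt.Println(baggage.FromContext(ctx).Member("request.source").Value())
}
```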
### Exporting Spans
@@ -81,7 +98,7 @@ The API Server will use the [OpenTelemetry exporter format](https://github.com/o
### Running the OpenTelemetry Collector
- The [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector) can be run as a sidecar, a daemonset, a deployment , or a combination in which the daemonset buffers telemetry and forwards to the deployment for aggregation (e.g. tail-base sampling) and routing to a telemetry backend. To support these various setups, the API Server should be able to send traffic either to a local (on the master) collector, or to a cluster service (in the cluster).
+ The [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector) can be run as a sidecar, a daemonset, a deployment, or a combination in which the daemonset buffers telemetry and forwards it to the deployment for aggregation (e.g. tail-based sampling) and routing to a telemetry backend. To support these various setups, the API Server should be able to send traffic either to a local (on the control plane network) collector or to a cluster service (on the cluster network).
### APIServer Configuration and EgressSelectors
@@ -96,12 +113,12 @@ type OpenTelemetryClientConfiguration struct {
// +optional
// URL of the collector that's running on the master.
- // if URL is specified, APIServer uses the egressType Master when sending tracing data to the collector.
+ // if URL is specified, APIServer uses the egressType Master when sending data to the collector.
// Service that's the frontend of the collector deployment running in the cluster.
- // If Service is specified, APIServer uses the egressType Cluster when sending tracing data to the collector.
+ // If Service is specified, APIServer uses the egressType Cluster when sending data to the collector.
Service *ServiceReference `json:"service,omitempty" protobuf:"bytes,1,opt,name=service"`
}
@@ -122,6 +139,8 @@ type ServiceReference struct {
}
```
+ If `--opentelemetry-config-file` is not specified, the API Server will not send any telemetry.
+
### Controlling use of the OpenTelemetry library
As the community found in the [Metrics Stability Framework KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20190404-kubernetes-control-plane-metrics-stability.md#kubernetes-control-plane-metrics-stability), having control over how the client libraries are used in kubernetes can enable maintainers to enforce policy and make broad improvements to the quality of telemetry. To enable future improvements to tracing, we will restrict the direct use of the OpenTelemetry library within the kubernetes code base, and provide wrapped versions of functions we wish to expose in a utility library.
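As a purely illustrative sketch (the KEP does not define this API), such a utility library could expose a thin wrapper so that other packages never import OpenTelemetry directly:

```go
// Package tracing is a hypothetical utility package; its name and API are
// assumptions made for this sketch, not something defined by the KEP.
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// Start begins a span via the globally registered provider. Because callers
// depend only on this package, the project can later change sampling,
// attribute policy, or the library version in one place.
func Start(ctx context.Context, name string) (context.Context, trace.Span) {
	return otel.Tracer("k8s.io/apiserver").Start(ctx, name)
}
```

A caller would then write `ctx, span := tracing.Start(ctx, "Authorize")` followed by `defer span.End()` instead of using the OpenTelemetry packages itself.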
@@ -143,6 +162,11 @@ Beta
- [ ] OpenTelemetry reaches GA
- [ ] Publish examples of how to use the OT Collector with kubernetes
- [ ] Allow time for feedback
+ - [ ] Revisit the format used to export spans.
+
+ GA
+
+ - [ ] Tracing e2e tests are promoted to conformance tests
## Production Readiness Survey
@@ -199,7 +223,7 @@ Beta
- What are the known failure modes? **The API Server is misconfigured, and cannot talk to the collector. The collector is misconfigured, and can't send traces to the backend.**
- How can those be detected via metrics or logs? Logs from the component or agent based on the failure mode.
- What are the mitigations for each of those failure modes? **None. You must correctly configure the collector for tracing to work.**
- - What are the most useful log messages and what logging levels do they require? **All errors are useful, and are logged as errors (no logging levels required). Failure to initialize exporters (in both controller and collector), failures exporting metrics are the most useful.**
+ - What are the most useful log messages and what logging levels do they require? **All errors are useful, and are logged as errors (no logging levels required). Failure to initialize exporters (in both controller and collector) and failures exporting metrics are the most useful. Errors are logged for each failed attempt to establish a connection to the collector.**
- What steps should be taken if SLOs are not being met to determine the
problem? **Look at API Server and collector logs.**