Commit 724d653

Merge pull request #809 from gnufied/high-level-volume-metrics
Add a proposal for high level volume metrics
2 parents d8c0889 + 4a80e58 commit 724d653

1 file changed

+141
-0
lines changed

@@ -0,0 +1,141 @@
# Volume operation metrics

## Goal

Capture high-level metrics for various volume operations in Kubernetes.

## Motivation

Currently we don't have high-level metrics that capture the time taken
by, and the success/failure rates of, various volume operations.

This proposal aims to capture these metrics at a level
higher than individual volume plugins.

## Implementation

### Metric format and collection

The volume metrics emitted will fall under the category of service metrics
as defined in [Kubernetes Monitoring Architecture](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/monitoring_architecture.md).

The metrics will be emitted in the [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) and will be available for collection
from the `/metrics` HTTP endpoint of the kubelet and the controller-manager.

Any collector that can parse the Prometheus metric format should be able to collect
metrics from these endpoints.

A more detailed description of the monitoring pipeline can be found in the [Monitoring architecture](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/monitoring_architecture.md#monitoring-pipeline) document.

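As a rough illustration of what a collector does with such an endpoint, the sketch below fetches `/metrics` and keeps the sample lines of one metric. It is not part of the proposal: `filterSamples`, `samplePayload`, and the sample values are hypothetical, and an `httptest` server stands in for the kubelet or controller-manager.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// samplePayload is made-up data in the Prometheus text exposition format.
const samplePayload = "# TYPE storage_operation_errors_total counter\n" +
	"storage_operation_errors_total{volume_plugin=\"aws-ebs\",operation_name=\"volume_attach\"} 3\n" +
	"storage_operation_errors_total{volume_plugin=\"gce-pd\",operation_name=\"volume_delete\"} 1\n"

// filterSamples returns the sample lines of payload that belong to the
// metric called name. Comment lines ("# HELP", "# TYPE") never match.
func filterSamples(payload, name string) []string {
	var samples []string
	for _, line := range strings.Split(payload, "\n") {
		if strings.HasPrefix(line, name+"{") || strings.HasPrefix(line, name+" ") {
			samples = append(samples, line)
		}
	}
	return samples
}

func main() {
	// Assumption: this test server plays the role of the real endpoint.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, samplePayload)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for _, s := range filterSamples(string(body), "storage_operation_errors_total") {
		fmt.Println(s)
	}
}
```

A real collector would parse label sets and values rather than match line prefixes; the prefix match is only meant to show that the endpoint serves plain text that is easy to consume.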

### Metric Types

Since we are interested in the count (or rate) of volume operations and the time they take, we will use the [Histogram](https://prometheus.io/docs/practices/histograms/) type for
emitting these metrics.

We will use the `HistogramVec` type so that we can attach dimensions at runtime. All
the volume operation metrics will be named `storage_operation_duration_seconds`.
The name of the operation and the volume plugin's name will be emitted as dimensions. If for some reason
the volume plugin's name is not available when the operation is performed, the label's value can be set
to `<n/a>`.

We are also interested in the count of volume operation failures, so a metric of type `CounterVec`
will be used to keep track of errors. The error metric will be similarly named `storage_operation_errors_total`.

The following is a sample of the metrics (not exhaustive) that will be added by this proposal:

```
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_attach" }
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_detach" }
storage_operation_duration_seconds { volume_plugin = "glusterfs", operation_name = "provision" }
storage_operation_duration_seconds { volume_plugin = "gce-pd", operation_name = "volume_delete" }
storage_operation_duration_seconds { volume_plugin = "vsphere", operation_name = "volume_mount" }
storage_operation_duration_seconds { volume_plugin = "iscsi", operation_name = "volume_unmount" }
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "mount_device" }
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "unmount_device" }
storage_operation_duration_seconds { volume_plugin = "cinder", operation_name = "verify_volume" }
```

Similarly, the error metrics will be named:

```
storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "volume_attach" }
storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "volume_detach" }
storage_operation_errors_total { volume_plugin = "glusterfs", operation_name = "provision" }
storage_operation_errors_total { volume_plugin = "gce-pd", operation_name = "volume_delete" }
storage_operation_errors_total { volume_plugin = "vsphere", operation_name = "volume_mount" }
storage_operation_errors_total { volume_plugin = "iscsi", operation_name = "volume_unmount" }
storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "mount_device" }
storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "unmount_device" }
storage_operation_errors_total { volume_plugin = "cinder", operation_name = "verify_volume" }
```

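To make concrete why a histogram yields both a count and a duration distribution, here is a minimal model of a Prometheus-style histogram with cumulative buckets. The `histogram` type and the bucket bounds are hypothetical and chosen only for illustration; the proposal itself relies on the Prometheus client's `HistogramVec`, not on this code.

```go
package main

import "fmt"

// histogram is a hypothetical, minimal model of how a Prometheus-style
// histogram is exposed: cumulative bucket counts plus a sum and a count.
type histogram struct {
	bounds  []float64 // bucket upper bounds in seconds, ascending
	buckets []uint64  // buckets[i] counts observations <= bounds[i]
	sum     float64   // total of all observed durations
	count   uint64    // number of observations
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, buckets: make([]uint64, len(bounds))}
}

// observe records one operation duration, given in seconds.
func (h *histogram) observe(seconds float64) {
	for i, bound := range h.bounds {
		if seconds <= bound {
			h.buckets[i]++
		}
	}
	h.sum += seconds
	h.count++
}

func main() {
	h := newHistogram(0.5, 1, 5) // illustrative bounds, not from the proposal
	for _, d := range []float64{0.2, 0.7, 3, 12} {
		h.observe(d)
	}
	fmt.Println(h.buckets, h.count) // prints: [1 2 3] 4
}
```

The count gives the number (and, over time, the rate) of operations, while the buckets and the sum describe how long they took, which is why a single histogram metric per label pair covers both needs.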

### Implementation Detail

We propose the following changes as part of the implementation.

1. All volume operations are executed via `goroutinemap.Run` or `nestedpendingoperations.Run`.
The `Run` function of these two types can be changed to accept an additional `opComplete` callback argument.

For example:

```go
// nestedpendingoperations.go
Run(volumeName v1.UniqueVolumeName, podName types.UniquePodName, operationFunc func() error, opComplete func(error)) error

// goroutinemap.go
Run(operationName string, operationFunc func() error, opComplete func(error)) error
```

This will let us know when a volume operation is complete.

2. All `GenerateXXX` functions in `operation_generator.go` should return the plugin name in addition to the function and the error.

For example:

```go
GenerateMountVolumeFunc(
	waitForAttachTimeout time.Duration,
	volumeToMount VolumeToMount,
	actualStateOfWorldMounterUpdater ActualStateOfWorldMounterUpdater,
	isRemount bool) (mountFunc func() error, pluginName string, err error)
```

Similarly, `pv_controller.scheduleOperation` will take the plugin name as an additional parameter:

```go
func (ctrl *PersistentVolumeController) scheduleOperation(
	operationName string,
	pluginName string,
	operation func() error)
```

3. The above changes will enable us to gather the required metrics in `operation_executor` or when scheduling an operation in the
pv controller.

For example, metrics for the time it takes to attach a volume can be captured via:

```go
// operationExecutorHook records the request time and returns a callback
// that is invoked with the operation's result once it completes.
func operationExecutorHook(plugin, operationName string) func(error) {
	requestTime := time.Now()
	opComplete := func(err error) {
		timeTaken := time.Since(requestTime).Seconds()
		// Create metric with operation name and plugin name,
		// using timeTaken and err.
	}
	return opComplete
}

attachFunc, plugin, err :=
	oe.operationGenerator.GenerateAttachVolumeFunc(volumeToAttach, actualStateOfWorld)
if err != nil {
	return err
}
opCompleteFunc := operationExecutorHook(plugin, "volume_attach")
return oe.pendingOperations.Run(
	volumeToAttach.VolumeName, "" /* podName */, attachFunc, opCompleteFunc)
```

`operationExecutorHook` is a hook registered in `operation_executor`. It
initializes the necessary metric parameters and returns a function. That function is called when the
operation is complete; it finalizes metric creation and emits the metrics.

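The three changes above can be tied together in a self-contained sketch. Everything here is a hypothetical stand-in: `recorder` replaces the proposed `HistogramVec` and `CounterVec` with in-memory maps, and `executor` is a much simplified substitute for `goroutinemap` that does not reject duplicate operations. Recording a duration only for successful operations is an assumption of this sketch, not something the proposal specifies.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errTimeout is a made-up failure used by the demo below.
var errTimeout = errors.New("attach timed out")

// opKey is one (volume_plugin, operation_name) label combination.
type opKey struct{ plugin, operation string }

// recorder is a hypothetical in-memory stand-in for the proposed metrics.
type recorder struct {
	mu        sync.Mutex
	durations map[opKey][]float64 // seconds; successful operations only (assumption)
	failures  map[opKey]int       // number of failed operations
}

func newRecorder() *recorder {
	return &recorder{durations: map[opKey][]float64{}, failures: map[opKey]int{}}
}

// hook notes the request time and returns the completion callback.
func (r *recorder) hook(plugin, operation string) func(error) {
	if plugin == "" {
		plugin = "<n/a>" // plugin name not available for this operation
	}
	requestTime := time.Now()
	return func(err error) {
		k := opKey{plugin, operation}
		r.mu.Lock()
		defer r.mu.Unlock()
		if err != nil {
			r.failures[k]++
			return
		}
		r.durations[k] = append(r.durations[k], time.Since(requestTime).Seconds())
	}
}

// executor runs each operation in its own goroutine and reports the
// result to opComplete, mirroring the proposed extra Run argument.
type executor struct{ wg sync.WaitGroup }

func (e *executor) run(op func() error, opComplete func(error)) {
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		opComplete(op())
	}()
}

// wait blocks until every started operation has completed.
func (e *executor) wait() { e.wg.Wait() }

func main() {
	rec := newRecorder()
	ex := &executor{}
	ex.run(func() error { return nil }, rec.hook("aws-ebs", "volume_attach"))
	ex.run(func() error { return errTimeout }, rec.hook("aws-ebs", "volume_attach"))
	ex.wait()

	k := opKey{"aws-ebs", "volume_attach"}
	fmt.Println("successes:", len(rec.durations[k]), "failures:", rec.failures[k])
	// prints: successes: 1 failures: 1
}
```

Because the hook closes over the request time and the labels, the code that runs the operation needs to know nothing about metrics; it only has to call the callback with the operation's result.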

### Conclusion

Collecting metrics at the operation level requires almost no change to the volume plugin interface and only minimal changes to the controllers.
