Commit 6391aa4

Author: Traci Morrison
Merge pull request #7579 from tmorriso-rh/Trello-storage-prometheus-endpoint-coverage

Trello card: Added Kubernetes Storage Metrics via Prometheus section

2 parents 513488c + ea071dc, commit 6391aa4

File tree

1 file changed: +143 additions, -28 deletions


install_config/cluster_metrics.adoc

Lines changed: 143 additions & 28 deletions
@@ -37,6 +37,10 @@ each node individually through the `/stats` endpoint. From there, Heapster
 scrapes the metrics for CPU, memory and network usage, then exports them into
 Hawkular Metrics.
 
+The storage volume metrics available on the kubelet are not available through
+the `/stats` endpoint, but are available through the `/metrics` endpoint. See
+{product-title} Metrics via Prometheus for detailed information.
+
 Browsing individual pods in the web console displays separate sparkline charts
 for memory and CPU. The time range displayed is selectable, and these charts
 automatically update every 30 seconds. If there are multiple containers on the
@@ -63,7 +67,7 @@ previous to v1.0.8, even if it has since been updated to a newer version, follow
 the instructions for node certificates outlined in
 xref:../install_config/upgrading/manual_upgrades.adoc#manual-updating-master-and-node-certificates[Updating
 Master and Node Certificates]. If the node certificate does not contain the IP
-address of the node, then Heapster will fail to retrieve any metrics.
+address of the node, then Heapster fails to retrieve any metrics.
 ====
 endif::[]
 
@@ -102,9 +106,9 @@ volume].
 === Persistent Storage
 
 Running {product-title} cluster metrics with persistent storage means that your
-metrics will be stored to a
+metrics are stored to a
 xref:../architecture/additional_concepts/storage.adoc#persistent-volumes[persistent
-volume] and be able to survive a pod being restarted or recreated. This is ideal
+volume] and are able to survive a pod being restarted or recreated. This is ideal
 if you require your metrics data to be guarded from data loss. For production
 environments it is highly recommended to configure persistent storage for your
 metrics pods.
@@ -205,7 +209,7 @@ storage space as a buffer for unexpected monitored pod usage.
 [WARNING]
 ====
 If the Cassandra persisted volume runs out of sufficient space, then data loss
-will occur.
+occurs.
 ====
 
 For cluster metrics to work with persistent storage, ensure that the persistent
@@ -245,7 +249,7 @@ metrics-gathering solutions.
 === Non-Persistent Storage
 
 Running {product-title} cluster metrics with non-persistent storage means that
-any stored metrics will be deleted when the pod is deleted. While it is much
+any stored metrics are deleted when the pod is deleted. While it is much
 easier to run cluster metrics with non-persistent data, running with
 non-persistent data does come with the risk of permanent data loss. However,
 metrics can still survive a container being restarted.
@@ -257,16 +261,16 @@ to `emptyDir` in the inventory file.
 
 [NOTE]
 ====
-When using non-persistent storage, metrics data will be written to
+When using non-persistent storage, metrics data is written to
 *_/var/lib/origin/openshift.local.volumes/pods_* on the node where the Cassandra
-pod is running. Ensure *_/var_* has enough free space to accommodate metrics
+pod runs. Ensure *_/var_* has enough free space to accommodate metrics
 storage.
 ====
 
 [[metrics-ansible-role]]
 == Metrics Ansible Role
 
-The OpenShift Ansible `openshift_metrics` role configures and deploys all of the
+The {product-title} Ansible `openshift_metrics` role configures and deploys all of the
 metrics components using the variables from the
 xref:../install_config/install/advanced_install.adoc#configuring-ansible[Configuring
 Ansible] inventory file.
@@ -445,7 +449,7 @@ Technology Preview and is not installed by default.
 
 [NOTE]
 ====
-The Hawkular OpenShift Agent on {product-title} is a Technology Preview feature
+The Hawkular {product-title} Agent on {product-title} is a Technology Preview feature
 only.
 ifdef::openshift-enterprise[]
 Technology Preview features are not
@@ -479,7 +483,7 @@ that it does not become full.
 
 [WARNING]
 ====
-Data loss will result if the Cassandra persisted volume runs out of sufficient space.
+Data loss results if the Cassandra persisted volume runs out of sufficient space.
 ====
 
 All of the other variables are optional and allow for greater customization.
@@ -500,8 +504,8 @@ running.
 [[metrics-using-secrets]]
 === Using Secrets
 
-The OpenShift Ansible `openshift_metrics` role will auto-generate self-signed certificates for use between its
-components and will generate a
+The {product-title} Ansible `openshift_metrics` role auto-generates self-signed certificates for use between its
+components and generates a
 xref:../architecture/networking/routes.adoc#secured-routes[re-encrypting route] to expose
 the Hawkular Metrics service. This route is what allows the web console to access the Hawkular Metrics
 service.
@@ -510,14 +514,14 @@ In order for the browser running the web console to trust the connection through
 this route, it must trust the route's certificate. This can be accomplished by
 xref:metrics-using-secrets-byo-certs[providing your own certificates] signed by
 a trusted Certificate Authority. The `openshift_metrics` role allows you to
-specify your own certificates which it will then use when creating the route.
+specify your own certificates, which it then uses when creating the route.
 
 The router's default certificate are used if you do not provide your own.
 
 [[metrics-using-secrets-byo-certs]]
 ==== Providing Your Own Certificates
 
-To provide your own certificate which will be used by the
+To provide your own certificate, which is used by the
 xref:../architecture/networking/routes.adoc#secured-routes[re-encrypting
 route], you can set the `openshift_metrics_hawkular_cert`,
 `openshift_metrics_hawkular_key`, and `openshift_metrics_hawkular_ca`
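As a sketch of how these three variables fit together, they could be set in the Ansible inventory like so. The variable names are from this section; the `[OSEv3:vars]` group name and the file paths are hypothetical examples only:

```ini
# Hypothetical inventory entries for supplying your own route certificates.
# Replace the paths with the locations of your actual certificate files.
[OSEv3:vars]
openshift_metrics_hawkular_cert=/path/to/hawkular-metrics.crt
openshift_metrics_hawkular_key=/path/to/hawkular-metrics.key
openshift_metrics_hawkular_ca=/path/to/ca.crt
```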
@@ -536,7 +540,7 @@ route documentation].
 == Deploying the Metric Components
 
 Because deploying and configuring all the metric components is handled with
-OpenShift Ansible, you can deploy everything in one step.
+{product-title} Ansible, you can deploy everything in one step.
 
 The following examples show you how to deploy metrics with and without
 persistent storage using the default parameters.
@@ -619,8 +623,7 @@ For example, if your `openshift_metrics_hawkular_hostname` corresponds to
 Once you have updated and saved the *_master-config.yaml_* file, you must
 restart your {product-title} instance.
 
-When your {product-title} server is back up and running, metrics will be
-displayed on the pod overview pages.
+When your {product-title} server is back up and running, metrics are displayed on the pod overview pages.
 
 [CAUTION]
 ====
@@ -642,16 +645,16 @@ Metrics API].
 
 [NOTE]
 ====
-When accessing Hawkular Metrics from the API, you will only be able to perform
-reads. Writing metrics has been disabled by default. If you want for individual
+When accessing Hawkular Metrics from the API, you are only able to perform
+reads. Writing metrics is disabled by default. If you want individual
 users to also be able to write metrics, you must set the
 `openshift_metrics_hawkular_user_write_access`
 xref:../install_config/cluster_metrics.adoc#metrics-ansible-variables[variable]
 to *true*.
 
 However, it is recommended to use the default configuration and only have
 metrics enter the system via Heapster. If write access is enabled, any user
-will be able to write metrics to the system, which can affect performance and
+can write metrics to the system, which can affect performance and
 cause Cassandra disk usage to unpredictably increase.
 ====
 
@@ -676,7 +679,7 @@ privileges to access.
 [[cluster-metrics-authorization]]
 === Authorization
 
-The Hawkular Metrics service will authenticate the user against {product-title}
+The Hawkular Metrics service authenticates the user against {product-title}
 to determine if the user has access to the project it is trying to access.
 
 Hawkular Metrics accepts a bearer token from the client and verifies that token
@@ -692,8 +695,8 @@ ifdef::openshift-origin[]
 [[cluster-metrics-accessing-heapster-directly]]
 == Accessing Heapster Directly
 
-Heapster has been configured to be only accessible via the API proxy.
-Accessing it will required either a cluster-reader or cluster-admin privileges.
+Heapster is configured to only be accessible via the API proxy. Accessing
+Heapster requires either cluster-reader or cluster-admin privileges.
 
 For example, to access the Heapster *validate* page, you need to access it
 using something similar to:
@@ -718,8 +721,8 @@ Performance Guide].
 == Integration with Aggregated Logging
 
 Hawkular Alerts must be connected to the Aggregated Logging's Elasticsearch to
-react on log events. By default, Hawkular will try to find Elasticsearch on its
-default place (namespace `logging`, pod `logging-es`) at every boot. If the
+react to log events. By default, Hawkular tries to find Elasticsearch in its
+default place (namespace `logging`, pod `logging-es`) at every boot. If
 Aggregated Logging is installed after Hawkular, the Hawkular Metrics pod might
 need to be restarted in order to recognize the new Elasticsearch server. The
 Hawkular boot log provides a clear indication if the integration could not be
@@ -754,7 +757,7 @@ available.
 [[metrics-cleanup]]
 == Cleanup
 
-You can remove everything deployed by the OpenShift Ansible `openshift_metrics` role
+You can remove everything deployed by the {product-title} Ansible `openshift_metrics` role
 by performing the following steps:
 
 ----
@@ -771,7 +774,7 @@ system resources.
 
 [IMPORTANT]
 ====
-Prometheus on OpenShift is a Technology Preview feature only.
+Prometheus on {product-title} is a Technology Preview feature only.
 ifdef::openshift-enterprise[]
 Technology Preview features are not supported with Red Hat production service
 level agreements (SLAs), might not be functionally complete, and Red Hat does
@@ -912,7 +915,7 @@ The Prometheus server automatically exposes a Web UI at `localhost:9090`. You
 can access the Prometheus Web UI with the `view` role.
 
 [[openshift-prometheus-config]]
-==== Configuring Prometheus for OpenShift
+==== Configuring Prometheus for {product-title}
 //
 // Example Prometheus rules file:
 // ----
@@ -1031,6 +1034,118 @@ Once `openshift_metrics_project: openshift-infra` is installed, metrics can be
 gathered from the `http://${POD_IP}:7575/metrics` endpoint.
 ====
 
+[[openshift-prometheus-kubernetes-metrics]]
+=== {product-title} Metrics via Prometheus
+
+The state of a system can be gauged by the metrics that it emits. This section
+describes current and proposed metrics that identify the health of the storage
+subsystem and cluster.
+
+[[k8s-current-metrics]]
+==== Current Metrics
+
+This section describes the metrics currently emitted from the Kubernetes storage subsystem.
+
+*Cloud Provider API Call Metrics*
+
+These metrics report the time and count of successes and failures of all
+cloud provider API calls. They include `aws_attach_time` and
+`aws_detach_time`. Because these metrics are emitted as histograms, Prometheus
+also generates sum, count, and bucket metrics for them.
+
+.Example summary of cloud provider metrics from GCE
+----
+cloudprovider_gce_api_request_duration_seconds { request = "instance_list"}
+cloudprovider_gce_api_request_duration_seconds { request = "disk_insert"}
+cloudprovider_gce_api_request_duration_seconds { request = "disk_delete"}
+cloudprovider_gce_api_request_duration_seconds { request = "attach_disk"}
+cloudprovider_gce_api_request_duration_seconds { request = "detach_disk"}
+cloudprovider_gce_api_request_duration_seconds { request = "list_disk"}
+----
+
+.Example summary of cloud provider metrics from AWS
+----
+cloudprovider_aws_api_request_duration_seconds { request = "attach_volume"}
+cloudprovider_aws_api_request_duration_seconds { request = "detach_volume"}
+cloudprovider_aws_api_request_duration_seconds { request = "create_tags"}
+cloudprovider_aws_api_request_duration_seconds { request = "create_volume"}
+cloudprovider_aws_api_request_duration_seconds { request = "delete_volume"}
+cloudprovider_aws_api_request_duration_seconds { request = "describe_instance"}
+cloudprovider_aws_api_request_duration_seconds { request = "describe_volume"}
+----
+
+See
+link:https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cloud-provider/cloudprovider-storage-metrics.md[Cloud
+Provider (specifically GCE and AWS) metrics for Storage API calls] for more
+information.
+
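Because Prometheus expands a histogram into `_sum`, `_count`, and `_bucket` series, a mean call latency can be derived from the difference between two scrapes. A minimal Python sketch; the sample values are hypothetical, not taken from a real cluster:

```python
# Sketch: mean API-call duration between two scrapes of a Prometheus
# histogram, using its _sum (total seconds) and _count (total calls) series.

def mean_duration(sum_prev, count_prev, sum_now, count_now):
    """Mean duration (seconds) of calls observed between two scrapes."""
    calls = count_now - count_prev
    if calls <= 0:
        return 0.0
    return (sum_now - sum_prev) / calls

# e.g. cloudprovider_aws_api_request_duration_seconds_sum/_count for
# request="attach_volume": 1.5 s accumulated across 5 new calls.
print(mean_duration(12.5, 40, 14.0, 45))  # → 0.3
```

The same arithmetic is what the PromQL `rate()` of `_sum` divided by `rate()` of `_count` computes on the server side.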
+*Volume Operation Metrics*
+
+These metrics report the time taken by a storage operation once it has
+started. They track operation time at the plug-in level, but do not include
+the time taken for the goroutine to run or for the operation to be picked up
+from the internal queue. These metrics are histograms.
+
+.Example summary of available volume operation metrics
+----
+storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_attach" }
+storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_detach" }
+storage_operation_duration_seconds { volume_plugin = "glusterfs", operation_name = "volume_provision" }
+storage_operation_duration_seconds { volume_plugin = "gce-pd", operation_name = "volume_delete" }
+storage_operation_duration_seconds { volume_plugin = "vsphere", operation_name = "volume_mount" }
+storage_operation_duration_seconds { volume_plugin = "iscsi" , operation_name = "volume_unmount" }
+storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "unmount_device" }
+storage_operation_duration_seconds { volume_plugin = "cinder" , operation_name = "verify_volumes_are_attached" }
+storage_operation_duration_seconds { volume_plugin = "<n/a>" , operation_name = "verify_volumes_are_attached_per_node" }
+----
+
+See
+link:https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-metrics.md[Volume
+operation metrics] for more information.
+
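Since these series are exposed in the Prometheus text format on the kubelet's `/metrics` endpoint, they can be scraped and parsed directly. A hedged Python sketch; the two-line scrape body is fabricated for illustration:

```python
# Sketch: extracting storage_operation_duration_seconds samples from a
# Prometheus text-format scrape. The scrape body below is fabricated.
import re

SCRAPE = """\
storage_operation_duration_seconds_count{volume_plugin="aws-ebs",operation_name="volume_attach"} 12
storage_operation_duration_seconds_sum{volume_plugin="aws-ebs",operation_name="volume_attach"} 30.6
"""

# metric_name{label="value",...} sample_value
LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

def parse(text):
    """Map (metric name, raw label string) -> sample value."""
    samples = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            name, labels, value = m.groups()
            samples[(name, labels)] = float(value)
    return samples

s = parse(SCRAPE)
labels = 'volume_plugin="aws-ebs",operation_name="volume_attach"'
count = s[('storage_operation_duration_seconds_count', labels)]
total = s[('storage_operation_duration_seconds_sum', labels)]
print(total / count)  # mean attach time in seconds for this plug-in
```

In practice a real client library would be used for scraping; the point is only that the histogram's `_sum`/`_count` pair carries enough information to recover a mean operation time per plug-in and operation.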
+*Volume Stats Metrics*
+
+These metrics typically report usage statistics for a persistent volume claim
+(PVC), such as used space versus available space. These metrics are gauges.
+
+.Volume Stats Metrics
+|===
+|Metric|Type|Labels/tags
+
+|volume_stats_capacityBytes
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_usedBytes
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_availableBytes
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_InodesFree
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_Inodes
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_InodesUsed
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+|===
+
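As a sketch of how these gauges combine, percent-used for a PVC can be computed from `volume_stats_usedBytes` and `volume_stats_capacityBytes`; the byte values below are hypothetical (7 GiB used of a 10 GiB volume):

```python
# Sketch: percent of a PVC's capacity in use, from the volume stats gauges
# volume_stats_usedBytes and volume_stats_capacityBytes. Values hypothetical.

def pvc_used_percent(used_bytes, capacity_bytes):
    """Percentage of PVC capacity currently used; 0.0 if capacity unknown."""
    if capacity_bytes == 0:
        return 0.0
    return 100.0 * used_bytes / capacity_bytes

print(pvc_used_percent(7_516_192_768, 10_737_418_240))  # → 70.0
```

The same ratio is a natural basis for an alert on a Cassandra PVC filling up, given the data-loss warnings earlier in this document.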
 [[openshift-prometheus-undeploy]]
 === Undeploying Prometheus
 