Commit 6391aa4

Author: Traci Morrison
Merge pull request #7579 from tmorriso-rh/Trello-storage-prometheus-endpoint-coverage

Trello card: Added Kubernetes Storage Metrics via Prometheus section

2 parents 513488c + ea071dc, commit 6391aa4

File tree

1 file changed: +143 additions, -28 deletions


install_config/cluster_metrics.adoc

Lines changed: 143 additions & 28 deletions
@@ -37,6 +37,10 @@ each node individually through the `/stats` endpoint. From there, Heapster
 scrapes the metrics for CPU, memory and network usage, then exports them into
 Hawkular Metrics.
 
+The storage volume metrics available on the kubelet are not available through
+the `/stats` endpoint, but are available through the `/metrics` endpoint. See
+{product-title} Metrics via Prometheus for detailed information.
+
 Browsing individual pods in the web console displays separate sparkline charts
 for memory and CPU. The time range displayed is selectable, and these charts
 automatically update every 30 seconds. If there are multiple containers on the
@@ -63,7 +67,7 @@ previous to v1.0.8, even if it has since been updated to a newer version, follow
 the instructions for node certificates outlined in
 xref:../install_config/upgrading/manual_upgrades.adoc#manual-updating-master-and-node-certificates[Updating
 Master and Node Certificates]. If the node certificate does not contain the IP
-address of the node, then Heapster will fail to retrieve any metrics.
+address of the node, then Heapster fails to retrieve any metrics.
 ====
 endif::[]
 
@@ -102,9 +106,9 @@ volume].
 === Persistent Storage
 
 Running {product-title} cluster metrics with persistent storage means that your
-metrics will be stored to a
+metrics are stored to a
 xref:../architecture/additional_concepts/storage.adoc#persistent-volumes[persistent
-volume] and be able to survive a pod being restarted or recreated. This is ideal
+volume] and are able to survive a pod being restarted or recreated. This is ideal
 if you require your metrics data to be guarded from data loss. For production
 environments it is highly recommended to configure persistent storage for your
 metrics pods.
@@ -205,7 +209,7 @@ storage space as a buffer for unexpected monitored pod usage.
 [WARNING]
 ====
 If the Cassandra persisted volume runs out of sufficient space, then data loss
-will occur.
+occurs.
 ====
 
 For cluster metrics to work with persistent storage, ensure that the persistent
@@ -245,7 +249,7 @@ metrics-gathering solutions.
 === Non-Persistent Storage
 
 Running {product-title} cluster metrics with non-persistent storage means that
-any stored metrics will be deleted when the pod is deleted. While it is much
+any stored metrics are deleted when the pod is deleted. While it is much
 easier to run cluster metrics with non-persistent data, running with
 non-persistent data does come with the risk of permanent data loss. However,
 metrics can still survive a container being restarted.
@@ -257,16 +261,16 @@ to `emptyDir` in the inventory file.
 
 [NOTE]
 ====
-When using non-persistent storage, metrics data will be written to
+When using non-persistent storage, metrics data is written to
 *_/var/lib/origin/openshift.local.volumes/pods_* on the node where the Cassandra
-pod is running. Ensure *_/var_* has enough free space to accommodate metrics
+pod runs. Ensure *_/var_* has enough free space to accommodate metrics
 storage.
 ====
 
 [[metrics-ansible-role]]
 == Metrics Ansible Role
 
-The OpenShift Ansible `openshift_metrics` role configures and deploys all of the
+The {product-title} Ansible `openshift_metrics` role configures and deploys all of the
 metrics components using the variables from the
 xref:../install_config/install/advanced_install.adoc#configuring-ansible[Configuring
 Ansible] inventory file.
@@ -445,7 +449,7 @@ Technology Preview and is not installed by default.
 
 [NOTE]
 ====
-The Hawkular OpenShift Agent on {product-title} is a Technology Preview feature
+The Hawkular {product-title} Agent on {product-title} is a Technology Preview feature
 only.
 ifdef::openshift-enterprise[]
 Technology Preview features are not
@@ -479,7 +483,7 @@ that it does not become full.
 
 [WARNING]
 ====
-Data loss will result if the Cassandra persisted volume runs out of sufficient space.
+Data loss results if the Cassandra persisted volume runs out of sufficient space.
 ====
 
 All of the other variables are optional and allow for greater customization.
@@ -500,8 +504,8 @@ running.
 [[metrics-using-secrets]]
 === Using Secrets
 
-The OpenShift Ansible `openshift_metrics` role will auto-generate self-signed certificates for use between its
-components and will generate a
+The {product-title} Ansible `openshift_metrics` role auto-generates self-signed certificates for use between its
+components and generates a
 xref:../architecture/networking/routes.adoc#secured-routes[re-encrypting route] to expose
 the Hawkular Metrics service. This route is what allows the web console to access the Hawkular Metrics
 service.
@@ -510,14 +514,14 @@ In order for the browser running the web console to trust the connection through
 this route, it must trust the route's certificate. This can be accomplished by
 xref:metrics-using-secrets-byo-certs[providing your own certificates] signed by
 a trusted Certificate Authority. The `openshift_metrics` role allows you to
-specify your own certificates which it will then use when creating the route.
+specify your own certificates, which it then uses when creating the route.
 
 The router's default certificate are used if you do not provide your own.
 
 [[metrics-using-secrets-byo-certs]]
 ==== Providing Your Own Certificates
 
-To provide your own certificate which will be used by the
+To provide your own certificate, which is used by the
 xref:../architecture/networking/routes.adoc#secured-routes[re-encrypting
 route], you can set the `openshift_metrics_hawkular_cert`,
 `openshift_metrics_hawkular_key`, and `openshift_metrics_hawkular_ca`
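As a sketch of how these three variables fit together, they could be set in the Ansible inventory like so. The variable names are from this section; the `[OSEv3:vars]` group name and the file paths are hypothetical examples only:

```ini
# Hypothetical inventory entries for supplying your own route certificates.
# Replace the paths with the locations of your actual certificate files.
[OSEv3:vars]
openshift_metrics_hawkular_cert=/path/to/hawkular-metrics.crt
openshift_metrics_hawkular_key=/path/to/hawkular-metrics.key
openshift_metrics_hawkular_ca=/path/to/ca.crt
```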
@@ -536,7 +540,7 @@ route documentation].
 == Deploying the Metric Components
 
 Because deploying and configuring all the metric components is handled with
-OpenShift Ansible, you can deploy everything in one step.
+{product-title} Ansible, you can deploy everything in one step.
 
 The following examples show you how to deploy metrics with and without
 persistent storage using the default parameters.
@@ -619,8 +623,7 @@ For example, if your `openshift_metrics_hawkular_hostname` corresponds to
 Once you have updated and saved the *_master-config.yaml_* file, you must
 restart your {product-title} instance.
 
-When your {product-title} server is back up and running, metrics will be
-displayed on the pod overview pages.
+When your {product-title} server is back up and running, metrics are displayed on the pod overview pages.
 
 [CAUTION]
 ====
@@ -642,16 +645,16 @@ Metrics API].
 
 [NOTE]
 ====
-When accessing Hawkular Metrics from the API, you will only be able to perform
-reads. Writing metrics has been disabled by default. If you want for individual
+When accessing Hawkular Metrics from the API, you are only able to perform
+reads. Writing metrics is disabled by default. If you want individual
 users to also be able to write metrics, you must set the
 `openshift_metrics_hawkular_user_write_access`
 xref:../install_config/cluster_metrics.adoc#metrics-ansible-variables[variable]
 to *true*.
 
 However, it is recommended to use the default configuration and only have
 metrics enter the system via Heapster. If write access is enabled, any user
-will be able to write metrics to the system, which can affect performance and
+can write metrics to the system, which can affect performance and
 cause Cassandra disk usage to unpredictably increase.
 ====
 
@@ -676,7 +679,7 @@ privileges to access.
 [[cluster-metrics-authorization]]
 === Authorization
 
-The Hawkular Metrics service will authenticate the user against {product-title}
+The Hawkular Metrics service authenticates the user against {product-title}
 to determine if the user has access to the project it is trying to access.
 
 Hawkular Metrics accepts a bearer token from the client and verifies that token
@@ -692,8 +695,8 @@ ifdef::openshift-origin[]
 [[cluster-metrics-accessing-heapster-directly]]
 == Accessing Heapster Directly
 
-Heapster has been configured to be only accessible via the API proxy.
-Accessing it will required either a cluster-reader or cluster-admin privileges.
+Heapster is configured to only be accessible via the API proxy. Accessing
+Heapster requires either cluster-reader or cluster-admin privileges.
 
 For example, to access the Heapster *validate* page, you need to access it
 using something similar to:
@@ -718,8 +721,8 @@ Performance Guide].
 == Integration with Aggregated Logging
 
 Hawkular Alerts must be connected to the Aggregated Logging's Elasticsearch to
-react on log events. By default, Hawkular will try to find Elasticsearch on its
-default place (namespace `logging`, pod `logging-es`) at every boot. If the
+react to log events. By default, Hawkular tries to find Elasticsearch in its
+default place (namespace `logging`, pod `logging-es`) at every boot. If
 Aggregated Logging is installed after Hawkular, the Hawkular Metrics pod might
 need to be restarted in order to recognize the new Elasticsearch server. The
 Hawkular boot log provides a clear indication if the integration could not be
@@ -754,7 +757,7 @@ available.
 [[metrics-cleanup]]
 == Cleanup
 
-You can remove everything deployed by the OpenShift Ansible `openshift_metrics` role
+You can remove everything deployed by the {product-title} Ansible `openshift_metrics` role
 by performing the following steps:
 
 ----
@@ -771,7 +774,7 @@ system resources.
 
 [IMPORTANT]
 ====
-Prometheus on OpenShift is a Technology Preview feature only.
+Prometheus on {product-title} is a Technology Preview feature only.
 ifdef::openshift-enterprise[]
 Technology Preview features are not supported with Red Hat production service
 level agreements (SLAs), might not be functionally complete, and Red Hat does
@@ -912,7 +915,7 @@ The Prometheus server automatically exposes a Web UI at `localhost:9090`. You
 can access the Prometheus Web UI with the `view` role.
 
 [[openshift-prometheus-config]]
-==== Configuring Prometheus for OpenShift
+==== Configuring Prometheus for {product-title}
 //
 // Example Prometheus rules file:
 // ----
@@ -1031,6 +1034,118 @@ Once `openshift_metrics_project: openshift-infra` is installed, metrics can be
 gathered from the `http://${POD_IP}:7575/metrics` endpoint.
 ====
 
+[[openshift-prometheus-kubernetes-metrics]]
+=== {product-title} Metrics via Prometheus
+
+The state of a system can be gauged by the metrics that it emits. This section
+describes current and proposed metrics that identify the health of the storage
+subsystem and cluster.
+
+[[k8s-current-metrics]]
+==== Current Metrics
+
+This section describes the metrics currently emitted from the Kubernetes storage subsystem.
+
+*Cloud Provider API Call Metrics*
+
+These metrics report the time and count of successes and failures of all
+cloud provider API calls. They include `aws_attach_time` and
+`aws_detach_time`. Because these metrics are emitted as histograms, Prometheus
+also generates sum, count, and bucket metrics for them.
+
+.Example summary of cloud provider metrics from GCE
+----
+cloudprovider_gce_api_request_duration_seconds { request = "instance_list"}
+cloudprovider_gce_api_request_duration_seconds { request = "disk_insert"}
+cloudprovider_gce_api_request_duration_seconds { request = "disk_delete"}
+cloudprovider_gce_api_request_duration_seconds { request = "attach_disk"}
+cloudprovider_gce_api_request_duration_seconds { request = "detach_disk"}
+cloudprovider_gce_api_request_duration_seconds { request = "list_disk"}
+----
+
+.Example summary of cloud provider metrics from AWS
+----
+cloudprovider_aws_api_request_duration_seconds { request = "attach_volume"}
+cloudprovider_aws_api_request_duration_seconds { request = "detach_volume"}
+cloudprovider_aws_api_request_duration_seconds { request = "create_tags"}
+cloudprovider_aws_api_request_duration_seconds { request = "create_volume"}
+cloudprovider_aws_api_request_duration_seconds { request = "delete_volume"}
+cloudprovider_aws_api_request_duration_seconds { request = "describe_instance"}
+cloudprovider_aws_api_request_duration_seconds { request = "describe_volume"}
+----
+
+See
+link:https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cloud-provider/cloudprovider-storage-metrics.md[Cloud
+Provider (specifically GCE and AWS) metrics for Storage API calls] for more
+information.
+
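Because Prometheus expands a histogram into `_sum`, `_count`, and `_bucket` series, a mean call latency can be derived from the difference between two scrapes. A minimal Python sketch; the sample values are hypothetical, not taken from a real cluster:

```python
# Sketch: mean API-call duration between two scrapes of a Prometheus
# histogram, using its _sum (total seconds) and _count (total calls) series.

def mean_duration(sum_prev, count_prev, sum_now, count_now):
    """Mean duration (seconds) of calls observed between two scrapes."""
    calls = count_now - count_prev
    if calls <= 0:
        return 0.0
    return (sum_now - sum_prev) / calls

# e.g. cloudprovider_aws_api_request_duration_seconds_sum/_count for
# request="attach_volume": 1.5 s accumulated across 5 new calls.
print(mean_duration(12.5, 40, 14.0, 45))  # → 0.3
```

The same arithmetic is what the PromQL `rate()` of `_sum` divided by `rate()` of `_count` computes on the server side.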
+*Volume Operation Metrics*
+
+These metrics report the time taken by a storage operation once it has
+started. They track operation time at the plug-in level, but do not include
+the time taken for the goroutine to run or for the operation to be picked up
+from the internal queue. These metrics are histograms.
+
+.Example summary of available volume operation metrics
+----
+storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_attach" }
+storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_detach" }
+storage_operation_duration_seconds { volume_plugin = "glusterfs", operation_name = "volume_provision" }
+storage_operation_duration_seconds { volume_plugin = "gce-pd", operation_name = "volume_delete" }
+storage_operation_duration_seconds { volume_plugin = "vsphere", operation_name = "volume_mount" }
+storage_operation_duration_seconds { volume_plugin = "iscsi" , operation_name = "volume_unmount" }
+storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "unmount_device" }
+storage_operation_duration_seconds { volume_plugin = "cinder" , operation_name = "verify_volumes_are_attached" }
+storage_operation_duration_seconds { volume_plugin = "<n/a>" , operation_name = "verify_volumes_are_attached_per_node" }
+----
+
+See
+link:https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-metrics.md[Volume
+operation metrics] for more information.
+
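Since these series are exposed in the Prometheus text format on the kubelet's `/metrics` endpoint, they can be scraped and parsed directly. A hedged Python sketch; the two-line scrape body is fabricated for illustration:

```python
# Sketch: extracting storage_operation_duration_seconds samples from a
# Prometheus text-format scrape. The scrape body below is fabricated.
import re

SCRAPE = """\
storage_operation_duration_seconds_count{volume_plugin="aws-ebs",operation_name="volume_attach"} 12
storage_operation_duration_seconds_sum{volume_plugin="aws-ebs",operation_name="volume_attach"} 30.6
"""

# metric_name{label="value",...} sample_value
LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

def parse(text):
    """Map (metric name, raw label string) -> sample value."""
    samples = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            name, labels, value = m.groups()
            samples[(name, labels)] = float(value)
    return samples

s = parse(SCRAPE)
labels = 'volume_plugin="aws-ebs",operation_name="volume_attach"'
count = s[('storage_operation_duration_seconds_count', labels)]
total = s[('storage_operation_duration_seconds_sum', labels)]
print(total / count)  # mean attach time in seconds for this plug-in
```

In practice a real client library would be used for scraping; the point is only that the histogram's `_sum`/`_count` pair carries enough information to recover a mean operation time per plug-in and operation.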
+*Volume Stats Metrics*
+
+These metrics typically report usage statistics for a persistent volume claim
+(PVC), such as used space versus available space. These metrics are gauges.
+
+.Volume Stats Metrics
+|===
+|Metric|Type|Labels/tags
+
+|volume_stats_capacityBytes
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_usedBytes
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_availableBytes
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_InodesFree
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_Inodes
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+
+|volume_stats_InodesUsed
+|Gauge
+|namespace=<persistentvolumeclaim-namespace>
+persistentvolumeclaim=<persistentvolumeclaim-name>
+persistentvolume=<persistentvolume-name>
+|===
+
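As a sketch of how these gauges combine, percent-used for a PVC can be computed from `volume_stats_usedBytes` and `volume_stats_capacityBytes`; the byte values below are hypothetical (7 GiB used of a 10 GiB volume):

```python
# Sketch: percent of a PVC's capacity in use, from the volume stats gauges
# volume_stats_usedBytes and volume_stats_capacityBytes. Values hypothetical.

def pvc_used_percent(used_bytes, capacity_bytes):
    """Percentage of PVC capacity currently used; 0.0 if capacity unknown."""
    if capacity_bytes == 0:
        return 0.0
    return 100.0 * used_bytes / capacity_bytes

print(pvc_used_percent(7_516_192_768, 10_737_418_240))  # → 70.0
```

The same ratio is a natural basis for an alert on a Cassandra PVC filling up, given the data-loss warnings earlier in this document.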
 [[openshift-prometheus-undeploy]]
 === Undeploying Prometheus
 