Trello card: Added Kubernetes Storage Metrics via Prometheus section #7579

Merged
1 commit merged into from Feb 13, 2018
171 changes: 143 additions & 28 deletions install_config/cluster_metrics.adoc
@@ -37,6 +37,10 @@ each node individually through the `/stats` endpoint. From there, Heapster
scrapes the metrics for CPU, memory and network usage, then exports them into
Hawkular Metrics.

The storage volume metrics available on the kubelet are not available through
the `/stats` endpoint, but are available through the `/metrics` endpoint. See
xref:../install_config/cluster_metrics.adoc#openshift-prometheus-kubernetes-metrics[{product-title} Metrics via Prometheus] for detailed information.
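
A quick way to confirm that the kubelet exposes these metrics is to query its
`/metrics` endpoint directly. This is only a sketch: the node name is an example,
and it assumes the kubelet serves metrics on its default secure port (10250) and
that your token is allowed to read node stats:

----
# Node name is an example; 10250 is the kubelet's default secure port.
$ curl -k -H "Authorization: Bearer $(oc whoami -t)" \
    https://node1.example.com:10250/metrics | grep volume_stats
----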

Browsing individual pods in the web console displays separate sparkline charts
for memory and CPU. The time range displayed is selectable, and these charts
automatically update every 30 seconds. If there are multiple containers on the
@@ -63,7 +67,7 @@ previous to v1.0.8, even if it has since been updated to a newer version, follow
the instructions for node certificates outlined in
xref:../install_config/upgrading/manual_upgrades.adoc#manual-updating-master-and-node-certificates[Updating
Master and Node Certificates]. If the node certificate does not contain the IP
address of the node, then Heapster will fail to retrieve any metrics.
address of the node, then Heapster fails to retrieve any metrics.
====
endif::[]

@@ -102,9 +106,9 @@ volume].
=== Persistent Storage

Running {product-title} cluster metrics with persistent storage means that your
metrics will be stored to a
metrics are stored to a
xref:../architecture/additional_concepts/storage.adoc#persistent-volumes[persistent
volume] and be able to survive a pod being restarted or recreated. This is ideal
volume] and are able to survive a pod being restarted or recreated. This is ideal
if you require your metrics data to be guarded from data loss. For production
environments it is highly recommended to configure persistent storage for your
metrics pods.
@@ -205,7 +209,7 @@ storage space as a buffer for unexpected monitored pod usage.
[WARNING]
====
If the Cassandra persisted volume runs out of sufficient space, then data loss
will occur.
occurs.
====

For cluster metrics to work with persistent storage, ensure that the persistent
@@ -245,7 +249,7 @@ metrics-gathering solutions.
=== Non-Persistent Storage

Running {product-title} cluster metrics with non-persistent storage means that
any stored metrics will be deleted when the pod is deleted. While it is much
any stored metrics are deleted when the pod is deleted. While it is much
easier to run cluster metrics with non-persistent data, running with
non-persistent data does come with the risk of permanent data loss. However,
metrics can still survive a container being restarted.
@@ -257,16 +261,16 @@ to `emptyDir` in the inventory file.
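
For example, a minimal sketch of the relevant inventory line (the rest of the
inventory is omitted):

----
openshift_metrics_cassandra_storage_type=emptyDir
----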

[NOTE]
====
When using non-persistent storage, metrics data will be written to
When using non-persistent storage, metrics data is written to
*_/var/lib/origin/openshift.local.volumes/pods_* on the node where the Cassandra
pod is running. Ensure *_/var_* has enough free space to accommodate metrics
pod runs. Ensure *_/var_* has enough free space to accommodate metrics
storage.
====

[[metrics-ansible-role]]
== Metrics Ansible Role

The OpenShift Ansible `openshift_metrics` role configures and deploys all of the
The {product-title} Ansible `openshift_metrics` role configures and deploys all of the
metrics components using the variables from the
xref:../install_config/install/advanced_install.adoc#configuring-ansible[Configuring
Ansible] inventory file.
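
As an illustration only, a few commonly set `openshift_metrics` variables might look
like the following in the `[OSEv3:vars]` section of the inventory. The host name and
size are placeholder values:

----
# Example values; adjust for your environment.
openshift_metrics_install_metrics=true
openshift_metrics_hawkular_hostname=hawkular-metrics.example.com
openshift_metrics_cassandra_storage_type=pv
openshift_metrics_cassandra_pvc_size=10Gi
----
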
@@ -445,7 +449,7 @@ Technology Preview and is not installed by default.

[NOTE]
====
The Hawkular OpenShift Agent on {product-title} is a Technology Preview feature
The Hawkular {product-title} Agent on {product-title} is a Technology Preview feature
only.
ifdef::openshift-enterprise[]
Technology Preview features are not
@@ -479,7 +483,7 @@ that it does not become full.

[WARNING]
====
Data loss will result if the Cassandra persisted volume runs out of sufficient space.
Data loss results if the Cassandra persisted volume runs out of sufficient space.
====

All of the other variables are optional and allow for greater customization.
@@ -500,8 +504,8 @@ running.
[[metrics-using-secrets]]
=== Using Secrets

The OpenShift Ansible `openshift_metrics` role will auto-generate self-signed certificates for use between its
components and will generate a
The {product-title} Ansible `openshift_metrics` role auto-generates self-signed certificates for use between its
components and generates a
xref:../architecture/networking/routes.adoc#secured-routes[re-encrypting route] to expose
the Hawkular Metrics service. This route is what allows the web console to access the Hawkular Metrics
service.
@@ -510,14 +514,14 @@ In order for the browser running the web console to trust the connection through
this route, it must trust the route's certificate. This can be accomplished by
xref:metrics-using-secrets-byo-certs[providing your own certificates] signed by
a trusted Certificate Authority. The `openshift_metrics` role allows you to
specify your own certificates which it will then use when creating the route.
specify your own certificates, which it then uses when creating the route.

The router's default certificate is used if you do not provide your own.

[[metrics-using-secrets-byo-certs]]
==== Providing Your Own Certificates

To provide your own certificate which will be used by the
To provide your own certificate, which is used by the
xref:../architecture/networking/routes.adoc#secured-routes[re-encrypting
route], you can set the `openshift_metrics_hawkular_cert`,
`openshift_metrics_hawkular_key`, and `openshift_metrics_hawkular_ca`
@@ -536,7 +540,7 @@ route documentation].
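
For example, these variables might be set in the inventory as follows; the file paths
are placeholders for certificates you already have on the host running Ansible:

----
# Placeholder paths; point these at your own certificate files.
openshift_metrics_hawkular_cert=/path/to/hawkular-metrics.crt
openshift_metrics_hawkular_key=/path/to/hawkular-metrics.key
openshift_metrics_hawkular_ca=/path/to/ca.crt
----
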
== Deploying the Metric Components

Because deploying and configuring all the metric components is handled with
OpenShift Ansible, you can deploy everything in one step.
{product-title} Ansible, you can deploy everything in one step.

The following examples show you how to deploy metrics with and without
persistent storage using the default parameters.
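
For instance, a deployment run might look like the following. The playbook path is an
assumption based on a typical openshift-ansible layout and varies by release, so verify
it against your installed version:

----
# Playbook path is an assumption; adjust for your openshift-ansible version.
$ ansible-playbook -i /path/to/inventory \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-metrics.yml \
    -e openshift_metrics_install_metrics=True
----
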
@@ -619,8 +623,7 @@ For example, if your `openshift_metrics_hawkular_hostname` corresponds to
Once you have updated and saved the *_master-config.yaml_* file, you must
restart your {product-title} instance.

When your {product-title} server is back up and running, metrics will be
displayed on the pod overview pages.
When your {product-title} server is back up and running, metrics are displayed on the pod overview pages.

[CAUTION]
====
@@ -642,16 +645,16 @@ Metrics API].

[NOTE]
====
When accessing Hawkular Metrics from the API, you will only be able to perform
reads. Writing metrics has been disabled by default. If you want for individual
When accessing Hawkular Metrics from the API, you are only able to perform
reads. Writing metrics is disabled by default. If you want individual
users to also be able to write metrics, you must set the
`openshift_metrics_hawkular_user_write_access`
xref:../install_config/cluster_metrics.adoc#metrics-ansible-variables[variable]
to *true*.

However, it is recommended to use the default configuration and only have
metrics enter the system via Heapster. If write access is enabled, any user
will be able to write metrics to the system, which can affect performance and
can write metrics to the system, which can affect performance and
cause Cassandra disk usage to unpredictably increase.
====

@@ -676,7 +679,7 @@ privileges to access.
[[cluster-metrics-authorization]]
=== Authorization

The Hawkular Metrics service will authenticate the user against {product-title}
The Hawkular Metrics service authenticates the user against {product-title}
to determine if the user has access to the project it is trying to access.

Hawkular Metrics accepts a bearer token from the client and verifies that token
@@ -692,8 +695,8 @@ ifdef::openshift-origin[]
[[cluster-metrics-accessing-heapster-directly]]
== Accessing Heapster Directly

Heapster has been configured to be only accessible via the API proxy.
Accessing it will required either a cluster-reader or cluster-admin privileges.
Heapster is configured to only be accessible via the API proxy. Accessing
Heapster requires either cluster-reader or cluster-admin privileges.

For example, to access the Heapster *validate* page, you need to access it
using something similar to:
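
A sketch of such a request; the master host and token are placeholders, and the exact
proxy path can differ between cluster versions:

----
# <token> and ${KUBERNETES_MASTER} are placeholders.
$ curl -H "Authorization: Bearer <token>" \
    -X GET https://${KUBERNETES_MASTER}/api/v1/proxy/namespaces/openshift-infra/services/https:heapster:/validate
----
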
@@ -718,8 +721,8 @@ Performance Guide].
== Integration with Aggregated Logging

Hawkular Alerts must be connected to the Aggregated Logging's Elasticsearch to
react on log events. By default, Hawkular will try to find Elasticsearch on its
default place (namespace `logging`, pod `logging-es`) at every boot. If the
react to log events. By default, Hawkular tries to find Elasticsearch in its
default location (namespace `logging`, pod `logging-es`) at every boot. If
Aggregated Logging is installed after Hawkular, the Hawkular Metrics pod might
need to be restarted in order to recognize the new Elasticsearch server. The
Hawkular boot log provides a clear indication if the integration could not be
@@ -754,7 +757,7 @@ available.
[[metrics-cleanup]]
== Cleanup

You can remove everything deployed by the OpenShift Ansible `openshift_metrics` role
You can remove everything deployed by the {product-title} Ansible `openshift_metrics` role
by performing the following steps:

----
@@ -771,7 +774,7 @@ system resources.

[IMPORTANT]
====
Prometheus on OpenShift is a Technology Preview feature only.
Prometheus on {product-title} is a Technology Preview feature only.
ifdef::openshift-enterprise[]
Technology Preview features are not supported with Red Hat production service
level agreements (SLAs), might not be functionally complete, and Red Hat does
@@ -912,7 +915,7 @@ The Prometheus server automatically exposes a Web UI at `localhost:9090`. You
can access the Prometheus Web UI with the `view` role.
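
One hedged way to reach the UI from a workstation is a port-forward; the namespace and
pod name below are assumptions and depend on how Prometheus was deployed:

----
# Namespace and pod name are assumptions; adjust to your deployment.
$ oc port-forward -n openshift-metrics prometheus-0 9090:9090
----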

[[openshift-prometheus-config]]
==== Configuring Prometheus for OpenShift
==== Configuring Prometheus for {product-title}
//
// Example Prometheus rules file:
// ----
@@ -1031,6 +1034,118 @@ Once `openshift_metrics_project: openshift-infra` is installed, metrics can be
gathered from the `http://${POD_IP}:7575/metrics` endpoint.
====

[[openshift-prometheus-kubernetes-metrics]]
=== {product-title} Metrics via Prometheus

The state of a system can be gauged by the metrics that it emits. This section
describes current and proposed metrics that identify the health of the storage subsystem and
cluster.

[[k8s-current-metrics]]
==== Current Metrics

This section describes the metrics currently emitted from the Kubernetes storage subsystem.

*Cloud Provider API Call Metrics*

These metrics report the time and count of successes and failures of all
cloud provider API calls, and include `aws_attach_time` and
`aws_detach_time`. They are emitted as histograms, so Prometheus also
generates sum, count, and bucket metrics for them.

.Example summary of cloudprovider metrics from GCE:
----
cloudprovider_gce_api_request_duration_seconds { request = "instance_list"}
cloudprovider_gce_api_request_duration_seconds { request = "disk_insert"}
cloudprovider_gce_api_request_duration_seconds { request = "disk_delete"}
cloudprovider_gce_api_request_duration_seconds { request = "attach_disk"}
cloudprovider_gce_api_request_duration_seconds { request = "detach_disk"}
cloudprovider_gce_api_request_duration_seconds { request = "list_disk"}
----

.Example summary of cloudprovider metrics from AWS:
----
cloudprovider_aws_api_request_duration_seconds { request = "attach_volume"}
cloudprovider_aws_api_request_duration_seconds { request = "detach_volume"}
cloudprovider_aws_api_request_duration_seconds { request = "create_tags"}
cloudprovider_aws_api_request_duration_seconds { request = "create_volume"}
cloudprovider_aws_api_request_duration_seconds { request = "delete_volume"}
cloudprovider_aws_api_request_duration_seconds { request = "describe_instance"}
cloudprovider_aws_api_request_duration_seconds { request = "describe_volume"}
----
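
Because these are histograms, the generated `_bucket` series can be used to derive
latency percentiles in Prometheus. A hedged PromQL sketch, using one of the AWS metrics
listed above:

----
# 99th percentile attach latency over the last 5 minutes.
histogram_quantile(0.99,
  sum(rate(cloudprovider_aws_api_request_duration_seconds_bucket{request="attach_volume"}[5m])) by (le))
----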

See
link:https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cloud-provider/cloudprovider-storage-metrics.md[Cloud
Provider (specifically GCE and AWS) metrics for Storage API calls] for more
information.

*Volume Operation Metrics*

These metrics report the time taken by a storage operation once it has started. They
track operation time at the plug-in level, but do not include the time an operation
waits in the internal queue or the time taken by the goroutine to run. These metrics
are emitted as histograms.

.Example summary of available volume operation metrics
----
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_attach" }
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_detach" }
storage_operation_duration_seconds { volume_plugin = "glusterfs", operation_name = "volume_provision" }
storage_operation_duration_seconds { volume_plugin = "gce-pd", operation_name = "volume_delete" }
storage_operation_duration_seconds { volume_plugin = "vsphere", operation_name = "volume_mount" }
storage_operation_duration_seconds { volume_plugin = "iscsi" , operation_name = "volume_unmount" }
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "unmount_device" }
storage_operation_duration_seconds { volume_plugin = "cinder" , operation_name = "verify_volumes_are_attached" }
storage_operation_duration_seconds { volume_plugin = "<n/a>" , operation_name = "verify_volumes_are_attached_per_node" }
----
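
Similarly, the histogram's generated `_count` series can show how often each operation
runs. A hedged PromQL sketch using the metric above:

----
# Mount operations per second, broken down by volume plug-in.
sum(rate(storage_operation_duration_seconds_count{operation_name="volume_mount"}[5m])) by (volume_plugin)
----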

See
link:https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-metrics.md[Volume
operation metrics] for more information.

*Volume Stats Metrics*

These metrics report usage statistics of a persistent volume claim (PVC), such as used space versus available space. These metrics are emitted as gauges.

.Volume Stats Metrics
|===
|Metric|Type|Labels/tags

|volume_stats_capacityBytes
|Gauge
|namespace=<persistentvolumeclaim-namespace>
persistentvolumeclaim=<persistentvolumeclaim-name>
persistentvolume=<persistentvolume-name>

|volume_stats_usedBytes
|Gauge
|namespace=<persistentvolumeclaim-namespace>
persistentvolumeclaim=<persistentvolumeclaim-name>
persistentvolume=<persistentvolume-name>

|volume_stats_availableBytes
|Gauge
|namespace=<persistentvolumeclaim-namespace>
persistentvolumeclaim=<persistentvolumeclaim-name>
persistentvolume=<persistentvolume-name>

|volume_stats_InodesFree
|Gauge
|namespace=<persistentvolumeclaim-namespace>
persistentvolumeclaim=<persistentvolumeclaim-name>
persistentvolume=<persistentvolume-name>

|volume_stats_Inodes
|Gauge
|namespace=<persistentvolumeclaim-namespace>
persistentvolumeclaim=<persistentvolumeclaim-name>
persistentvolume=<persistentvolume-name>

|volume_stats_InodesUsed
|Gauge
|namespace=<persistentvolumeclaim-namespace>
persistentvolumeclaim=<persistentvolumeclaim-name>
persistentvolume=<persistentvolume-name>
|===
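
These gauges can be combined to watch PVC utilization. A hedged PromQL sketch using the
metric names above (the 90% threshold is only an example):

----
# Percentage of PVC capacity in use, filtered to claims above 90%.
100 * volume_stats_usedBytes / volume_stats_capacityBytes > 90
----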

[[openshift-prometheus-undeploy]]
=== Undeploying Prometheus
