Add documentation on remote recovery (#39483)

Tim-Brooks · web-flow · commit ee41b22e51a7 · 2019-03-05T09:50:58.000-07:00
This is related to #35975. It adds documentation on the remote recovery process. Additionally, it adds documentation about the various settings that can impact the process.
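
As a rough illustration of how the settings documented here might be applied, the following sketch lowers the remote recovery rate limit through the cluster update settings API. The `20mb` value is only illustrative (it mirrors the example used in the settings text), and the choice of `persistent` over `transient` is an assumption:

```
PUT /_cluster/settings
{
  "persistent": {
    "ccr.indices.recovery.max_bytes_per_sec": "20mb"
  }
}
```

Because the limit is enforced per node and is used by both the leader and the follower clusters, it would likely need to be set on whichever cluster should be throttled.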
diff --git a/docs/reference/ccr/getting-started.asciidoc b/docs/reference/ccr/getting-started.asciidoc
@@ -253,6 +253,11 @@ PUT /server-metrics-copy/_ccr/follow?wait_for_active_shards=1
 
 //////////////////////////
 
+The follower index is initialized using the <<remote-recovery, remote recovery>>
+process. The remote recovery process transfers the existing Lucene segment files
+from the leader to the follower. When the remote recovery process is complete,
+the index following begins.
+
 Now when you index documents into your leader index, you will see these
 documents replicated in the follower index. You can
 inspect the status of replication using the
diff --git a/docs/reference/ccr/index.asciidoc b/docs/reference/ccr/index.asciidoc
@@ -29,3 +29,4 @@ include::overview.asciidoc[]
 include::requirements.asciidoc[]
 include::auto-follow.asciidoc[]
 include::getting-started.asciidoc[]
+include::remote-recovery.asciidoc[]
diff --git a/docs/reference/ccr/remote-recovery.asciidoc b/docs/reference/ccr/remote-recovery.asciidoc
@@ -0,0 +1,29 @@
+[role="xpack"]
+[testenv="platinum"]
+[[remote-recovery]]
+== Remote recovery
+
+When you create a follower index, you cannot use it until it is fully initialized.
+The _remote recovery_ process builds a new copy of a shard on a follower node by
+copying data from the primary shard in the leader cluster. {es} uses this remote
+recovery process to bootstrap a follower index using the data from the leader index.
+This process provides the follower with a copy of the current state of the leader index,
+even if a complete history of changes is not available on the leader due to Lucene
+segment merging.
+
+Remote recovery is a network-intensive process that transfers all of the Lucene
+segment files from the leader cluster to the follower cluster. The follower
+requests that a recovery session be initiated on the primary shard in the leader
+cluster. The follower then requests file chunks concurrently from the leader. By
+default, the process concurrently requests `5` large `1mb` file chunks. This default
+behavior is designed to support leader and follower clusters with high network latency
+between them.
+
+There are dynamic settings that you can use to rate-limit the transmitted data
+and manage the resources consumed by remote recoveries. See
+{ref}/ccr-settings.html[{ccr-cap} settings].
+
+You can obtain information about an in-progress remote recovery by using the
+{ref}/cat-recovery.html[recovery API] on the follower cluster. Remote recoveries
+are implemented using the {ref}/modules-snapshots.html[snapshot and restore] infrastructure. This means that ongoing remote recoveries are labelled as type
+`snapshot` in the recovery API.
diff --git a/docs/reference/settings/ccr-settings.asciidoc b/docs/reference/settings/ccr-settings.asciidoc
@@ -0,0 +1,52 @@
+[role="xpack"]
+[[ccr-settings]]
+=== {ccr-cap} settings
+
+These {ccr} settings can be dynamically updated on a live cluster with the
+<<cluster-update-settings,cluster update settings API>>.
+
+[float]
+[[ccr-recovery-settings]]
+==== Remote recovery settings
+
+The following setting can be used to rate-limit the data transmitted during
+{stack-ov}/remote-recovery.html[remote recoveries]:
+
+`ccr.indices.recovery.max_bytes_per_sec` (<<cluster-update-settings,Dynamic>>)::
+Limits the total inbound and outbound remote recovery traffic on each node.
+Since this limit applies on each node, but there may be many nodes performing
+remote recoveries concurrently, the total amount of remote recovery bytes may be
+much higher than this limit. If you set this limit too high then there is a risk
+that ongoing remote recoveries will consume an excess of bandwidth (or other
+resources) which could destabilize the cluster. This setting is used by both the
+leader and follower clusters. For example, if it is set to `20mb` on a leader,
+the leader will only send `20mb/s` to the follower even if the follower is
+requesting and can accept `60mb/s`. Defaults to `40mb`.
+
+[float]
+[[ccr-advanced-recovery-settings]]
+==== Advanced remote recovery settings
+
+The following _expert_ settings can be set to manage the resources consumed by
+remote recoveries:
+
+`ccr.indices.recovery.max_concurrent_file_chunks` (<<cluster-update-settings,Dynamic>>)::
+Controls the number of file chunk requests that can be sent in parallel per
+recovery. As multiple remote recoveries might already be running in parallel,
+increasing this expert-level setting might only help in situations where remote
+recovery of a single shard is not reaching the total inbound and outbound remote recovery traffic as configured by `ccr.indices.recovery.max_bytes_per_sec`.
+Defaults to `5`. The maximum allowed value is `10`.
+
+`ccr.indices.recovery.chunk_size` (<<cluster-update-settings,Dynamic>>)::
+Controls the chunk size requested by the follower during file transfer. Defaults to
+`1mb`.
+
+`ccr.indices.recovery.recovery_activity_timeout` (<<cluster-update-settings,Dynamic>>)::
+Controls the timeout for recovery activity. This timeout primarily applies on
+the leader cluster. The leader cluster must open resources in-memory to supply
+data to the follower during the recovery process. If the leader does not receive recovery requests from the follower for this period of time, it closes the resources. Defaults to 60 seconds.
+
+`ccr.indices.recovery.internal_action_timeout` (<<cluster-update-settings,Dynamic>>)::
+Controls the timeout for individual network requests during the remote recovery
+process. An individual action timing out can fail the recovery. Defaults to
+60 seconds.
diff --git a/docs/reference/settings/configuring-xes.asciidoc b/docs/reference/settings/configuring-xes.asciidoc
@@ -6,7 +6,8 @@
 ++++
 
 include::{asciidoc-dir}/../../shared/settings.asciidoc[]
+include::ccr-settings.asciidoc[]
 include::license-settings.asciidoc[]
 include::ml-settings.asciidoc[]
-include::notification-settings.asciidoc[]
 include::sql-settings.asciidoc[]
+include::notification-settings.asciidoc[]