Skip to content

Commit ee41b22

Browse files
authored
Add documentation on remote recovery (#39483)
This is related to #35975. It adds documentation on the remote recovery process. Additionally, it adds documentation about the various settings that can impact the process.
1 parent fef2153 commit ee41b22

File tree

5 files changed

+89
-1
lines changed

5 files changed

+89
-1
lines changed

docs/reference/ccr/getting-started.asciidoc

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -253,6 +253,11 @@ PUT /server-metrics-copy/_ccr/follow?wait_for_active_shards=1
253253
254254
//////////////////////////
255255

256+
The follower index is initialized using the <<remote-recovery, remote recovery>>
257+
process. The remote recovery process transfers the existing Lucene segment files
258+
from the leader to the follower. When the remote recovery process is complete,
259+
the index following begins.
260+
256261
Now when you index documents into your leader index, you will see these
257262
documents replicated in the follower index. You can
258263
inspect the status of replication using the

docs/reference/ccr/index.asciidoc

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -29,3 +29,4 @@ include::overview.asciidoc[]
2929
include::requirements.asciidoc[]
3030
include::auto-follow.asciidoc[]
3131
include::getting-started.asciidoc[]
32+
include::remote-recovery.asciidoc[]
Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
[role="xpack"]
2+
[testenv="platinum"]
3+
[[remote-recovery]]
4+
== Remote recovery
5+
6+
When you create a follower index, you cannot use it until it is fully initialized.
7+
The _remote recovery_ process builds a new copy of a shard on a follower node by
8+
copying data from the primary shard in the leader cluster. {es} uses this remote
9+
recovery process to bootstrap a follower index using the data from the leader index.
10+
This process provides the follower with a copy of the current state of the leader index,
11+
even if a complete history of changes is not available on the leader due to Lucene
12+
segment merging.
13+
14+
Remote recovery is a network intensive process that transfers all of the Lucene
15+
segment files from the leader cluster to the follower cluster. The follower
16+
requests that a recovery session be initiated on the primary shard in the leader
17+
cluster. The follower then requests file chunks concurrently from the leader. By
18+
default, the process concurrently requests `5` large `1mb` file chunks. This default
19+
behavior is designed to support leader and follower clusters with high network latency
20+
between them.
21+
22+
There are dynamic settings that you can use to rate-limit the transmitted data
23+
and manage the resources consumed by remote recoveries. See
24+
{ref}/ccr-settings.html[{ccr-cap} settings].
25+
26+
You can obtain information about an in-progress remote recovery by using the
27+
{ref}/cat-recovery.html[recovery API] on the follower cluster. Remote recoveries
28+
are implemented using the {ref}/modules-snapshots.html[snapshot and restore] infrastructure. This means that on-going remote recoveries are labelled as type
29+
`snapshot` in the recovery API.
Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
[role="xpack"]
2+
[[ccr-settings]]
3+
=== {ccr-cap} settings
4+
5+
These {ccr} settings can be dynamically updated on a live cluster with the
6+
<<cluster-update-settings,cluster update settings API>>.
7+
8+
[float]
9+
[[ccr-recovery-settings]]
10+
==== Remote recovery settings
11+
12+
The following setting can be used to rate-limit the data transmitted during
13+
{stack-ov}/remote-recovery.html[remote recoveries]:
14+
15+
`ccr.indices.recovery.max_bytes_per_sec` (<<cluster-update-settings,Dynamic>>)::
16+
Limits the total inbound and outbound remote recovery traffic on each node.
17+
Since this limit applies on each node, but there may be many nodes performing
18+
remote recoveries concurrently, the total amount of remote recovery bytes may be
19+
much higher than this limit. If you set this limit too high then there is a risk
20+
that ongoing remote recoveries will consume an excess of bandwidth (or other
21+
resources) which could destabilize the cluster. This setting is used by both the
22+
leader and follower clusters. For example if it is set to `20mb` on a leader,
23+
the leader will only send `20mb/s` to the follower even if the follower is
24+
requesting and can accept `60mb/s`. Defaults to `40mb`.
25+
26+
[float]
27+
[[ccr-advanced-recovery-settings]]
28+
==== Advanced remote recovery settings
29+
30+
The following _expert_ settings can be set to manage the resources consumed by
31+
remote recoveries:
32+
33+
`ccr.indices.recovery.max_concurrent_file_chunks` (<<cluster-update-settings,Dynamic>>)::
34+
Controls the number of file chunk requests that can be sent in parallel per
35+
recovery. As multiple remote recoveries might already running in parallel,
36+
increasing this expert-level setting might only help in situations where remote
37+
recovery of a single shard is not reaching the total inbound and outbound remote recovery traffic as configured by `ccr.indices.recovery.max_bytes_per_sec`.
38+
Defaults to `5`. The maximum allowed value is `10`.
39+
40+
`ccr.indices.recovery.chunk_size`(<<cluster-update-settings,Dynamic>>)::
41+
Controls the chunk size requested by the follower during file transfer. Defaults to
42+
`1mb`.
43+
44+
`ccr.indices.recovery.recovery_activity_timeout`(<<cluster-update-settings,Dynamic>>)::
45+
Controls the timeout for recovery activity. This timeout primarily applies on
46+
the leader cluster. The leader cluster must open resources in-memory to supply
47+
data to the follower during the recovery process. If the leader does not receive recovery requests from the follower for this period of time, it will close the resources. Defaults to 60 seconds.
48+
49+
`ccr.indices.recovery.internal_action_timeout` (<<cluster-update-settings,Dynamic>>)::
50+
Controls the timeout for individual network requests during the remote recovery
51+
process. An individual action timing out can fail the recovery. Defaults to
52+
60 seconds.

docs/reference/settings/configuring-xes.asciidoc

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,8 @@
66
++++
77

88
include::{asciidoc-dir}/../../shared/settings.asciidoc[]
9+
include::ccr-settings.asciidoc[]
910
include::license-settings.asciidoc[]
1011
include::ml-settings.asciidoc[]
11-
include::notification-settings.asciidoc[]
1212
include::sql-settings.asciidoc[]
13+
include::notification-settings.asciidoc[]

0 commit comments

Comments
 (0)