Commit 1e94e93

Expand following documentation in ccr overview (#39936)

This commit expands the ccr overview page to include more information about the lifecycle of following an index. It adds information linking to the remote recovery documentation and describes how an index can fall behind and how to fix it when this happens.

1 parent 8ede439 commit 1e94e93

File tree: 2 files changed (+115 -10 lines changed)


docs/reference/ccr/overview.asciidoc

Lines changed: 109 additions & 6 deletions
@@ -22,14 +22,51 @@ that {ccr} does not interfere with indexing on the leader index.
 
 Replication can be configured in two ways:
 
-* Manually using the
-{ref}/ccr-put-follow.html[create follower API]
+* Manually creating specific follower indices (in {kib} or by using the
+{ref}/ccr-put-follow.html[create follower API])
 
-* Automatically using
-<<ccr-auto-follow,auto-follow patterns>>
+* Automatically creating follower indices from auto-follow patterns (in {kib} or
+by using the {ref}/ccr-put-auto-follow-pattern.html[create auto-follow pattern API])
+
+For more information about managing {ccr} in {kib}, see
+{kibana-ref}/working-remote-clusters.html[Working with remote clusters].
 
 NOTE: You must also <<ccr-requirements,configure the leader index>>.
 
+When you initiate replication either manually or through an auto-follow pattern, the
+follower index is created on the local cluster. Once the follower index is created,
+the <<remote-recovery, remote recovery>> process copies all of the Lucene segment
+files from the remote cluster to the local cluster.
+
+By default, if you initiate following manually (by using {kib} or the create follower API),
+the recovery process is asynchronous with respect to the
+{ref}/ccr-put-follow.html[create follower request]. The request returns before
+the <<remote-recovery, remote recovery>> process completes. If you would like to wait for
+the process to complete, you can use the `wait_for_active_shards` parameter.
+
+//////////////////////////
+
+[source,js]
+--------------------------------------------------
+PUT /follower_index/_ccr/follow?wait_for_active_shards=1
+{
+  "remote_cluster" : "remote_cluster",
+  "leader_index" : "leader_index"
+}
+--------------------------------------------------
+// CONSOLE
+// TESTSETUP
+// TEST[setup:remote_cluster_and_leader_index]
+
+[source,js]
+--------------------------------------------------
+POST /follower_index/_ccr/pause_follow
+--------------------------------------------------
+// CONSOLE
+// TEARDOWN
+
+//////////////////////////
+
 [float]
 === The mechanics of replication
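For comparison with the hidden snippet above, which follows a single index manually, the auto-follow route registers a pattern once so that follower indices are created as matching leader indices appear. The sketch below is illustrative only and not part of this commit; the pattern name, index pattern, and follower naming template are assumptions.

[source,js]
--------------------------------------------------
PUT /_ccr/auto_follow/my_auto_follow_pattern
{
  "remote_cluster" : "remote_cluster",
  "leader_index_patterns" : ["leader_index*"],
  "follow_index_pattern" : "{{leader_index}}-follower"
}
--------------------------------------------------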

@@ -57,7 +94,7 @@ If a read request fails, the cause of the failure is inspected. If the
 cause of the failure is deemed to be a failure that can be recovered from (for
 example, a network failure), the follower shard task enters into a retry
 loop. Otherwise, the follower shard task is paused and requires user
-intervention before the it can be resumed with the
+intervention before it can be resumed with the
 {ref}/ccr-post-resume-follow.html[resume follower API].
 
 When operations are received by the follower shard task, they are placed in a
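For the paused case described in this hunk, resuming is a single request against the follower index once the underlying problem has been addressed; a minimal sketch (not part of this commit):

[source,js]
--------------------------------------------------
POST /follower_index/_ccr/resume_follow
--------------------------------------------------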
@@ -70,6 +107,10 @@ limits, no additional read requests are sent by the follower shard task. The
 follower shard task resumes sending read requests when the write buffer no
 longer exceeds its configured limits.
 
+NOTE: The intricacies of how operations are replicated from the leader are
+governed by settings that you can configure when you create the follower index
+in {kib} or by using the {ref}/ccr-put-follow.html[create follower API].
+
 Mapping updates applied to the leader index are automatically retrieved
 as-needed by the follower index.
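As a rough illustration of the note above, those replication settings are optional fields on the create follower request. The parameter names below come from the create follower API; the values are arbitrary examples rather than recommendations, and the snippet is not part of this commit.

[source,js]
--------------------------------------------------
PUT /follower_index/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster" : "remote_cluster",
  "leader_index" : "leader_index",
  "max_read_request_operation_count" : 5120,
  "max_outstanding_read_requests" : 12
}
--------------------------------------------------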

@@ -103,9 +144,71 @@ Using these APIs in tandem enables you to adjust the read and write parameters
 on the follower shard task if your initial configuration is not suitable for
 your use case.
 
+[float]
+=== Leader index retaining operations for replication
+
+If the follower is unable to replicate operations from a leader for a period of
+time, the following process can fail due to the leader lacking a complete history
+of operations necessary for replication.
+
+Operations replicated to the follower are identified using a sequence number
+generated when the operation was initially performed. Lucene segment files are
+occasionally merged in order to optimize searches and save space. When these
+merges occur, it is possible for operations associated with deleted or updated
+documents to be pruned during the merge. When the follower requests the sequence
+number for a pruned operation, the process fails because the operation is missing
+on the leader.
+
+This scenario is not possible in an append-only workflow. As documents are never
+deleted or updated, the underlying operation will not be pruned.
+
+Elasticsearch attempts to mitigate this potential issue for update workflows using
+a Lucene feature called soft deletes. When a document is updated or deleted, the
+underlying operation is retained in the Lucene index for a period of time. This
+period of time is governed by the `index.soft_deletes.retention_lease.period`
+setting, which can be <<ccr-requirements,configured on the leader index>>.
+
+When a follower initiates the index following, it acquires a retention lease from
+the leader. This informs the leader that it should not allow a soft delete to be
+pruned until either the follower indicates that it has received the operation or
+the lease expires. It is valuable to have monitoring in place to detect a follower
+replication issue prior to the lease expiring so that the problem can be remedied
+before the follower falls fatally behind.
+
+[float]
+=== Remedying a follower that has fallen behind
+
+If a follower falls sufficiently behind a leader that it can no longer replicate
+operations, this can be detected in {kib} or by using the
+{ref}/ccr-get-follow-stats.html[get follow stats API]. It is reported as an
+`indices[].fatal_exception`.
+
+In order to restart the follower, you must pause the following process, close the
+index, and create the follower index again. For example:
+
+["source","js"]
+----------------------------------------------------------------------
+POST /follower_index/_ccr/pause_follow
+
+POST /follower_index/_close
+
+PUT /follower_index/_ccr/follow?wait_for_active_shards=1
+{
+  "remote_cluster" : "remote_cluster",
+  "leader_index" : "leader_index"
+}
+----------------------------------------------------------------------
+// CONSOLE
+
+Re-creating the follower index is a destructive action. All of the existing Lucene
+segment files are deleted on the follower cluster. The
+<<remote-recovery, remote recovery>> process copies the Lucene segment
+files from the leader again. After the follower index initializes, the
+following process starts again.
+
 [float]
 === Terminating replication
 
 You can terminate replication with the
 {ref}/ccr-post-unfollow.html[unfollow API]. This API converts a follower index
-to a regular (non-follower) index.
+to a regular (non-follower) index.
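To spot the fatal case described in the new section above, the follow stats for the follower index can be polled; a minimal sketch (not part of this commit):

[source,js]
--------------------------------------------------
GET /follower_index/_ccr/stats
--------------------------------------------------

Per the text added in this commit, a follower that has fatally fallen behind reports the failure as `indices[].fatal_exception` in the response.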

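For the terminating-replication path, the unfollow API expects the following process to be paused and the index to be closed first; reopening the index afterwards is an assumption about typical usage rather than part of this commit.

[source,js]
--------------------------------------------------
POST /follower_index/_ccr/pause_follow

POST /follower_index/_close

POST /follower_index/_unfollow

POST /follower_index/_open
--------------------------------------------------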
docs/reference/ccr/requirements.asciidoc

Lines changed: 6 additions & 4 deletions
@@ -34,11 +34,13 @@ Whether or not soft deletes are enabled on the index. Soft deletes can only be
 configured at index creation and only on indices created on or after 6.5.0. The
 default value is `false`.
 
-`index.soft_deletes.retention.operations`::
+`index.soft_deletes.retention_lease.period`::
 
-The number of soft deletes to retain. Soft deletes are collected during merges
-on the underlying Lucene index yet retained up to the number of operations
-configured by this setting. The default value is `0`.
+The maximum period to retain a shard history retention lease before it is considered
+expired. Shard history retention leases ensure that soft deletes are retained during
+merges on the Lucene index. If a soft delete is merged away before it can be replicated
+to a follower, the following process will fail due to incomplete history on the leader.
+The default value is `12h`.
 
 For more information about index settings, see {ref}/index-modules.html[Index modules].
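Both leader-side settings above are set at index creation. A minimal sketch of a leader index configured for {ccr}, where the index name and the `24h` value are illustrative assumptions rather than part of this commit:

[source,js]
--------------------------------------------------
PUT /leader_index
{
  "settings" : {
    "index.soft_deletes.enabled" : true,
    "index.soft_deletes.retention_lease.period" : "24h"
  }
}
--------------------------------------------------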
