Commit 0e2a53e

Docs for translog, history retention and flushing (#46245)
This commit updates the docs about translog retention and flushing to reflect recent changes in how peer recoveries work. It also adds some docs to describe how history is retained for replay using soft deletes and shard history retention leases. Relates #45473
1 parent 5e682a0 commit 0e2a53e

4 files changed, +219 -108 lines changed

docs/reference/index-modules.asciidoc

+6
@@ -280,6 +280,10 @@ Other index settings are available in index modules:

 Control over the transaction log and background flush operations.

+<<index-modules-history-retention,History retention>>::
+
+Control over the retention of a history of operations in the index.
+
 [float]
 [[x-pack-index-settings]]
 === [xpack]#{xpack} index settings#
@@ -305,4 +309,6 @@ include::index-modules/store.asciidoc[]

 include::index-modules/translog.asciidoc[]

+include::index-modules/history-retention.asciidoc[]
+
 include::index-modules/index-sorting.asciidoc[]

docs/reference/index-modules/history-retention.asciidoc

+72
@@ -0,0 +1,72 @@
+[[index-modules-history-retention]]
+== History retention
+
+{es} sometimes needs to replay some of the operations that were performed on a
+shard. For instance, if a replica is briefly offline then it may be much more
+efficient to replay the few operations it missed while it was offline than to
+rebuild it from scratch. Similarly, {ccr} works by performing operations on the
+leader cluster and then replaying those operations on the follower cluster.
+
+At the Lucene level there are really only two write operations that {es}
+performs on an index: a new document may be indexed, or an existing document may
+be deleted. Updates are implemented by atomically deleting the old document and
+then indexing the new document. A document indexed into Lucene already contains
+all the information needed to replay that indexing operation, but this is not
+true of document deletions. To solve this, {es} uses a feature called _soft
+deletes_ to preserve recent deletions in the Lucene index so that they can be
+replayed.
+
+{es} only preserves certain recently-deleted documents in the index because a
+soft-deleted document still takes up some space. Eventually {es} will fully
+discard these soft-deleted documents to free up that space so that the index
+does not grow larger and larger over time. Fortunately {es} does not need to be
+able to replay every operation that has ever been performed on a shard, because
+it is always possible to make a full copy of a shard on a remote node. However,
+copying the whole shard may take much longer than replaying a few missing
+operations, so {es} tries to retain all of the operations it expects to need to
+replay in future.
+
+{es} keeps track of the operations it expects to need to replay in future using
+a mechanism called _shard history retention leases_. Each shard copy that might
+need operations to be replayed must first create a shard history retention lease
+for itself. For example, this shard copy might be a replica of a shard or it
+might be a shard of a follower index when using {ccr}. Each retention lease
+keeps track of the sequence number of the first operation that the corresponding
+shard copy has not received. As the shard copy receives new operations, it
+increases the sequence number contained in its retention lease to indicate that
+it will not need to replay those operations in future. {es} discards
+soft-deleted operations once they are not being held by any retention lease.
+
+If a shard copy fails then it stops updating its shard history retention lease,
+which means that {es} will preserve all new operations so they can be replayed
+when the failed shard copy recovers. However, retention leases only last for a
+limited amount of time. If the shard copy does not recover quickly enough then
+its retention lease may expire. This protects {es} from retaining history
+forever if a shard copy fails permanently, because once a retention lease has
+expired {es} can start to discard history again. If a shard copy recovers after
+its retention lease has expired then {es} will fall back to copying the whole
+index since it can no longer simply replay the missing history. The expiry time
+of a retention lease defaults to `12h` which should be long enough for most
+reasonable recovery scenarios.
+
+Soft deletes are enabled by default on indices created in recent versions, but
+they can be explicitly enabled or disabled at index creation time. If soft
+deletes are disabled then peer recoveries can still sometimes take place by
+copying just the missing operations from the translog
+<<index-modules-translog-retention,as long as those operations are retained
+there>>. {ccr-cap} will not function if soft deletes are disabled.
+
+[float]
+=== History retention settings
+
+`index.soft_deletes.enabled`::
+
+Whether or not soft deletes are enabled on the index. Soft deletes can only be
+configured at index creation and only on indices created on or after 6.5.0.
+The default value is `true`.
+
+`index.soft_deletes.retention_lease.period`::
+
+The maximum length of time to retain a shard history retention lease before
+it expires and the history that it retains can be discarded. The default
+value is `12h`.
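
Reviewer note: the retention lease behaviour described in the new history-retention page can be sketched as a toy model. The Python below is only an illustration under simplifying assumptions. `RetentionLease`, `live_leases` and `min_retained_seq_no` are hypothetical names, not Elasticsearch APIs, and the expiry rule is reduced to "not renewed within the lease period".

```python
from dataclasses import dataclass

# Toy model only: these names are hypothetical and are not Elasticsearch APIs.
LEASE_PERIOD_SECONDS = 12 * 60 * 60  # mirrors the documented `12h` default


@dataclass
class RetentionLease:
    lease_id: str
    retaining_seq_no: int  # first operation the shard copy has not received
    renewed_at: float      # time of the last renewal, in seconds


def live_leases(leases, now, period=LEASE_PERIOD_SECONDS):
    """Return the leases that were renewed within the last `period` seconds."""
    return [lease for lease in leases if now - lease.renewed_at <= period]


def min_retained_seq_no(leases, now, max_seq_no, period=LEASE_PERIOD_SECONDS):
    """Lowest sequence number that some live lease still holds.

    Operations below this value are held by no live lease, so in this model
    they may be discarded. With no live leases nothing has to be retained.
    """
    live = live_leases(leases, now, period)
    if not live:
        return max_seq_no + 1
    return min(lease.retaining_seq_no for lease in live)


HOUR = 3600
leases = [
    RetentionLease("replica", retaining_seq_no=120, renewed_at=0),
    RetentionLease("follower", retaining_seq_no=80, renewed_at=0),
]
# Both leases are live, so history is kept from the lagging follower onwards.
print(min_retained_seq_no(leases, now=1 * HOUR, max_seq_no=150))  # 80

# The follower stops renewing; after 13 hours its lease has expired and only
# the replica's (renewed and advanced) lease still holds history.
leases[0] = RetentionLease("replica", retaining_seq_no=151, renewed_at=13 * HOUR)
print(min_retained_seq_no(leases, now=13 * HOUR, max_seq_no=150))  # 151
```

In this model a follower that returns after its lease expired would find that operations 80 to 150 may already be gone, which corresponds to the documented fallback of copying the whole index.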

docs/reference/index-modules/translog.asciidoc

+60-48
@@ -7,55 +7,57 @@ delete operation. Changes that happen after one commit and before another will
 be removed from the index by Lucene in the event of process exit or hardware
 failure.

-Because Lucene commits are too expensive to perform on every individual change,
-each shard copy also has a _transaction log_ known as its _translog_ associated
-with it. All index and delete operations are written to the translog after
+Lucene commits are too expensive to perform on every individual change, so each
+shard copy also writes operations into its _transaction log_ known as the
+_translog_. All index and delete operations are written to the translog after
 being processed by the internal Lucene index but before they are acknowledged.
-In the event of a crash, recent transactions that have been acknowledged but
-not yet included in the last Lucene commit can instead be recovered from the
-translog when the shard recovers.
+In the event of a crash, recent operations that have been acknowledged but not
+yet included in the last Lucene commit are instead recovered from the translog
+when the shard recovers.

-An Elasticsearch flush is the process of performing a Lucene commit and
-starting a new translog. Flushes are performed automatically in the background
-in order to make sure the translog doesn't grow too large, which would make
-replaying its operations take a considerable amount of time during recovery.
-The ability to perform a flush manually is also exposed through an API,
-although this is rarely needed.
+An {es} <<indices-flush,flush>> is the process of performing a Lucene commit and
+starting a new translog generation. Flushes are performed automatically in the
+background in order to make sure the translog does not grow too large, which
+would make replaying its operations take a considerable amount of time during
+recovery. The ability to perform a flush manually is also exposed through an
+API, although this is rarely needed.

 [float]
 === Translog settings

 The data in the translog is only persisted to disk when the translog is
-++fsync++ed and committed. In the event of a hardware failure or an operating
+++fsync++ed and committed. In the event of a hardware failure or an operating
 system crash or a JVM crash or a shard failure, any data written since the
 previous translog commit will be lost.

-By default, `index.translog.durability` is set to `request` meaning that Elasticsearch will only report success of an index, delete,
-update, or bulk request to the client after the translog has been successfully
-++fsync++ed and committed on the primary and on every allocated replica. If
-`index.translog.durability` is set to `async` then Elasticsearch ++fsync++s
-and commits the translog every `index.translog.sync_interval` (defaults to 5 seconds).
+By default, `index.translog.durability` is set to `request` meaning that
+Elasticsearch will only report success of an index, delete, update, or bulk
+request to the client after the translog has been successfully ++fsync++ed and
+committed on the primary and on every allocated replica. If
+`index.translog.durability` is set to `async` then Elasticsearch ++fsync++s and
+commits the translog only every `index.translog.sync_interval` which means that
+any operations that were performed just before a crash may be lost when the node
+recovers.

 The following <<indices-update-settings,dynamically updatable>> per-index
 settings control the behaviour of the translog:

 `index.translog.sync_interval`::

-How often the translog is ++fsync++ed to disk and committed, regardless of
-write operations. Defaults to `5s`. Values less than `100ms` are not allowed.
+How often the translog is ++fsync++ed to disk and committed, regardless of
+write operations. Defaults to `5s`. Values less than `100ms` are not allowed.

 `index.translog.durability`::
 +
 --

 Whether or not to `fsync` and commit the translog after every index, delete,
-update, or bulk request. This setting accepts the following parameters:
+update, or bulk request. This setting accepts the following parameters:

 `request`::

-(default) `fsync` and commit after every request. In the event
-of hardware failure, all acknowledged writes will already have been
-committed to disk.
+(default) `fsync` and commit after every request. In the event of hardware
+failure, all acknowledged writes will already have been committed to disk.

 `async`::

@@ -66,33 +68,43 @@ update, or bulk request. This setting accepts the following parameters:

 `index.translog.flush_threshold_size`::

-The translog stores all operations that are not yet safely persisted in Lucene
-(i.e., are not part of a Lucene commit point). Although these operations are
-available for reads, they will need to be reindexed if the shard was to
-shutdown and has to be recovered. This settings controls the maximum total size
-of these operations, to prevent recoveries from taking too long. Once the
-maximum size has been reached a flush will happen, generating a new Lucene
-commit point. Defaults to `512mb`.
+The translog stores all operations that are not yet safely persisted in Lucene
+(i.e., are not part of a Lucene commit point). Although these operations are
+available for reads, they will need to be replayed if the shard was stopped
+and had to be recovered. This setting controls the maximum total size of these
+operations, to prevent recoveries from taking too long. Once the maximum size
+has been reached a flush will happen, generating a new Lucene commit point.
+Defaults to `512mb`.

-`index.translog.retention.size`::
-
-When soft deletes is disabled (enabled by default in 7.0 or later),
-`index.translog.retention.size` controls the total size of translog files to keep.
-Keeping more translog files increases the chance of performing an operation based
-sync when recovering replicas. If the translog files are not sufficient,
-replica recovery will fall back to a file based sync. Defaults to `512mb`
+[float]
+[[index-modules-translog-retention]]
+==== Translog retention
+
+If an index is not using <<index-modules-history-retention,soft deletes>> to
+retain historical operations then {es} recovers each replica shard by replaying
+operations from the primary's translog. This means it is important for the
+primary to preserve extra operations in its translog in case it needs to
+rebuild a replica. Moreover it is important for each replica to preserve extra
+operations in its translog in case it is promoted to primary and then needs to
+rebuild its own replicas in turn. The following settings control how much
+translog is retained for peer recoveries.

-Both `index.translog.retention.size` and `index.translog.retention.age` should not
-be specified unless soft deletes is disabled as they will be ignored.
+`index.translog.retention.size`::

+This controls the total size of translog files to keep for each shard.
+Keeping more translog files increases the chance of performing an operation
+based sync when recovering a replica. If the translog files are not
+sufficient, replica recovery will fall back to a file based sync. Defaults to
+`512mb`. This setting is ignored, and should not be set, if soft deletes are
+enabled. Soft deletes are enabled by default in indices created in {es}
+versions 7.0.0 and later.

 `index.translog.retention.age`::

-When soft deletes is disabled (enabled by default in 7.0 or later),
-`index.translog.retention.age` controls the maximum duration for which translog
-files to keep. Keeping more translog files increases the chance of performing an
-operation based sync when recovering replicas. If the translog files are not sufficient,
-replica recovery will fall back to a file based sync. Defaults to `12h`
-
-Both `index.translog.retention.size` and `index.translog.retention.age` should not
-be specified unless soft deletes is disabled as they will be ignored.
+This controls the maximum duration for which translog files are kept by each
+shard. Keeping more translog files increases the chance of performing an
+operation based sync when recovering replicas. If the translog files are not
+sufficient, replica recovery will fall back to a file based sync. Defaults to
+`12h`. This setting is ignored, and should not be set, if soft deletes are
+enabled. Soft deletes are enabled by default in indices created in {es}
+versions 7.0.0 and later.
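
Reviewer note: to make the interaction between these settings concrete, here is a small Python sketch that assembles a settings body from the setting names and constraints stated in the diff. `build_index_settings` is a hypothetical helper, not part of any Elasticsearch client, and the output is only assumed to be the shape one would send when creating an index.

```python
import json

# Hypothetical helper for illustration; not part of any Elasticsearch client.
MIN_SYNC_INTERVAL_MS = 100  # the docs state values below `100ms` are not allowed


def build_index_settings(durability="request", sync_interval_ms=5000,
                         soft_deletes_enabled=True,
                         retention_size="512mb", retention_age="12h"):
    """Assemble a settings dict using the setting names from the docs above."""
    if durability not in ("request", "async"):
        raise ValueError("durability must be 'request' or 'async'")
    if sync_interval_ms < MIN_SYNC_INTERVAL_MS:
        raise ValueError("index.translog.sync_interval must be at least 100ms")

    settings = {
        "index.soft_deletes.enabled": soft_deletes_enabled,
        "index.translog.durability": durability,
        "index.translog.sync_interval": f"{sync_interval_ms}ms",
    }
    # The translog retention settings are documented as ignored when soft
    # deletes are enabled, so they are only included when soft deletes are off.
    if not soft_deletes_enabled:
        settings["index.translog.retention.size"] = retention_size
        settings["index.translog.retention.age"] = retention_age
    return {"settings": settings}


body = build_index_settings(durability="async", soft_deletes_enabled=False)
print(json.dumps(body, indent=2))
```

Leaving the retention settings out when soft deletes are enabled follows the "should not be set" wording in the new text. The defaults used here are the ones quoted in the diff.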
