diff --git a/docs/reference/index-modules.asciidoc b/docs/reference/index-modules.asciidoc
index b3729f8477df8..93da06d9b9674 100644
--- a/docs/reference/index-modules.asciidoc
+++ b/docs/reference/index-modules.asciidoc
@@ -280,6 +280,10 @@ Other index settings are available in index modules:
 
     Control over the transaction log and background flush operations.
 
+<<index-modules-history-retention,History retention>>::
+
+    Control over the retention of a history of operations in the index.
+
 [float]
 [[x-pack-index-settings]]
 === [xpack]#{xpack} index settings#
@@ -305,4 +309,6 @@ include::index-modules/store.asciidoc[]
 
 include::index-modules/translog.asciidoc[]
 
+include::index-modules/history-retention.asciidoc[]
+
 include::index-modules/index-sorting.asciidoc[]
diff --git a/docs/reference/index-modules/history-retention.asciidoc b/docs/reference/index-modules/history-retention.asciidoc
new file mode 100644
index 0000000000000..94e17e49251e3
--- /dev/null
+++ b/docs/reference/index-modules/history-retention.asciidoc
@@ -0,0 +1,72 @@
+[[index-modules-history-retention]]
+== History retention
+
+{es} sometimes needs to replay some of the operations that were performed on a
+shard. For instance, if a replica is briefly offline then it may be much more
+efficient to replay the few operations it missed while it was offline than to
+rebuild it from scratch. Similarly, {ccr} works by performing operations on the
+leader cluster and then replaying those operations on the follower cluster.
+
+At the Lucene level there are really only two write operations that {es}
+performs on an index: a new document may be indexed, or an existing document may
+be deleted. Updates are implemented by atomically deleting the old document and
+then indexing the new document. A document indexed into Lucene already contains
+all the information needed to replay that indexing operation, but this is not
+true of document deletions. To solve this, {es} uses a feature called _soft
+deletes_ to preserve recent deletions in the Lucene index so that they can be
+replayed.
+
+{es} only preserves certain recently-deleted documents in the index because a
+soft-deleted document still takes up some space. Eventually {es} will fully
+discard these soft-deleted documents to free up that space so that the index
+does not grow larger and larger over time. Fortunately {es} does not need to be
+able to replay every operation that has ever been performed on a shard, because
+it is always possible to make a full copy of a shard on a remote node. However,
+copying the whole shard may take much longer than replaying a few missing
+operations, so {es} tries to retain all of the operations it expects to need to
+replay in future.
+
+{es} keeps track of the operations it expects to need to replay in future using
+a mechanism called _shard history retention leases_. Each shard copy that might
+need operations to be replayed must first create a shard history retention lease
+for itself. For example, this shard copy might be a replica of a shard or it
+might be a shard of a follower index when using {ccr}. Each retention lease
+keeps track of the sequence number of the first operation that the corresponding
+shard copy has not received. As the shard copy receives new operations, it
+increases the sequence number contained in its retention lease to indicate that
+it will not need to replay those operations in future. {es} discards
+soft-deleted operations once they are not being held by any retention lease.
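+
+As an illustration, you can inspect the retention leases that are currently
+held for the shards of an index using the shard-level stats API. The index
+name `twitter` is just an example here, and the exact layout of the
+`retention_leases` section of the response may vary between versions:
+
+[source,sh]
+--------------------------------------------------
+GET twitter/_stats?level=shards&filter_path=**.retention_leases
+--------------------------------------------------
+// CONSOLE
+// TEST[setup:twitter]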
+
+If a shard copy fails then it stops updating its shard history retention lease,
+which means that {es} will preserve all new operations so they can be replayed
+when the failed shard copy recovers. However, retention leases only last for a
+limited amount of time. If the shard copy does not recover quickly enough then
+its retention lease may expire. This protects {es} from retaining history
+forever if a shard copy fails permanently, because once a retention lease has
+expired {es} can start to discard history again. If a shard copy recovers after
+its retention lease has expired then {es} will fall back to copying the whole
+index since it can no longer simply replay the missing history. The expiry time
+of a retention lease defaults to `12h`, which should be long enough for most
+reasonable recovery scenarios.
+
+Soft deletes are enabled by default on indices created in recent versions, but
+they can be explicitly enabled or disabled at index creation time. If soft
+deletes are disabled then peer recoveries can still sometimes take place by
+copying just the missing operations from the translog
+<<index-modules-translog-retention,as long as those operations are retained
+there>>. {ccr-cap} will not function if soft deletes are disabled.
+
+[float]
+=== History retention settings
+
+`index.soft_deletes.enabled`::
+
+    Whether or not soft deletes are enabled on the index. Soft deletes can only
+    be configured at index creation and only on indices created on or after
+    6.5.0. The default value is `true`.
+
+`index.soft_deletes.retention_lease.period`::
+
+    The maximum length of time to retain a shard history retention lease before
+    it expires and the history that it retains can be discarded. The default
+    value is `12h`.
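+
+For example, the following request sketches how these settings might be
+applied when creating an index. The index name `my_index` and the `24h` period
+are illustrative values only:
+
+[source,js]
+--------------------------------------------------
+PUT my_index
+{
+  "settings": {
+    "index.soft_deletes.enabled": true,
+    "index.soft_deletes.retention_lease.period": "24h"
+  }
+}
+--------------------------------------------------
+// CONSOLE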
diff --git a/docs/reference/index-modules/translog.asciidoc b/docs/reference/index-modules/translog.asciidoc
index 414ac59f0ba27..48947b348a47c 100644
--- a/docs/reference/index-modules/translog.asciidoc
+++ b/docs/reference/index-modules/translog.asciidoc
@@ -7,55 +7,57 @@ delete operation. Changes that happen after one commit and before another will
 be removed from the index by Lucene in the event of process exit or hardware
 failure.
 
-Because Lucene commits are too expensive to perform on every individual change,
-each shard copy also has a _transaction log_ known as its _translog_ associated
-with it. All index and delete operations are written to the translog after
+Lucene commits are too expensive to perform on every individual change, so each
+shard copy also writes operations into its _transaction log_ known as the
+_translog_. All index and delete operations are written to the translog after
 being processed by the internal Lucene index but before they are acknowledged.
-In the event of a crash, recent transactions that have been acknowledged but
-not yet included in the last Lucene commit can instead be recovered from the
-translog when the shard recovers.
+In the event of a crash, recent operations that have been acknowledged but not
+yet included in the last Lucene commit are instead recovered from the translog
+when the shard recovers.
 
-An Elasticsearch flush is the process of performing a Lucene commit and
-starting a new translog. Flushes are performed automatically in the background
-in order to make sure the translog doesn't grow too large, which would make
-replaying its operations take a considerable amount of time during recovery.
-The ability to perform a flush manually is also exposed through an API,
-although this is rarely needed.
+An {es} <<indices-flush,flush>> is the process of performing a Lucene commit and
+starting a new translog generation. Flushes are performed automatically in the
+background in order to make sure the translog does not grow too large, which
+would make replaying its operations take a considerable amount of time during
+recovery. The ability to perform a flush manually is also exposed through an
+API, although this is rarely needed.
 
 [float]
 === Translog settings
 
 The data in the translog is only persisted to disk when the translog is
-++fsync++ed and committed.  In the event of a hardware failure or an operating
+++fsync++ed and committed. In the event of a hardware failure or an operating
 system crash or a JVM crash or a shard failure, any data written since the
 previous translog commit will be lost.
 
-By default, `index.translog.durability` is set to `request` meaning that Elasticsearch will only report success of an index, delete,
-update, or bulk request to the client after the translog has been successfully
-++fsync++ed and committed on the primary and on every allocated replica. If
-`index.translog.durability` is set to `async` then Elasticsearch ++fsync++s
-and commits the translog every `index.translog.sync_interval` (defaults to 5 seconds).
+By default, `index.translog.durability` is set to `request`, meaning that
+Elasticsearch will only report success of an index, delete, update, or bulk
+request to the client after the translog has been successfully ++fsync++ed and
+committed on the primary and on every allocated replica. If
+`index.translog.durability` is set to `async` then Elasticsearch ++fsync++s and
+commits the translog only every `index.translog.sync_interval`, which means that
+any operations that were performed just before a crash may be lost when the node
+recovers.
 
 The following <<indices-update-settings,dynamically updatable>> per-index
 settings control the behaviour of the translog:
 
 `index.translog.sync_interval`::
 
-How often the translog is ++fsync++ed to disk and committed, regardless of
-write operations. Defaults to `5s`. Values less than `100ms` are not allowed.
+    How often the translog is ++fsync++ed to disk and committed, regardless of
+    write operations. Defaults to `5s`. Values less than `100ms` are not allowed.
 
 `index.translog.durability`::
 +
 --
 Whether or not to `fsync` and commit the translog after every index, delete,
-update, or bulk request.  This setting accepts the following parameters:
+update, or bulk request. This setting accepts the following parameters:
 
 `request`::
 
-    (default) `fsync` and commit after every request. In the event
-    of hardware failure, all acknowledged writes will already have been
-    committed to disk.
+    (default) `fsync` and commit after every request. In the event of hardware
+    failure, all acknowledged writes will already have been committed to disk.
 
 `async`::
@@ -66,33 +68,43 @@ update, or bulk request. This setting accepts the following parameters:
 
 `index.translog.flush_threshold_size`::
 
-The translog stores all operations that are not yet safely persisted in Lucene
-(i.e., are not part of a Lucene commit point). Although these operations are
-available for reads, they will need to be reindexed if the shard was to
-shutdown and has to be recovered. This settings controls the maximum total size
-of these operations, to prevent recoveries from taking too long. Once the
-maximum size has been reached a flush will happen, generating a new Lucene
-commit point. Defaults to `512mb`.
+    The translog stores all operations that are not yet safely persisted in
+    Lucene (i.e., are not part of a Lucene commit point). Although these
+    operations are available for reads, they will need to be replayed if the
+    shard was stopped and had to be recovered. This setting controls the maximum
+    total size of these operations, to prevent recoveries from taking too long.
+    Once the maximum size has been reached a flush will happen, generating a new
+    Lucene commit point. Defaults to `512mb`.
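+
+For example, here is a sketch of how you might trade durability for cheaper
+writes on an existing index by updating `index.translog.durability`
+dynamically. The index name `twitter` is just an example; with `async`
+durability, operations acknowledged since the last ++fsync++ may be lost if
+the node crashes:
+
+[source,js]
+--------------------------------------------------
+PUT twitter/_settings
+{
+  "index.translog.durability": "async"
+}
+--------------------------------------------------
+// CONSOLE
+// TEST[setup:twitter]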
 
-`index.translog.retention.size`::
-
-When soft deletes is disabled (enabled by default in 7.0 or later),
-`index.translog.retention.size` controls the total size of translog files to keep.
-Keeping more translog files increases the chance of performing an operation based
-sync when recovering replicas. If the translog files are not sufficient,
-replica recovery will fall back to a file based sync. Defaults to `512mb`
+[float]
+[[index-modules-translog-retention]]
+==== Translog retention
+
+If an index is not using <<index-modules-history-retention,soft deletes>> to
+retain historical operations then {es} recovers each replica shard by replaying
+operations from the primary's translog. This means it is important for the
+primary to preserve extra operations in its translog in case it needs to
+rebuild a replica. Moreover, it is important for each replica to preserve extra
+operations in its translog in case it is promoted to primary and then needs to
+rebuild its own replicas in turn. The following settings control how much
+translog is retained for peer recoveries.
 
-Both `index.translog.retention.size` and `index.translog.retention.age` should not
-be specified unless soft deletes is disabled as they will be ignored.
+`index.translog.retention.size`::
+
+    This controls the total size of translog files to keep for each shard.
+    Keeping more translog files increases the chance of performing an
+    operation-based sync when recovering a replica. If the translog files are
+    not sufficient, replica recovery will fall back to a file-based sync.
+    Defaults to `512mb`. This setting is ignored, and should not be set, if
+    soft deletes are enabled. Soft deletes are enabled by default in indices
+    created in {es} versions 7.0.0 and later.
 
 `index.translog.retention.age`::
 
-When soft deletes is disabled (enabled by default in 7.0 or later),
-`index.translog.retention.age` controls the maximum duration for which translog
-files to keep. Keeping more translog files increases the chance of performing an
-operation based sync when recovering replicas. If the translog files are not sufficient,
-replica recovery will fall back to a file based sync. Defaults to `12h`
-
-Both `index.translog.retention.size` and `index.translog.retention.age` should not
-be specified unless soft deletes is disabled as they will be ignored.
+    This controls the maximum duration for which translog files are kept by
+    each shard. Keeping more translog files increases the chance of performing
+    an operation-based sync when recovering replicas. If the translog files are
+    not sufficient, replica recovery will fall back to a file-based sync.
+    Defaults to `12h`. This setting is ignored, and should not be set, if soft
+    deletes are enabled. Soft deletes are enabled by default in indices created
+    in {es} versions 7.0.0 and later.
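+
+For example, on an index that has soft deletes disabled you might increase the
+retained translog to improve the chances of an operation-based recovery. This
+is only a sketch with illustrative values, and it has no effect on indices
+that use soft deletes:
+
+[source,js]
+--------------------------------------------------
+PUT twitter/_settings
+{
+  "index.translog.retention.size": "1gb",
+  "index.translog.retention.age": "24h"
+}
+--------------------------------------------------
+// CONSOLE
+// TEST[setup:twitter]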
diff --git a/docs/reference/indices/flush.asciidoc b/docs/reference/indices/flush.asciidoc
index 29a5b6d4d28a4..333cbc98b8d33 100644
--- a/docs/reference/indices/flush.asciidoc
+++ b/docs/reference/indices/flush.asciidoc
@@ -1,13 +1,26 @@
 [[indices-flush]]
 === Flush
 
-The flush API allows to flush one or more indices through an API. The
-flush process of an index makes sure that any data that is currently only
-persisted in the <<index-modules-translog,transaction log>> is also permanently
-persisted in Lucene. This reduces recovery times as that data doesn't need to be
-reindexed from the transaction logs after the Lucene indexed is opened. By
-default, Elasticsearch uses heuristics in order to automatically
-trigger flushes as required. It is rare for users to need to call the API directly.
+Flushing an index is the process of making sure that any data that is currently
+only stored in the <<index-modules-translog,transaction log>> is also
+permanently stored in the Lucene index. When restarting, {es} replays any
+unflushed operations from the transaction log into the Lucene index to bring it
+back into the state that it was in before the restart. {es} automatically
+triggers flushes as needed, using heuristics that trade off the size of the
+unflushed transaction log against the cost of performing each flush.
+
+Once each operation has been flushed it is permanently stored in the Lucene
+index. This may mean that there is no need to maintain an additional copy of it
+in the transaction log, unless <<index-modules-translog-retention,it is
+retained for some other reason>>. The transaction log is made up of multiple
+files, called _generations_, and {es} will delete any generation files once
+they are no longer needed, freeing up disk space.
+
+It is also possible to trigger a flush on one or more indices using the flush
+API, although it is rare for users to need to call this API directly. If you
+call the flush API after indexing some documents then a successful response
+indicates that {es} has flushed all the documents that were indexed before the
+flush API was called.
 
 [source,js]
 --------------------------------------------------
 POST twitter/_flush
 --------------------------------------------------
 // CONSOLE
 // TEST[setup:twitter]
 
@@ -23,20 +36,22 @@ POST twitter/_flush
 The flush API accepts the following request parameters:
 
 [horizontal]
-`wait_if_ongoing`:: If set to `true`(the default value) the flush operation will
-block until the flush can be executed if another flush operation is already executing.
+`wait_if_ongoing`:: If set to `true`, the flush operation will block until the
+flush can be executed if another flush operation is already executing. If set to
+`false` then an exception will be thrown at the shard level if another flush
+operation is already running. Defaults to `true`.
 
-`force`:: Whether a flush should be forced even if it is not necessarily needed i.e.
-if no changes will be committed to the index. This is useful if transaction log IDs
-should be incremented even if no uncommitted changes are present.
-(This setting can be considered as internal)
+`force`:: Whether a flush should be forced even if it is not necessarily
+needed, i.e. if no changes will be committed to the index. This can be used to
+force the generation number of the transaction log to be incremented even if no
+uncommitted changes are present. This parameter should be considered internal.
 
 [float]
 [[flush-multi-index]]
 ==== Multi Index
 
-The flush API can be applied to more than one index with a single call,
-or even on `_all` the indices.
+The flush API can be applied to more than one index with a single call, or even
+on `_all` the indices.
 
 [source,js]
 --------------------------------------------------
 POST kimchy,elasticsearch/_flush
 
 POST _flush
 --------------------------------------------------
@@ -50,26 +65,28 @@
 [[synced-flush-api]]
 ==== Synced Flush
 
-Elasticsearch tracks the indexing activity of each shard. Shards that have not
-received any indexing operations for 5 minutes are automatically marked as inactive. This presents
-an opportunity for Elasticsearch to reduce shard resources and also perform
-a special kind of flush, called `synced flush`. A synced flush performs a normal flush, then adds
-a generated unique marker (sync_id) to all shards.
-
-Since the sync id marker was added when there were no ongoing indexing operations, it can
-be used as a quick way to check if the two shards' lucene indices are identical. This quick sync id
-comparison (if present) is used during recovery or restarts to skip the first and
-most costly phase of the process. In that case, no segment files need to be copied and
-the transaction log replay phase of the recovery can start immediately. Note that since the sync id
-marker was applied together with a flush, it is very likely that the transaction log will be empty,
-speeding up recoveries even more.
-
-This is particularly useful for use cases having lots of indices which are
-never or very rarely updated, such as time based data. This use case typically generates lots of indices whose
-recovery without the synced flush marker would take a long time.
-
-To check whether a shard has a marker or not, look for the `commit` section of shard stats returned by
-the <<indices-stats,indices stats>> API:
+{es} keeps track of which shards have received indexing activity recently, and
+considers shards that have not received any indexing operations for 5 minutes to
+be inactive. When a shard becomes inactive {es} performs a special kind of flush
+known as a _synced flush_. A synced flush performs a normal
+<<indices-flush,flush>> on each copy of the shard, and then adds a marker known
+as the `sync_id` to each copy to indicate that these copies have identical
+Lucene indices. Comparing the `sync_id` markers of two shard copies is a very
+efficient way to check whether they have identical contents.
+
+When allocating shard copies, {es} must ensure that each replica contains the
+same data as the primary. If the shard copies have been synced-flushed and the
+replica shares a `sync_id` with the primary then {es} knows that the two copies
+have identical contents. This means there is no need to copy any segment files
+from the primary to the replica, which saves a good deal of time during
+recoveries and restarts.
+
+This is particularly useful for clusters having lots of indices which are very
+rarely updated, such as with time-based indices. Without the synced flush
+marker, recovery of this kind of cluster would be much slower.
+
+To check whether a shard has a `sync_id` marker or not, look for the `commit`
+section of the shard stats returned by the <<indices-stats,indices stats>> API:
 
 [source,sh]
 --------------------------------------------------
@@ -118,26 +135,26 @@ which returns something similar to:
 // TESTRESPONSE[s/"sync_id" : "AVvFY-071siAOuFGEO9P"/"sync_id": $body.indices.twitter.shards.0.0.commit.user_data.sync_id/]
 <1> the `sync id` marker
 
+NOTE: The `sync_id` marker is removed as soon as the shard is flushed again, and
+{es} may trigger an automatic flush of a shard at any time if there are
+unflushed operations in the shard's translog. In practice this means that one
+should consider any indexing operation on an index as having removed its
+`sync_id` markers.
+
 [float]
 ==== Synced Flush API
 
-The Synced Flush API allows an administrator to initiate a synced flush manually. This can be particularly useful for
-a planned (rolling) cluster restart where you can stop indexing and don't want to wait the default 5 minutes for
-idle indices to be sync-flushed automatically.
-
-While handy, there are a couple of caveats for this API:
-
-1. Synced flush is a best effort operation. Any ongoing indexing operations will cause
-the synced flush to fail on that shard. This means that some shards may be synced flushed while others aren't. See below for more.
-2. The `sync_id` marker is removed as soon as the shard is flushed again. That is because a flush replaces the low level
-lucene commit point where the marker is stored. Uncommitted operations in the transaction log do not remove the marker.
-In practice, one should consider any indexing operation on an index as removing the marker as a flush can be triggered by Elasticsearch
-at any time.
-
-
-NOTE: It is harmless to request a synced flush while there is ongoing indexing. Shards that are idle will succeed and shards
- that are not will fail. Any shards that succeeded will have faster recovery times.
+The Synced Flush API allows an administrator to initiate a synced flush
+manually. This can be particularly useful for a planned cluster restart where
+you can stop indexing but don't want to wait for 5 minutes until all indices
+are marked as inactive and automatically sync-flushed.
+
+You can request a synced flush even if there is ongoing indexing activity, and
+{es} will perform the synced flush on a "best-effort" basis: shards that do not
+have any ongoing indexing activity will be successfully sync-flushed, and other
+shards will fail to sync-flush. The successfully sync-flushed shards will have
+faster recovery times as long as the `sync_id` marker is not removed by a
+subsequent flush.
 
 [source,sh]
 --------------------------------------------------
 POST twitter/_flush/synced
 --------------------------------------------------
 // CONSOLE
 // TEST[setup:twitter]
 
-The response contains details about how many shards were successfully sync-flushed and information about any failure.
+The response contains details about how many shards were successfully
+sync-flushed and information about any failure.
 
-Here is what it looks like when all shards of a two shards and one replica index successfully
-sync-flushed:
+Here is what it looks like when all shards of an index with two shards and one
+replica are successfully sync-flushed:
 
 [source,js]
 --------------------------------------------------
@@ -168,7 +186,8 @@ sync-flushed:
 --------------------------------------------------
 // TESTRESPONSE[s/"successful": 2/"successful": 1/]
 
-Here is what it looks like when one shard group failed due to pending operations:
+Here is what it looks like when one shard group failed due to pending
+operations:
 
 [source,js]
 --------------------------------------------------
@@ -193,11 +212,12 @@ Here is what it looks like when one shard group failed due to pending operations
 --------------------------------------------------
 // NOTCONSOLE
 
-NOTE: The above error is shown when the synced flush fails due to concurrent indexing operations. The HTTP
-status code in that case will be `409 CONFLICT`.
+NOTE: The above error is shown when the synced flush fails due to concurrent
+indexing operations. The HTTP status code in that case will be `409 Conflict`.
 
-Sometimes the failures are specific to a shard copy. The copies that failed will not be eligible for
-fast recovery but those that succeeded still will be. This case is reported as follows:
+Sometimes the failures are specific to a shard copy. The copies that failed
+will not be eligible for fast recovery but those that succeeded still will be.
+This case is reported as follows:
 
 [source,js]
 --------------------------------------------------
@@ -230,7 +250,8 @@ fast recovery but those that succeeded still will be. This case is reported as f
 --------------------------------------------------
 // NOTCONSOLE
 
-NOTE: When a shard copy fails to sync-flush, the HTTP status code returned will be `409 CONFLICT`.
+NOTE: When a shard copy fails to sync-flush, the HTTP status code returned will
+be `409 Conflict`.
 
 The synced flush API can be applied to more than one index with a single call,
 or even on `_all` the indices.
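+
+For example, the following sketch requests a synced flush on two specific
+indices and then on all indices; the index names `kimchy` and `elasticsearch`
+are illustrative only:
+
+[source,sh]
+--------------------------------------------------
+POST kimchy,elasticsearch/_flush/synced
+
+POST _flush/synced
+--------------------------------------------------
+// CONSOLE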