Rolling restart doc uses persistent but should use transient #27677

msimos · 2017-12-05T20:48:18Z

Elasticsearch version (bin/elasticsearch --version):

6.0.0

Plugins installed: []

JVM version (java -version):

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:

https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html

Step 1 says:

Disable shard allocation.

When you shut down a node, the allocation process waits for one minute before starting to replicate the shards on that node to other nodes in the cluster, causing a lot of wasted I/O. You can avoid racing the clock by disabling allocation before shutting down the node:

PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "none"
}
}

However this should be transient settings. Later on it says to use transient to re-enable:

Reenable shard allocation.

Once the node has joined the cluster, reenable shard allocation to start using the node:

PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}

Shouldn't you use transient only? Basically if you were to do a full cluster restart at some point (without disabling allocation), the persistent setting would only apply and allocation is disabled. While transient setting takes precedent over persistent, this is not clearly spelled out in the docs. Also I think it can lead to confusion as to how to enable or disable allocation using transient vs persistent.

The text was updated successfully, but these errors were encountered:

msimos · 2017-12-05T20:49:42Z

Also if you already have the transient settings set to all, then setting persistent in Step 1 I assume has no effect because transient takes precedent.

voegelas · 2018-01-10T12:52:03Z

We suffered from this documentation mistake too. After a system reboot "cluster.routing.allocation.enable" was set to "none" and newly created indices didn't become available.

pickypg · 2018-02-20T19:01:30Z

I think the bug here is not that the last setting is set transiently, rather than the full cluster restart and rolling restart documentation are in disagreement.

Rolling restart, step 7 shows:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}

Full cluster restart, step 8 shows:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}

As long as you follow either guide, you should end up with allocation enabled. The only issue here is that a transient setting of "all" is lost following a full cluster restart, but that is a good thing as it's already the first step in a full cluster restart -- so setting it transiently protects you from forgetting to set it to "none" as the first step.

I think the fix should be to change the full cluster restart docs to transiently set the setting in step 8 and then close out this issue.

elastic#27677

astefan · 2018-02-21T04:26:51Z

I agree with this comment, the cluster needs to be brought back to the initial state. Not many users are following the procedure when it comes to a full cluster restart. Usually, they are doing it when they actually upgrade.

In other cases, the cluster is just restarted without any allocation being disabled. When this happens and the cluster comes back no new indices will be allocated and then it is realized that allocation was mysteriously disabled sometime in the past. I am in favor of having

"transient": {
"cluster.routing.allocation.enable": "none"
}

pickypg · 2018-02-21T06:35:20Z

But when you restart a cluster accidentally, you are in better shape having disabled allocation rather than allowing allocations in an unexpected situation.

It's true that it is not the default, but that's because the default when you startup is to let you get to work. Leaving it set as none leaves you in a safer state whether you intentionally or accidentally restart a node: no allocations happen without you giving permission.

As to transiently setting none, you cannot do this for full cluster restarts because the master disappears and thus allocation happens immediately upon cluster reformation (which is the problem that setting it persistently, and all transiently addresses).

astefan · 2018-02-21T08:24:38Z

This gh issue was about rolling upgrades and my comment regarding transiently setting it to none applies to rolling upgrades.

I agree it is safer to do it persistently and then reverting it with transiently, but you have the downside of not being aware of the persistent one having been applied at some time and assuming that the cluster should get back to "normal" after restart. Which is not happening until the next day (the logging use case with daily indices) when the user notices that day's indices were not created, Kibana visualizations are not working and all the ingesting flows (LS or not) are accumulating events or, worse, dropping them.

To me, it looks like we are dropping predictability. And safer it may be, but not always (see the logging use case scenario I described above).

#27677

jarro2783 · 2018-02-22T07:28:02Z

I think the docs are still wrong for a rolling restart. If I use transient to re-enable, then repeat the procedure for the next node, I am now in the wrong state. If I follow the first step and disable it for persistent, then nothing happens, because it is already disabled, and transient is already enabled. So the docs should either say not to use persistent for a rolling restart at all, or to point out that you need to disable the transient setting for each node.

Then you probably also want it back to the normal state after finishing, meaning that persistent should be set to enabled.

* es/master: (143 commits) Revert "Disable BWC tests for build issues" Remove AcknowledgedRestListener in favour of RestToXContentListener (#28724) Build: Consolidate archives and packages configuration (#28760) Skip some plugins service tests on Windows Migrate some *ResponseTests to AbstractStreamableXContentTestCase (#28749) Disable BWC tests for build issues Ensure that azure stream has socket privileges (#28751) [DOCS] Fixed broken link. Pass InputStream when creating XContent parser (#28754) [DOCS] Changed to use transient setting to reenabled allocation. Closes #27677 Delay path expansion on Windows [TEST] replace randomAsciiAlphanumOfLengthBetween with randomAsciiLettersOfLengthBetween [Tests] Extract the testing logic for Google Cloud Storage (#28576) [Docs] Update links to java9 docs (#28750) version set in ingest pipeline (#27573) Revert "Add startup logging for standalone tests" Tests: don't wait for completion while trying to get completed task Add 5.6.9 snapshot version [Docs] Java high-level REST client : clean up (#28703) Updated distribution outputs in contributing docs ...

* es/6.x: (140 commits) Remove AcknowledgedRestListener in favour of RestToXContentListener (#28724) Migrate some *ResponseTests to AbstractStreamableXContentTestCase (#28749) Revert "Disable BWC tests for build issues" Skip some plugins service tests on Windows Build: Consolidate archives and packages configuration (#28760) Ensure that azure stream has socket privileges (#28773) Disable BWC tests for build issues Pass InputStream when creating XContent parser (#28754) [DOCS] Fixed broken link. [DOCS] Changed to use transient setting to reenabled allocation. Closes #27677 Delay path expansion on Windows [TEST] replace randomAsciiAlphanumOfLengthBetween with randomAsciiLettersOfLengthBetween REST high-level client: add support for Rollover Index API (#28698) [Tests] Extract the testing logic for Google Cloud Storage (#28576) Moved Grok helper code to a separate Gradle module and let ingest-common module depend on it. version set in ingest pipeline (#27573) [Docs] Update links to java9 docs (#28750) Revert "Add startup logging for standalone tests" Tests: don't wait for completion while trying to get completed task Add 5.6.9 snapshot version ...

pickypg · 2018-02-22T19:15:44Z

@jarro2783

That's a great point. Rolling restarts should probably specify none as both persistent and transient (or, for simple matters, just one or the other). At the very least, we must specify in the docs to remove the transient setting between node changes.

elastic#27677

elasticmachine · 2018-04-24T11:59:49Z

Pinging @elastic/es-distributed

DaveCTurner · 2018-08-24T14:55:19Z

We updated the docs to recommend using persistent throughout the full cluster restart docs in #29670, and in the rolling upgrade docs in #29671, so I think this issue is resolved.

msimos added the >docs General docs changes label Dec 5, 2017

ywelsch assigned nik9000 and debadair Jan 22, 2018

gmoskovicz mentioned this issue Feb 20, 2018

[DOC] Wrong disable/re-enable shard allocation commands. #28747

Closed

jasontedor mentioned this issue Feb 20, 2018

[Docs] Rolling upgrade use "transient" instead "persistent" for disabling shard allocation #27860

Closed

debadair added a commit to debadair/elasticsearch that referenced this issue Feb 21, 2018

[DOCS] Changed to use transient setting to reenabled allocation. Closes

bad35a9

elastic#27677

debadair closed this as completed in 260d68a Feb 21, 2018

debadair added a commit that referenced this issue Feb 21, 2018

[DOCS] Changed to use transient setting to reenabled allocation. Closes

65175e3

#27677

debadair added a commit that referenced this issue Feb 21, 2018

[DOCS] Changed to use transient setting to reenabled allocation. Closes

86085fa

#27677

debadair added a commit that referenced this issue Feb 21, 2018

[DOCS] Changed to use transient setting to reenabled allocation. Closes

f0be0bf

#27677

debadair added a commit that referenced this issue Feb 21, 2018

[DOCS] Changed to use transient setting to reenabled allocation. Closes

5281c85

#27677

pickypg reopened this Feb 22, 2018

sebasjm pushed a commit to sebasjm/elasticsearch that referenced this issue Mar 10, 2018

[DOCS] Changed to use transient setting to reenabled allocation. Closes

d74914d

elastic#27677

colings86 added the :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. label Apr 24, 2018

debadair mentioned this issue May 22, 2018

Rolling upgrade, use transient settings to disable shard allocation elastic/docs#343

Closed

DaveCTurner added the team-discuss label Aug 24, 2018

DaveCTurner closed this as completed Aug 24, 2018

DaveCTurner removed the team-discuss label Aug 24, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rolling restart doc uses persistent but should use transient #27677

Rolling restart doc uses persistent but should use transient #27677

msimos commented Dec 5, 2017

msimos commented Dec 5, 2017

voegelas commented Jan 10, 2018

pickypg commented Feb 20, 2018

astefan commented Feb 21, 2018

pickypg commented Feb 21, 2018

astefan commented Feb 21, 2018

jarro2783 commented Feb 22, 2018

pickypg commented Feb 22, 2018

elasticmachine commented Apr 24, 2018

DaveCTurner commented Aug 24, 2018

Rolling restart doc uses persistent but should use transient #27677

Rolling restart doc uses persistent but should use transient #27677

Comments

msimos commented Dec 5, 2017

msimos commented Dec 5, 2017

voegelas commented Jan 10, 2018

pickypg commented Feb 20, 2018

astefan commented Feb 21, 2018

pickypg commented Feb 21, 2018

astefan commented Feb 21, 2018

jarro2783 commented Feb 22, 2018

pickypg commented Feb 22, 2018

elasticmachine commented Apr 24, 2018

DaveCTurner commented Aug 24, 2018