Rolling restart doc uses persistent but should use transient #27677
Also, if you already have the transient setting set to "all", then setting the persistent one in step 1 I assume has no effect, because transient takes precedence.
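A minimal illustration of that precedence, assuming a cluster where both scopes have been set: the response shows both values, and the transient one is the one actually in effect.

GET _cluster/settings

{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  },
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}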
We suffered from this documentation mistake too. After a system reboot, "cluster.routing.allocation.enable" was set to "none" and newly created indices didn't become available.
I think the bug here is not that the last setting is set transiently, but rather that the full cluster restart and rolling restart documentation are in disagreement. Rolling restart, step 7, shows:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}

Full cluster restart, step 8, shows:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}

As long as you follow either guide, you should end up with allocation enabled. The only issue here is that a transient setting of "all" merely masks a persistent "none" left over from step 1; the persistent setting is still stored and takes effect again once the transient one is cleared. I think the fix should be to change the full cluster restart docs to transiently set the setting in step 8 and then close out this issue.
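If the goal is to remove the leftover persistent setting rather than just mask it, here is a sketch of one way to do that: assigning null to a cluster setting removes it, so the default ("all" for this setting) applies again.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}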
I agree with this comment; the cluster needs to be brought back to the initial state. Not many users follow the procedure when it comes to a full cluster restart; usually they follow it when they actually upgrade. In other cases, the cluster is just restarted without any allocation having been disabled. When this happens and the cluster comes back, no new indices are allocated, and only then is it realized that allocation was mysteriously disabled sometime in the past. I am in favor of having the setting returned to its default once the procedure is complete.
But when you restart a cluster accidentally, you are in better shape having disabled allocation rather than allowing allocation in an unexpected situation. It's true that it is not the default, but that's because the default at startup is to let you get to work. Leaving it set as "none" after an unexpected restart is arguably the safer failure mode. As to transiently setting it to "none": a transient setting would be lost in exactly that unexpected full-restart scenario, which is why the persistent form is the safer choice.
This GH issue was about rolling upgrades, and my comment regarding transiently setting it to "none" applies to rolling upgrades. I agree it is safer to do it persistently and then revert it transiently, but you have the downside of not being aware that the persistent setting was applied at some point, and assuming that the cluster should get back to "normal" after a restart. That does not happen until the next day (the logging use case with daily indices), when the user notices that the day's indices were not created, Kibana visualizations are not working, and all the ingest flows (Logstash or otherwise) are accumulating events or, worse, dropping them. To me, it looks like we are dropping predictability. And safer it may be, but not always (see the logging use case I described above).
I think the docs are still wrong for a rolling restart. If I use transient to re-enable, then repeat the procedure for the next node, I am now in the wrong state: following the first step and disabling it persistently does nothing, because it is already disabled persistently, while transient is already enabled. So the docs should either say not to use persistent for a rolling restart at all, or point out that you need to disable the transient setting for each node. Then you probably also want to get back to the normal state after finishing, meaning that persistent should be set to enabled.
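Before repeating the procedure for the next node, checking which scope is actually in effect makes that state explicit; an empty scope in the response means the default ("all") applies.

GET _cluster/settings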
Merge notifications on es/master and es/6.x include the documentation fix: [DOCS] Changed to use transient setting to reenabled allocation. Closes #27677
That's a great point. Rolling restarts should probably specify "transient" for both the disable and re-enable steps.
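In other words, step 1 of the rolling restart would become something like the following sketch, mirroring the transient form the re-enable step already uses:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}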
Pinging @elastic/es-distributed
Elasticsearch version (bin/elasticsearch --version): 6.0.0

Plugins installed: []

JVM version (java -version):

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:
https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html
Step 1 says:
Disable shard allocation.
When you shut down a node, the allocation process waits for one minute before starting to replicate the shards on that node to other nodes in the cluster, causing a lot of wasted I/O. You can avoid racing the clock by disabling allocation before shutting down the node:
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "none"
}
}
However, this should be a transient setting. Later on, the guide says to use transient to re-enable:
Reenable shard allocation.
Once the node has joined the cluster, reenable shard allocation to start using the node:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
Shouldn't you use transient only? Basically, if you were to do a full cluster restart at some point (without disabling allocation first), only the persistent setting would still apply, and allocation would be disabled. While the transient setting takes precedence over the persistent one, this is not clearly spelled out in the docs. I also think it can lead to confusion as to how to enable or disable allocation using transient vs. persistent settings.
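To make the failure mode concrete, here is a sketch of the state the current instructions leave behind, assuming no other allocation settings are in play:

# State after following the rolling restart guide as written:
GET _cluster/settings

{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  },
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}

# Transient settings do not survive a full cluster restart, so after one,
# only the persistent "none" remains and allocation stays disabled.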