[Feature Request] Configuration to customize discovery/zen/fd/master_ping #36822
Comments
Pinging @elastic/es-distributed
@DaveCTurner not sure if it makes sense to consider a zen proposal such as this, given zen2 progress.
@kimxogus Might be worth taking a look at Elastic's own helm chart -- currently in alpha status -- for Elasticsearch and esp. the clustering and node discovery approach.
We certainly won't fix this as described - the fault detection and master election mechanisms are completely changing for 7.0 as described in #32006 - but I do think we can do better in this situation. Marking this for team discussion. The proposal doesn't actually fix the problem described anyway, because it's not a pinging problem:
I think the actual problem here is #29025, but a more orderly master handover process would also help.
On Linux, reducing
Reducing
and log in elastic's own chart with
I do not understand what these messages have to do with the original post, or how you managed to get them. The OP was talking about shutting down a master, but if the master were shut down then it'd never respond, so that's not how these messages arose. Also these requests timed out after 3 seconds, and Elasticsearch reacted to the timeout at that time.
Could you share logs from both the old, stopping, master and the newly-elected master for the time period from when the old master stopped until the new master was elected and the cluster has fully recovered?
@DaveCTurner I created test master cluster with and logs with
The old master (master-0) received SIGTERM at about 2018-12-21T07:46:16,521, and the outage lasted about 1 minute.
All other settings are the default values from the original chart, and the image is the official image.
Thanks, the logs were helpful. The issue you are facing is related to #29025: the first cluster state update from the new master causes all the nodes to try to re-establish their connections to the old master, expecting this either to succeed or to fail immediately. However, Docker's network doesn't behave as expected: if the container has completely gone away, connection attempts receive no response and eventually time out. Worse, we try twice before continuing, so it takes two connection timeouts (each 30 seconds by default) before the cluster proceeds. I would reset your
Duplicates #29025.
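The two-timeout explanation above accounts for the roughly one-minute outage, as a quick calculation shows. The Python sketch below is not Elasticsearch code: the function name `estimated_stall_seconds` and its parameters are hypothetical, and the only inputs are the figures quoted in the comment above (a 30-second default connection timeout and two attempts).

```python
def estimated_stall_seconds(connect_timeout_s: float = 30.0, attempts: int = 2) -> float:
    """Rough lower bound on how long the cluster stalls while nodes try to
    reconnect to an old master whose address no longer answers at all.

    Assumption (from the comment above): each attempt blocks for the full
    connection timeout, and the attempts happen one after the other.
    """
    if connect_timeout_s < 0 or attempts < 0:
        raise ValueError("timeout and attempts must be non-negative")
    return connect_timeout_s * attempts


if __name__ == "__main__":
    # Default figures quoted in the thread: two attempts of 30 seconds each.
    print(estimated_stall_seconds())
    # A shorter, hypothetical connection timeout shrinks the stall proportionally.
    print(estimated_stall_seconds(connect_timeout_s=5.0))
```

Under this model the fault-detection ping timeouts (3 seconds in the report) do not affect the stall; only the connection timeout and the number of attempts do.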
Thank you 👍
Describe the feature:
In a Kubernetes environment, the IP address of each cluster node is assigned to a pod, which is a Docker container. When a pod (node) is terminated, pings to the old master's address time out, because the newly created pod (node) gets a different IP address. In this situation, a cluster outage occurs for `discovery.zen.join_timeout`, which defaults to 20 times the ping timeout (see the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#master-election)), which will be more than a minute. Reducing `ping_timeout` below 1 second is too dangerous (it may cause problems in master election), and keeping the pod IP reachable for several seconds after sending SIGTERM to Elasticsearch, just so that pings keep succeeding, doesn't seem to be a proper solution. As in [this discussion](https://discuss.elastic.co/t/timed-out-waiting-for-all-nodes-to-process-published-state-and-cluster-unavailability/138590), I believe that adding a config option to make Elasticsearch skip pinging and waiting for the old master before moving to a new master would be a good solution.
Elasticsearch version (`bin/elasticsearch --version`): 6.2.3
Plugins installed: [ingest-geoip, ingest-user-agent, repository-s3]
JVM version (`java -version`):
OS version (`uname -a` if on a Unix-like system): Linux {HOSTNAME} 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Steps to reproduce:
respond to HTTP requests for about 1 minute (with `discovery.zen.ping_timeout=3s` and `discovery.zen.fd.ping_timeout=3s`).
Provide logs (if relevant):