Cluster stuck for few mins blocked by zen-disco-node-left #46909
@entrop-tankos Thanks for reaching out. In general we prefer to keep GitHub for confirmed bug reports and use discuss.elastic.co to talk about issues. In essence, you're describing a node-left cluster state publication being delayed by other node-left events. I vaguely remember @DaveCTurner made some improvements in this area a while ago, so I will kindly ask him for comments/pointers before closing this issue. If you don't get a response here, please open a topic on discuss.elastic.co.
Pinging @elastic/es-distributed
This sounds like the situation fixed by #39629, although it's possible that #40150 will also help. You can perhaps mitigate some of the delays by reducing the relevant timeout settings.

If you'd like to discuss further then please start a thread on the discussion forum. If it turns out that this isn't addressed in more recent versions then of course we can reopen this issue, but for now I will close this.
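For reference, here is a minimal sketch of the kind of timeout reduction that can shorten these delays on a 6.x cluster. Which setting the comment above actually refers to is an assumption on my part; the zen discovery publication and fault-detection timeouts below are the usual candidates, and the values are illustrative rather than recommendations.

```yaml
# elasticsearch.yml (master-eligible nodes) — illustrative values only.
# Lowering these makes the master give up on unresponsive nodes sooner,
# at the cost of more false positives on a slow or congested network.
discovery.zen.publish_timeout: 10s   # wait for nodes to apply a published cluster state (default 30s)
discovery.zen.commit_timeout: 10s    # wait for the publication to be committed (default 30s)
discovery.zen.fd.ping_timeout: 10s   # per-ping timeout for node fault detection (default 30s)
discovery.zen.fd.ping_retries: 2     # failed pings before a node is considered dead (default 3)
```

Note that these only shorten how long each blocked publication waits; they do not remove the serial processing of node-left events that #39629 addresses.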
Just upgraded to 6.8.5. Any chance of seeing this fix backported to 6.8? We use Graylog and there is no support for 7.x right now.
No, I do not anticipate backporting any of these changes to 6.8.
Elasticsearch version: 6.8.2
Cluster: a huge one: 360 data nodes, 60 coordinating nodes, 40 master nodes. About 200 TB of data
Plugins installed: none
JVM version: 1.8.0_102-b14
OS: CentOS 7.6
A few words about the cluster:
I'm running a big Elasticsearch cluster across 4 data centers, 3 of which host data nodes (120 each).
The indices I'm storing have 180 primary and 180 replica shards each. Each replica shard is stored in a different data center than its primary,
so it's fine for the cluster to stay yellow if it loses a data center for some reason.
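For context, this kind of cross-data-center replica placement is usually expressed with shard allocation awareness. The sketch below assumes a hypothetical `datacenter` node attribute with values `dc1`/`dc2`/`dc3`; none of these names come from the report itself.

```yaml
# elasticsearch.yml on each data node — tag the node with its data center (hypothetical names)
node.attr.datacenter: dc1

# elasticsearch.yml on master-eligible nodes (these settings can also be changed
# dynamically via the cluster settings API): spread each shard's copies across
# data centers, and with forced awareness avoid re-replicating everything onto
# the surviving sites when one data center drops out.
cluster.routing.allocation.awareness.attributes: datacenter
cluster.routing.allocation.awareness.force.datacenter.values: dc1,dc2,dc3
```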
Now about the bug:
When a data center goes down, the nodes allocated there (145 nodes) stop responding.
Right after "pulling the plug" in one data center, the master behaves like this:
It detects 1 node as down (1 of 145) and creates about 140 pending tasks to commit to the cluster that this node is gone.
These tasks become blockers: while the master is waiting for responses from dead nodes, it doesn't mark the currently lost primary shards as stale. This produces
a huge delay for all indexing operations on the cluster (3–5 minutes).
How this can be reproduced:
I have 30 shards per node. You can see here that it detected 1 node and 30 shards as down, but many more nodes are actually down.
Pending tasks:
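For anyone trying to observe the same backlog, the queued node-left cluster state updates can be listed with the pending cluster tasks APIs; this only shows how to inspect them, the counts described above come from the reporter's cluster.

```
GET /_cat/pending_tasks?v     # one line per queued cluster state update task, with priority and time in queue
GET /_cluster/pending_tasks   # the same information as JSON
```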
What I'd like to see instead of this behavior:
As far as I understand, the tasks are processed by the TaskBatcher in a single thread, one after another.
It would be great to detect dead nodes asynchronously and to cancel the pending tasks for those nodes.