
unfollow call not returning when remote cluster is not available #53174


Closed
jepert opened this issue Mar 5, 2020 · 2 comments · Fixed by #53262
Labels
>bug :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features

Comments


jepert commented Mar 5, 2020

Elasticsearch version (bin/elasticsearch --version):

Version: 7.6.1, Build: default/rpm/aa751e09be0a5072e8570670309b1f12348f023b/2020-02-29T00:15:25.529771Z, JVM: 13.0.2

Plugins installed: []

JVM version (java -version):

openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.2+8)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.2+8, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):

Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
Testing a disaster recovery scenario with CCR, I found two different behaviours when unfollowing indices while the leader is not available (remote cluster is down):

  1. The unfollow call returns immediately with connect_transport_exception exceptions.

  2. The unfollow call never returns (I waited for more than an hour).
    The cluster shows the unfollow task running, and it never dies.
    This second case seems to happen with indices with number_of_shards > 1.

Expected behavior: both calls should return with a status.
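
For reference, the first call fails with a standard Elasticsearch error body along these lines (illustrative shape only, not the exact output captured from this run):

{
  "error" : {
    "root_cause" : [
      {
        "type" : "connect_transport_exception",
        "reason" : "[remote-primary][192.168.1.219:9300] connect_exception"
      }
    ],
    "type" : "connect_transport_exception",
    "reason" : "[remote-primary][192.168.1.219:9300] connect_exception"
  },
  "status" : 500
}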

Steps to reproduce:
To reproduce, I did a fresh install of two clusters with three nodes each, using the latest version of Elasticsearch (RPM).

elasticsearch.yml:

cluster.name: primary
node.name: node1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: _eth0_
http.port: 9200
discovery.seed_hosts: ["192.168.1.219", "192.168.1.222", "192.168.1.223"]
cluster.initial_master_nodes: ["node1", "node2", "node3"]
gateway.recover_after_nodes: 2
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12

Steps:

#Start trial on both clusters

curl -X POST "192.168.1.219:9200/_license/start_trial?acknowledge=true&pretty" -u elastic:badpassword
curl -X POST "192.168.1.226:9200/_license/start_trial?acknowledge=true&pretty" -u elastic:badpassword

#remote cluster connection

curl -u elastic:badpassword -X PUT "192.168.1.226:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster": {
      "remote": {
        "remote-primary": {
          "seeds": [
            "192.168.1.219:9300"
          ],
          "transport.ping_schedule": "30s"
        }
      }
    }
  }
}
'

#Verify remote

curl -XGET  'http://192.168.1.226:9200/_remote/info' -u elastic:badpassword 

#Create leader index with 1 shard on primary cluster

curl -u elastic:badpassword -X PUT "192.168.1.219:9200/testing-11?pretty" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 1 
        }
    }
}
'

#Create leader index with 2 shards on primary cluster

curl -u elastic:badpassword -X PUT "192.168.1.219:9200/testing-21?pretty" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 2, 
            "number_of_replicas" : 1 
        }
    }
}
'

#Verify new indices

curl -XGET 'http://192.168.1.219:9200/_cat/indices/test*?v' -u elastic:badpassword

#Create followers on secondary cluster

curl -u elastic:badpassword -X PUT "192.168.1.226:9200/testing-11/_ccr/follow?wait_for_active_shards=1&pretty" -H 'Content-Type: application/json' -d'
{
  "remote_cluster" : "remote-primary",
  "leader_index" : "testing-11"
}
'


curl -u elastic:badpassword -X PUT "192.168.1.226:9200/testing-21/_ccr/follow?wait_for_active_shards=1&pretty" -H 'Content-Type: application/json' -d'
{
  "remote_cluster" : "remote-primary",
  "leader_index" : "testing-21"
}
'

#Verify followers

curl -XGET  'http://192.168.1.226:9200/_cat/indices/test*?v' -u elastic:badpassword
curl -XGET  'http://192.168.1.226:9200/_ccr/stats?pretty' -u elastic:badpassword

#Shutdown primary cluster nodes

systemctl stop elasticsearch

#Verify ccr status

curl -XGET  'http://192.168.1.226:9200/_remote/info' -u elastic:badpassword
curl -XGET  'http://192.168.1.226:9200/_ccr/stats?pretty&filter_path=follow_stats.indices.shards.read_exceptions.*' -u elastic:badpassword

#Pause following on both indices and verify

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_ccr/pause_follow?pretty"
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_ccr/pause_follow?pretty"
curl -XGET  'http://192.168.1.226:9200/testing-*/_ccr/info?pretty' -u elastic:badpassword

#Close

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_close?pretty"
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_close?pretty"
curl -XGET  'http://192.168.1.226:9200/_cat/indices/test*?v' -u elastic:badpassword

#unfollow

#this call to unfollow returns a connect_transport_exception

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_ccr/unfollow?pretty"

#this call to unfollow never returns

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_ccr/unfollow?pretty"

#Verify long running tasks

curl -XGET  'http://192.168.1.226:9200/_cat/tasks?v' -u elastic:badpassword
action                           task_id                      parent_task_id              type      start_time    timestamp running_time ip            node
indices:admin/xpack/ccr/unfollow V-frNb4_Sc2RTGZ_uYne-g:665   -                           transport 1583414050796 13:14:10  2h           192.168.1.226 node3s
indices:admin/xpack/ccr/unfollow dm-h5Zx7TxCUoOvn1I_DWQ:1208  V-frNb4_Sc2RTGZ_uYne-g:665  transport 1583414050798 13:14:10  2h           192.168.1.224 node2s
cluster:monitor/tasks/lists      V-frNb4_Sc2RTGZ_uYne-g:8609  -                           transport 1583421254522 15:14:14  756.7micros  192.168.1.226 node3s
cluster:monitor/tasks/lists[n]   V-frNb4_Sc2RTGZ_uYne-g:8610  V-frNb4_Sc2RTGZ_uYne-g:8609 direct    1583421254523 15:14:14  241.5micros  192.168.1.226 node3s
cluster:monitor/tasks/lists[n]   dm-h5Zx7TxCUoOvn1I_DWQ:16577 V-frNb4_Sc2RTGZ_uYne-g:8609 transport 1583421254530 15:14:14  247.2micros  192.168.1.224 node2s
cluster:monitor/tasks/lists[n]   YIzbKYCoQ-SDyJTkAVSBQg:8917  V-frNb4_Sc2RTGZ_uYne-g:8609 transport 1583421254533 15:14:14  343.3micros  192.168.1.225 node1s

Provide logs (if relevant):

secondary-v761-node1s.log

secondary-v761-node2s.log

secondary-v761-node3s.log

jasontedor (Member) commented Mar 8, 2020

Thanks for the report and the awesomely detailed writeup; it's truly helpful. It made reproducing and understanding this bug incredibly easy, which enabled me to fix it quickly. 🙏

I analyzed the situation here and found that we have a bug in how a listener that is waiting for multiple callbacks to complete handles failures. In particular, if all the callbacks fail for the same reason (such as the same connect transport exception: they were all waiting for a connection to be established, it failed, and each callback received the same instance of the exception), then we would attempt to self-suppress the exception, which is forbidden. That attempt threw an exception of its own, so the listener waiting for the multiple callbacks was never invoked, which meant the client was never notified. 🤦‍♀
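
To make the failure mode concrete, here is a minimal Java sketch of the buggy pattern described above (hypothetical names; this is not the actual Elasticsearch code). Java's Throwable.addSuppressed throws IllegalArgumentException when asked to suppress the throwable itself, so if every callback delivers the same exception instance, the countdown is never completed and the final listener never fires:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a listener that waits for N callbacks and
// accumulates failures by suppressing later exceptions under the first one.
final class GroupedFailureListener {
    private final AtomicInteger remaining;                       // callbacks still outstanding
    private final AtomicReference<Exception> failure = new AtomicReference<>();
    private final Runnable onAllComplete;                        // notifies the waiting client

    GroupedFailureListener(int expectedCallbacks, Runnable onAllComplete) {
        this.remaining = new AtomicInteger(expectedCallbacks);
        this.onAllComplete = onAllComplete;
    }

    void onFailure(Exception e) {
        Exception first = failure.compareAndExchange(null, e);  // null if e was stored first
        if (first != null) {
            // BUG: if every shard-level callback reports the SAME exception
            // instance (e.g. one shared ConnectTransportException), this is
            // self-suppression: addSuppressed throws IllegalArgumentException,
            // the countdown below never runs, and onAllComplete never fires.
            first.addSuppressed(e);
        }
        if (remaining.decrementAndGet() == 0) {
            onAllComplete.run();
        }
    }
}

A guard along the lines of `if (first != e) first.addSuppressed(e);` avoids the self-suppression. This pattern would also explain why only the two-shard index hangs: with two shard-level callbacks sharing one exception instance the suppression path is hit, while the one-shard index produces a single failure and returns normally.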

I opened #53262 to address this issue.

@jasontedor jasontedor added :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >bug labels Mar 8, 2020
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (:Distributed/CCR)
