
unfollow call not returning when remote cluster is not available #53174


Closed
jepert opened this issue Mar 5, 2020 · 2 comments · Fixed by #53262
Labels
>bug :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features

Comments


jepert commented Mar 5, 2020

Elasticsearch version (bin/elasticsearch --version):

Version: 7.6.1, Build: default/rpm/aa751e09be0a5072e8570670309b1f12348f023b/2020-02-29T00:15:25.529771Z, JVM: 13.0.2

Plugins installed: []

JVM version (java -version):

openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.2+8)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.2+8, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):

Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
Testing a disaster recovery scenario with CCR, I found two different behaviours when unfollowing indices while the leader is not available (remote cluster is down):

  1. The unfollow call returns immediately with connect_transport_exception exceptions.

  2. The unfollow call never returns (I waited for more than an hour).
    The cluster shows the unfollow task running, and it never dies.
    This second case seems to happen with indices with number_of_shards > 1.

Expected behavior: both calls should return with a status.
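
For reference, the first call fails with a standard Elasticsearch error body along these lines (illustrative shape only, not the exact output captured from this run):

{
  "error" : {
    "root_cause" : [
      {
        "type" : "connect_transport_exception",
        "reason" : "[remote-primary][192.168.1.219:9300] connect_exception"
      }
    ],
    "type" : "connect_transport_exception",
    "reason" : "[remote-primary][192.168.1.219:9300] connect_exception"
  },
  "status" : 500
}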

Steps to reproduce:
To reproduce, I did a fresh install of two clusters with three nodes each, using the latest version of Elasticsearch (RPM).

elasticsearch.yml:

cluster.name: primary
node.name: node1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: _eth0_
http.port: 9200
discovery.seed_hosts: ["192.168.1.219", "192.168.1.222", "192.168.1.223"]
cluster.initial_master_nodes: ["node1", "node2", "node3"]
gateway.recover_after_nodes: 2
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12

Steps:

#Start trial on both clusters

curl -X POST "192.168.1.219:9200/_license/start_trial?acknowledge=true&pretty" -u elastic:badpassword
curl -X POST "192.168.1.226:9200/_license/start_trial?acknowledge=true&pretty" -u elastic:badpassword

#remote cluster connection

curl -u elastic:badpassword -X PUT "192.168.1.226:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster": {
      "remote": {
        "remote-primary": {
          "seeds": [
            "192.168.1.219:9300"
          ],
          "transport.ping_schedule": "30s"
        }
      }
    }
  }
}
'

#Verify remote

curl -XGET  'http://192.168.1.226:9200/_remote/info' -u elastic:badpassword 

#Create leader index with 1 shard on primary cluster

curl -u elastic:badpassword -X PUT "192.168.1.219:9200/testing-11?pretty" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 1 
        }
    }
}
'

#Create leader index with 2 shards on primary cluster

curl -u elastic:badpassword -X PUT "192.168.1.219:9200/testing-21?pretty" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 2, 
            "number_of_replicas" : 1 
        }
    }
}
'

#Verify new indices

curl -XGET 'http://192.168.1.219:9200/_cat/indices/test*?v' -u elastic:badpassword

#Create followers on secondary cluster

curl -u elastic:badpassword -X PUT "192.168.1.226:9200/testing-11/_ccr/follow?wait_for_active_shards=1&pretty" -H 'Content-Type: application/json' -d'
{
  "remote_cluster" : "remote-primary",
  "leader_index" : "testing-11"
}
'


curl -u elastic:badpassword -X PUT "192.168.1.226:9200/testing-21/_ccr/follow?wait_for_active_shards=1&pretty" -H 'Content-Type: application/json' -d'
{
  "remote_cluster" : "remote-primary",
  "leader_index" : "testing-21"
}
'

#Verify followers

curl -XGET  'http://192.168.1.226:9200/_cat/indices/test*?v' -u elastic:badpassword
curl -XGET  'http://192.168.1.226:9200/_ccr/stats?pretty' -u elastic:badpassword

#Shutdown primary cluster nodes

systemctl stop elasticsearch

#Verify ccr status

curl -XGET  'http://192.168.1.226:9200/_remote/info' -u elastic:badpassword
curl -XGET  'http://192.168.1.226:9200/_ccr/stats?pretty&filter_path=follow_stats.indices.shards.read_exceptions.*' -u elastic:badpassword

#Pause following on both indices and verify

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_ccr/pause_follow?pretty"
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_ccr/pause_follow?pretty"
curl -XGET  'http://192.168.1.226:9200/testing-*/_ccr/info?pretty' -u elastic:badpassword

#Close

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_close?pretty"
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_close?pretty"
curl -XGET  'http://192.168.1.226:9200/_cat/indices/test*?v' -u elastic:badpassword

#unfollow

#this call to unfollow returns a connect_transport_exception

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_ccr/unfollow?pretty"

#this call to unfollow never returns

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_ccr/unfollow?pretty"

#Verify long running tasks

curl -XGET  'http://192.168.1.226:9200/_cat/tasks?v' -u elastic:badpassword
action                           task_id                      parent_task_id              type      start_time    timestamp running_time ip            node
indices:admin/xpack/ccr/unfollow V-frNb4_Sc2RTGZ_uYne-g:665   -                           transport 1583414050796 13:14:10  2h           192.168.1.226 node3s
indices:admin/xpack/ccr/unfollow dm-h5Zx7TxCUoOvn1I_DWQ:1208  V-frNb4_Sc2RTGZ_uYne-g:665  transport 1583414050798 13:14:10  2h           192.168.1.224 node2s
cluster:monitor/tasks/lists      V-frNb4_Sc2RTGZ_uYne-g:8609  -                           transport 1583421254522 15:14:14  756.7micros  192.168.1.226 node3s
cluster:monitor/tasks/lists[n]   V-frNb4_Sc2RTGZ_uYne-g:8610  V-frNb4_Sc2RTGZ_uYne-g:8609 direct    1583421254523 15:14:14  241.5micros  192.168.1.226 node3s
cluster:monitor/tasks/lists[n]   dm-h5Zx7TxCUoOvn1I_DWQ:16577 V-frNb4_Sc2RTGZ_uYne-g:8609 transport 1583421254530 15:14:14  247.2micros  192.168.1.224 node2s
cluster:monitor/tasks/lists[n]   YIzbKYCoQ-SDyJTkAVSBQg:8917  V-frNb4_Sc2RTGZ_uYne-g:8609 transport 1583421254533 15:14:14  343.3micros  192.168.1.225 node1s

Provide logs (if relevant):

secondary-v761-node1s.log

secondary-v761-node2s.log

secondary-v761-node3s.log

jasontedor (Member) commented Mar 8, 2020

Thanks for the report and the awesomely detailed writeup; it's truly helpful. It made reproducing and understanding this bug incredibly easy, which enabled me to fix it quickly. 🙏

I analyzed the situation here and found that we have a bug in how a listener that is waiting for multiple callbacks to complete handles failures. In particular, if all the callbacks fail for the same reason (such as the same connect transport exception: they were all waiting for a connection to be established, it failed, and each callback received the same instance of the exception), then we would attempt to self-suppress the exception, which is forbidden. That attempt threw an exception of its own, so the listener waiting for the multiple callbacks was never invoked, which meant the client was never notified. 🤦‍♀
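
To make the failure mode concrete, here is a minimal Java sketch of the buggy pattern described above (hypothetical names; this is not the actual Elasticsearch code). Java's Throwable.addSuppressed throws IllegalArgumentException when asked to suppress the throwable itself, so if every callback delivers the same exception instance, the countdown is never completed and the final listener never fires:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a listener that waits for N callbacks and
// accumulates failures by suppressing later exceptions under the first one.
final class GroupedFailureListener {
    private final AtomicInteger remaining;                       // callbacks still outstanding
    private final AtomicReference<Exception> failure = new AtomicReference<>();
    private final Runnable onAllComplete;                        // notifies the waiting client

    GroupedFailureListener(int expectedCallbacks, Runnable onAllComplete) {
        this.remaining = new AtomicInteger(expectedCallbacks);
        this.onAllComplete = onAllComplete;
    }

    void onFailure(Exception e) {
        Exception first = failure.compareAndExchange(null, e);  // null if e was stored first
        if (first != null) {
            // BUG: if every shard-level callback reports the SAME exception
            // instance (e.g. one shared ConnectTransportException), this is
            // self-suppression: addSuppressed throws IllegalArgumentException,
            // the countdown below never runs, and onAllComplete never fires.
            first.addSuppressed(e);
        }
        if (remaining.decrementAndGet() == 0) {
            onAllComplete.run();
        }
    }
}

A guard along the lines of `if (first != e) first.addSuppressed(e);` avoids the self-suppression. This pattern would also explain why only the two-shard index hangs: with two shard-level callbacks sharing one exception instance the suppression path is hit, while the one-shard index produces a single failure and returns normally.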

I opened #53262 to address this issue.

@jasontedor jasontedor added :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >bug labels Mar 8, 2020
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (:Distributed/CCR)
