Thanks for the report, and the awesomely-detailed writeup, it's truly helpful. It made reproducing and understanding this bug incredibly easy, which enabled me to quickly fix the bug. 🙏
I analyzed the situation here and found a bug in how a listener that waits for multiple callbacks to complete handles failures. In particular, if all of the callbacks fail for the same reason, for example because they were all waiting for the same connection to be established and that connection attempt failed, then every callback receives the same instance of the connect transport exception. In that case we would attempt to add the exception as a suppressed exception of itself, which is forbidden. That attempt threw an exception, so the listener waiting for the multiple callbacks was never invoked, which meant the client was never notified. 🤦‍♀️
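The failure mode described above can be reproduced in isolation. Below is a minimal Java sketch, assuming a simplified grouped listener that keeps the first failure and attaches later ones with `Throwable#addSuppressed`, and assuming one callback per shard. `GroupedListener`, `run` and the `guardSelfSuppression` flag are hypothetical names for illustration. This is not the actual Elasticsearch class or its fix. The key fact is real JDK behaviour: `addSuppressed` throws `IllegalArgumentException` when an exception is asked to suppress itself.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class GroupedListenerSketch {

    /** Hypothetical listener that notifies onDone once all expected callbacks have arrived. */
    static final class GroupedListener {
        private final AtomicInteger remaining;
        private final AtomicReference<Exception> failure = new AtomicReference<>();
        private final Consumer<Exception> onDone; // receives null if every callback succeeded
        private final boolean guardSelfSuppression;

        GroupedListener(int expected, Consumer<Exception> onDone, boolean guardSelfSuppression) {
            this.remaining = new AtomicInteger(expected);
            this.onDone = onDone;
            this.guardSelfSuppression = guardSelfSuppression;
        }

        void onResponse() {
            countDown();
        }

        void onFailure(Exception e) {
            // Keep the first failure and attach later ones to it as suppressed exceptions.
            if (!failure.compareAndSet(null, e)) {
                Exception first = failure.get();
                if (!guardSelfSuppression || first != e) {
                    // Throwable#addSuppressed throws IllegalArgumentException if first == e,
                    // so countDown() below is never reached.
                    first.addSuppressed(e);
                }
            }
            countDown();
        }

        private void countDown() {
            if (remaining.decrementAndGet() == 0) {
                onDone.accept(failure.get());
            }
        }
    }

    /** Simulates `shards` callbacks that all fail with the SAME exception instance. */
    static String run(int shards, boolean guard) {
        AtomicReference<String> outcome = new AtomicReference<>("never notified");
        GroupedListener listener = new GroupedListener(shards,
                e -> outcome.set(e == null ? "success" : "failed: " + e.getMessage()), guard);
        // One shared instance, as when every callback waited on the same failed connection.
        Exception shared = new Exception("connect_transport_exception");
        for (int i = 0; i < shards; i++) {
            try {
                listener.onFailure(shared);
            } catch (IllegalArgumentException selfSuppression) {
                // Buggy path: the exception escapes and the waiting client is never notified.
            }
        }
        return outcome.get();
    }

    public static void main(String[] args) {
        System.out.println("1 shard,  no guard: " + run(1, false));
        System.out.println("2 shards, no guard: " + run(2, false));
        System.out.println("2 shards, guarded:  " + run(2, true));
    }
}
```

With one shard there is only one failure, so nothing is suppressed and the caller sees `failed: connect_transport_exception`. With two shards and no guard, the second callback tries to suppress the exception into itself, the listener never fires, and `run` returns `never notified`. That matches the two behaviours in the report. Skipping suppression when both references are the same instance restores the notification.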
Elasticsearch version (`bin/elasticsearch --version`): Version: 7.6.1, Build: default/rpm/aa751e09be0a5072e8570670309b1f12348f023b/2020-02-29T00:15:25.529771Z, JVM: 13.0.2
Plugins installed: []
JVM version (`java -version`):
openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.2+8)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.2+8, mixed mode, sharing)
OS version (`uname -a` if on a Unix-like system):
Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
While testing a disaster recovery scenario with CCR, I found two different behaviours when unfollowing indices while the leader is not available (the remote cluster is down):
1. The unfollow call returns immediately with a connect_transport_exception.
2. The unfollow call never returns (I waited for more than an hour). The cluster shows the unfollow task running, but it never finishes.
This second case seems to happen with indices that have number_of_shards > 1.
Expected behavior: both unfollow calls return with a status.
Steps to reproduce:
To reproduce, I did a fresh install of 2 clusters with 3 nodes each, running the latest version of Elasticsearch (7.6.1, RPM).
elasticsearch.yml:
Steps:
#Start trial on both clusters
#remote cluster connection
#Verify remote
#Create leader index with 1 shard on primary cluster
#Create leader index with 2 shards on primary cluster
#Verify new indices
curl -XGET 'http://192.168.1.219:9200/_cat/indices/test*?v' -u elastic:badpassword
#Create followers on secondary cluster
#Verify followers
#Shutdown primary cluster nodes
#Verify ccr status
#Pause following on both indices and verify
#Close
#unfollow
#this call to unfollow returns a connect_transport_exception
#this call to unfollow never returns
#Verify long running tasks
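The pause, close, unfollow and task-check steps above can be scripted so that a hanging unfollow call is reported instead of blocking the client indefinitely. Below is a sketch using the JDK's `java.net.http.HttpClient` (available since Java 11; the reporter ran 13.0.2). The original commands for these steps were not preserved. The follower-cluster address `http://localhost:9200`, the follower index names `follower-1-shard` / `follower-2-shards` and the 2-minute timeout are therefore hypothetical assumptions. The REST paths are the standard CCR pause-follow, close-index, unfollow and task-management endpoints.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class UnfollowRepro {
    // Hypothetical address of a node in the secondary (follower) cluster.
    static final String FOLLOWER_CLUSTER = "http://localhost:9200";
    // Same credentials as the curl call in the steps above.
    static final String AUTH = "Basic " + Base64.getEncoder()
            .encodeToString("elastic:badpassword".getBytes(StandardCharsets.UTF_8));

    static HttpRequest.Builder request(String pathAndQuery) {
        return HttpRequest.newBuilder(URI.create(FOLLOWER_CLUSTER + pathAndQuery))
                .header("Authorization", AUTH)
                // Client-side timeout: a hung call surfaces as HttpTimeoutException.
                .timeout(Duration.ofMinutes(2));
    }

    /** Pause following, close the follower index, then unfollow it. */
    static List<HttpRequest> unfollowSequence(String followerIndex) {
        List<HttpRequest> requests = new ArrayList<>();
        for (String action : List.of("_ccr/pause_follow", "_close", "_ccr/unfollow")) {
            requests.add(request("/" + followerIndex + "/" + action)
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build());
        }
        return requests;
    }

    /** Lists running tasks whose action name contains "unfollow". */
    static HttpRequest unfollowTasks() {
        return request("/_tasks?actions=*unfollow*&detailed=true").GET().build();
    }

    static void send(HttpClient client, HttpRequest request) {
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(request.uri() + " -> " + response.statusCode() + " " + response.body());
        } catch (HttpTimeoutException e) {
            System.out.println(request.uri() + " -> no response within timeout");
        } catch (IOException e) {
            System.out.println(request.uri() + " -> request failed: " + e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical follower names for the 1-shard and 2-shard leader indices.
        for (String index : List.of("follower-1-shard", "follower-2-shards")) {
            unfollowSequence(index).forEach(r -> send(client, r));
        }
        send(client, unfollowTasks());
    }
}
```

The timeout only bounds how long this client waits. It does not cancel the server-side unfollow task, which would still show up in the `_tasks` output if it is stuck.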
Provide logs (if relevant):
secondary-v761-node1s.log
secondary-v761-node2s.log
secondary-v761-node3s.log