
Retry follow task when remote connection queue full #55314


Merged
merged 4 commits into from
Apr 17, 2020

Conversation

dnhatn
Member

@dnhatn dnhatn commented Apr 16, 2020

If more than 100 shard-follow tasks are trying to connect to the remote cluster, then some of them will abort with "connect listener queue is full". This is because we retry on ESRejectedExecutionException, but not on RejectedExecutionException.
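The difference between the two retry checks can be sketched as follows. This is a minimal illustration, not the actual shard-follow task code: `SpecificRejectedExecutionException` is a hypothetical stand-in for the more specific exception type, and the sketch assumes that type is a subclass of `java.util.concurrent.RejectedExecutionException`. Both method names are made up for the example.

```java
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-in for a framework-specific rejection type (assumption:
// it extends the JDK's RejectedExecutionException).
class SpecificRejectedExecutionException extends RejectedExecutionException {
    SpecificRejectedExecutionException(String message) {
        super(message);
    }
}

final class RetryCheckSketch {
    private RetryCheckSketch() {}

    // Narrow check: only the specific subclass is treated as retryable, so a
    // plain RejectedExecutionException is treated as fatal.
    static boolean shouldRetryNarrow(Throwable e) {
        return e instanceof SpecificRejectedExecutionException;
    }

    // Broad check: the base type and every subclass are treated as retryable.
    static boolean shouldRetryBroad(Throwable e) {
        return e instanceof RejectedExecutionException;
    }

    public static void main(String[] args) {
        Throwable queueFull = new RejectedExecutionException("connect listener queue is full");
        System.out.println("narrow check retries: " + shouldRetryNarrow(queueFull));
        System.out.println("broad check retries: " + shouldRetryBroad(queueFull));
    }
}
```

With a plain `RejectedExecutionException`, the narrow check returns `false` (the task would abort) and the broad check returns `true` (the task would retry).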

@dnhatn dnhatn added >bug :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v8.0.0 v7.6.3 v6.8.9 v7.8.0 v7.7.1 labels Apr 16, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CCR)

@@ -105,10 +105,14 @@ public int getNumberOfChannels() {
Setting.Property.NodeScope,
Setting.Property.Dynamic));

// this setting is intentionally not registered, it is only used in tests
public static final Setting<Integer> REMOTE_MAX_CONNECTION_QUEUE_SIZE =
Setting.intSetting("cluster.remote.max_connection_queue_size", 100, Setting.Property.NodeScope);
Contributor

I don't think there was a lot of thought to the connection listener limit. If there is a strong reason to increase it past 100 we could probably do that. Also, does this name make sense? We only allow a single connection round at a time. Should the name be cluster.remote.max_pending_connection_listeners?

Member Author

@dnhatn dnhatn Apr 16, 2020

> Should the name be cluster.remote.max_pending_connection_listeners?

++. I renamed it in f9c807f.

> I don't think there was a lot of thought to the connection listener limit. If there is a strong reason to increase it past 100 we could probably do that.

Yeah, I think we chose this value quite arbitrarily. I think it's fine to increase this value as we should not have many concurrent remote searches, and CCR will retry on this error anyway. I've increased this to 1000. WDYT?
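The limit under discussion can be pictured as a cap on listeners waiting for the single in-flight connection round. The sketch below is a hedged illustration only: `PendingListenerQueueSketch` and its methods are hypothetical names, and it is not the remote-cluster connection code. It assumes a listener over the cap is failed immediately with a `RejectedExecutionException` carrying the message quoted in the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Consumer;

// Hypothetical model of a bounded set of listeners waiting on one connection round.
final class PendingListenerQueueSketch {
    private final int maxPendingListeners;
    private final List<Consumer<Exception>> pending = new ArrayList<>();

    PendingListenerQueueSketch(int maxPendingListeners) {
        if (maxPendingListeners < 1) {
            throw new IllegalArgumentException("maxPendingListeners must be positive");
        }
        this.maxPendingListeners = maxPendingListeners;
    }

    // Queues the listener, or fails it right away if the cap is reached.
    void addListener(Consumer<Exception> listener) {
        synchronized (this) {
            if (pending.size() < maxPendingListeners) {
                pending.add(listener);
                return;
            }
        }
        // Notify outside the lock so a slow listener cannot block other callers.
        listener.accept(new RejectedExecutionException("connect listener queue is full"));
    }

    // Ends the round: every queued listener receives the outcome (null means success).
    void completeRound(Exception error) {
        List<Consumer<Exception>> toNotify;
        synchronized (this) {
            toNotify = new ArrayList<>(pending);
            pending.clear();
        }
        for (Consumer<Exception> listener : toNotify) {
            listener.accept(error);
        }
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        PendingListenerQueueSketch queue = new PendingListenerQueueSketch(2);
        for (int i = 0; i < 3; i++) {
            final int id = i;
            queue.addListener(e -> System.out.println("listener " + id + ": " + (e == null ? "connected" : e.getMessage())));
        }
        queue.completeRound(null);
    }
}
```

In this model, raising the cap only makes rejections rarer; callers that treat the rejection as retryable still recover when the cap is hit.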

@dnhatn dnhatn requested a review from Tim-Brooks April 16, 2020 17:30
Contributor

@Tim-Brooks Tim-Brooks left a comment

LGTM

@dnhatn
Member Author

dnhatn commented Apr 17, 2020

@tbrooks8 Thanks for reviewing.

@dnhatn dnhatn merged commit 5216bd2 into elastic:master Apr 17, 2020
@dnhatn dnhatn deleted the remote-connect-queue branch April 17, 2020 04:10
dnhatn added a commit that referenced this pull request Apr 17, 2020
If more than 100 shard-follow tasks are trying to connect to the remote
cluster, then some of them will abort with "connect listener queue is
full". This is because we retry on ESRejectedExecutionException, but not
on RejectedExecutionException.
dnhatn added a commit that referenced this pull request Apr 21, 2020
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request May 1, 2020
dnhatn added a commit that referenced this pull request May 2, 2020

Backport of #55314
@jakelandis jakelandis removed the v8.0.0 label Jul 26, 2021