-
Notifications
You must be signed in to change notification settings - Fork 25.2k
CCR: Add TransportService closed to retryable errors #34722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Both testFollowIndexAndCloseNode and testFailOverOnFollower failed because they responded to the FollowTask a TransportService closed exception which is currently considered as a fatal error. This behavior is not desirable since a closing node can throw that exception, and we should retry in this case. This change adds TransportService closed error to the list of retryable errors.
Pinging @elastic/es-distributed |
CI failed because we hit IndexNotFoundException (not retryable) when a node is being closed.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI failed because we hit IndexNotFoundException (not retryable) when a node is being closed.
It looks like it failed because of NodeDisconnectedException
instead:
1> [2018-10-22T22:40:29,638][INFO ][o.e.t.InternalTestCluster] [testFollowIndexAndCloseNode] Closing random non master node [follower0] current master [follower1]
1> [2018-10-22T22:40:29,645][INFO ][o.e.n.Node ] [testFollowIndexAndCloseNode] stopping ...
1> [2018-10-22T22:40:29,671][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower0] shard follow task encounter non-retryable error
1> org.elasticsearch.transport.NodeDisconnectedException: [leaderm0][127.0.0.1:40071][indices:data/read/xpack/ccr/shard_changes] disconnected
1> [2018-10-22T22:40:29,662][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower1] shard follow task encounter non-retryable error
1> org.elasticsearch.transport.RemoteTransportException: [follower0][127.0.0.1:45630][indices:data/write/bulk_shard_operations[s]]
1> Caused by: org.elasticsearch.transport.SendRequestTransportException: [follower0][127.0.0.1:39049][indices:data/write/bulk_shard_operations[s][p]]
1> at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:670) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:573) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:561) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.performAction(TransportReplicationAction.java:813) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
The IndexNotFoundException
happens in the testDeleteFollowerIndex()
test.
I think we should also retry when NodeDisconnectedException
occurs?
(transportError.getMessage() != null && transportError.getMessage().contains("TransportService is closed"))) { | ||
return true; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you add this below then that will work too and it is shorter:
(actual instanceof NodeNotConnectedException && actual.getMessage().contains("TransportService is closed"))
Thanks @martijnvg for looking into the failure. I've addressed your comment. Could you please have another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left one commit. LGTM
actual instanceof IndexClosedException; // If follow index is closed | ||
actual instanceof IndexClosedException || // If follow index is closed | ||
|
||
actual instanceof NodeDisconnectedException || actual instanceof NodeNotConnectedException || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add actual instanceof NodeNotConnectedException ||
on a newline?
@@ -369,6 +371,7 @@ private void handleFailure(Exception e, AtomicInteger retryCounter, Runnable tas | |||
scheduler.accept(TimeValue.timeValueMillis(delay), task); | |||
} else { | |||
fatalException = ExceptionsHelper.convertToElastic(e); | |||
LOGGER.warn("shard follow task encounter non-retryable error", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martijnvg Should we also mark the follow-task as failed in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, recently we made a change that specifically not marks tasks as failed: #34404
If the task is marked as failed then it is removed and there is no trace of it other than in the log file of the node the task was running. By keeping the task we can read the fatal error from the ccr stats api. If fatalException
is set then the task will stop any ongoing operations. The user will need to invoke the pause api in order to get the task removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
Thanks @martijnvg for reviewing. |
Both testFollowIndexAndCloseNode and testFailOverOnFollower failed because they responded to the FollowTask a TransportService closed exception which is currently considered as a fatal error. This behavior is not desirable since a closing node can throw that exception, and we should retry in that case. This change adds TransportService closed error to the list of retryable errors. Closes #34694
Both testFollowIndexAndCloseNode and testFailOverOnFollower failed because they responded to the FollowTask a TransportService closed exception which is currently considered as a fatal error. This behavior is not desirable since a closing node can throw that exception, and we should retry in that case. This change adds TransportService closed error to the list of retryable errors. Closes #34694
Both testFollowIndexAndCloseNode and testFailOverOnFollower failed
because they responded to the FollowTask a TransportService closed
exception which is currently considered as a fatal error. This behavior
is not desirable since a closing node can throw that exception, and we
should retry in this case.
This change adds TransportService closed error to the list of retryable
errors.
Closes #34694
This approach is quite ugly - I am open to suggestions.