CCR: Add TransportService closed to retryable errors #34722

dnhatn · 2018-10-22T20:16:30Z

Both testFollowIndexAndCloseNode and testFailOverOnFollower failed
because the FollowTask received a "TransportService is closed"
exception, which is currently treated as a fatal error. This behavior
is undesirable: a closing node can throw that exception, and we
should retry in this case.

This change adds the TransportService closed error to the list of
retryable errors.

Closes #34694

This approach is quite ugly - I am open to suggestions.

elasticmachine · 2018-10-22T21:09:54Z

Pinging @elastic/es-distributed

dnhatn · 2018-10-22T23:07:09Z

CI failed because we hit IndexNotFoundException (not retryable) when a node is being closed.

1> [2018-10-22T22:40:25,275][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower0] shard follow task encounter non-retryable error
  1> org.elasticsearch.index.IndexNotFoundException: no such index
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:182) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]

martijnvg

> CI failed because we hit IndexNotFoundException (not retryable) when a node is being closed.

It looks like it failed because of NodeDisconnectedException instead:

1> [2018-10-22T22:40:29,638][INFO ][o.e.t.InternalTestCluster] [testFollowIndexAndCloseNode] Closing random non master node [follower0] current master [follower1] 
  1> [2018-10-22T22:40:29,645][INFO ][o.e.n.Node               ] [testFollowIndexAndCloseNode] stopping ...
  1> [2018-10-22T22:40:29,671][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower0] shard follow task encounter non-retryable error
  1> org.elasticsearch.transport.NodeDisconnectedException: [leaderm0][127.0.0.1:40071][indices:data/read/xpack/ccr/shard_changes] disconnected
  1> [2018-10-22T22:40:29,662][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower1] shard follow task encounter non-retryable error
  1> org.elasticsearch.transport.RemoteTransportException: [follower0][127.0.0.1:45630][indices:data/write/bulk_shard_operations[s]]
  1> Caused by: org.elasticsearch.transport.SendRequestTransportException: [follower0][127.0.0.1:39049][indices:data/write/bulk_shard_operations[s][p]]
  1>    at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:670) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:573) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:561) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.performAction(TransportReplicationAction.java:813) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]

The IndexNotFoundException happens in the testDeleteFollowerIndex() test.

I think we should also retry when NodeDisconnectedException occurs?

martijnvg · 2018-10-23T07:19:14Z

x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/ShardFollowNodeTask.java

+                    (transportError.getMessage() != null && transportError.getMessage().contains("TransportService is closed"))) {
+                    return true;
+                }
+            }


I think adding the check below would also work, and it is shorter:

(actual instanceof NodeNotConnectedException && actual.getMessage().contains("TransportService is closed"))
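
For readers following along, the shorter check suggested above could be sketched roughly as below. This is a minimal standalone illustration, not the code merged in this PR. The exception classes are hypothetical stand-ins declared locally so the example compiles without Elasticsearch on the classpath, and unwrap and shouldRetry are illustrative names. The sketch also treats NodeDisconnectedException as retryable, as proposed earlier in the thread, and keeps the null guard on getMessage() from the original diff, which the one-line suggestion omits.

```java
// Sketch only: the exception types below are local stand-ins (assumptions),
// not the real org.elasticsearch classes of the same names.
final class RetryableErrors {

    static class NodeDisconnectedException extends RuntimeException {
        NodeDisconnectedException(String message) { super(message); }
    }

    static class NodeNotConnectedException extends RuntimeException {
        NodeNotConnectedException(String message) { super(message); }
    }

    static class RemoteTransportException extends RuntimeException {
        RemoteTransportException(String message, Throwable cause) { super(message, cause); }
    }

    // Peel off transport wrappers to reach the underlying failure.
    static Throwable unwrap(Throwable error) {
        Throwable current = error;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; depth < 10; depth++) {
            if (current instanceof RemoteTransportException && current.getCause() != null) {
                current = current.getCause();
            } else {
                break;
            }
        }
        return current;
    }

    // Returns true when the failure looks like a node shutting down or dropping off.
    static boolean shouldRetry(Throwable error) {
        Throwable actual = unwrap(error);
        if (actual instanceof NodeDisconnectedException) {
            return true;
        }
        String message = actual.getMessage();
        // Guard against a null message before matching on its text.
        return actual instanceof NodeNotConnectedException
            && message != null
            && message.contains("TransportService is closed");
    }

    public static void main(String[] args) {
        Throwable wrapped = new RemoteTransportException("remote failure",
            new NodeNotConnectedException("TransportService is closed"));
        System.out.println(shouldRetry(wrapped));
        System.out.println(shouldRetry(new IllegalStateException("no such index")));
    }
}
```

Matching on the message text is brittle if the wording ever changes, which is presumably why the description calls the approach ugly; a plain instanceof check avoids that at the cost of retrying on every NodeNotConnectedException.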

dnhatn · 2018-10-23T12:37:02Z

Thanks @martijnvg for looking into the failure. I've addressed your comment. Could you please have another look?

martijnvg

I left one comment. LGTM

martijnvg · 2018-10-23T12:42:45Z

x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/ShardFollowNodeTask.java

-            actual instanceof IndexClosedException; // If follow index is closed
+            actual instanceof IndexClosedException || // If follow index is closed
+
+            actual instanceof NodeDisconnectedException || actual instanceof NodeNotConnectedException ||


Add actual instanceof NodeNotConnectedException || on a new line?

martijnvg · 2018-10-23T12:43:04Z

x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/ShardFollowNodeTask.java

@@ -369,6 +371,7 @@ private void handleFailure(Exception e, AtomicInteger retryCounter, Runnable tas
            scheduler.accept(TimeValue.timeValueMillis(delay), task);
        } else {
            fatalException = ExceptionsHelper.convertToElastic(e);
+            LOGGER.warn("shard follow task encounter non-retryable error", e);


@martijnvg Should we also mark the follow-task as failed in this case?

No, we recently made a change that specifically avoids marking tasks as failed: #34404

If the task is marked as failed, it is removed and leaves no trace other than in the log file of the node on which it was running. By keeping the task, we can read the fatal error from the CCR stats API. If fatalException is set, the task stops any ongoing operations. The user then needs to invoke the pause API to get the task removed.

Good point.
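
The behavior discussed above (retry on retryable errors, otherwise record the fatal error and keep the task around) might look roughly like the following. This is a simplified sketch under stated assumptions, not the actual ShardFollowNodeTask: FollowTaskSketch, computeDelay, the backoff numbers, and the injected scheduler and retryable predicate are all hypothetical, and the real code's ExceptionsHelper.convertToElastic and TimeValue are left out so the example stands alone. The early return once a fatal error is set is also an assumption, based on the note that the task stops ongoing operations.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Sketch only: every name here is illustrative, not the Elasticsearch API.
final class FollowTaskSketch {

    private final BiConsumer<Long, Runnable> scheduler; // (delay in millis, task to re-run)
    private final Predicate<Exception> retryable;       // decides which failures to retry
    private volatile Exception fatalException;          // kept so a stats call can report it

    FollowTaskSketch(BiConsumer<Long, Runnable> scheduler, Predicate<Exception> retryable) {
        this.scheduler = scheduler;
        this.retryable = retryable;
    }

    // Hypothetical backoff: 50ms doubled per attempt, capped at 5 seconds.
    static long computeDelay(int attempt) {
        long delay = 50L << Math.min(attempt, 7);
        return Math.min(delay, 5000L);
    }

    void handleFailure(Exception e, AtomicInteger retryCounter, Runnable task) {
        if (fatalException != null) {
            return; // already stopped; ignore further failures
        }
        if (retryable.test(e)) {
            int attempt = retryCounter.incrementAndGet();
            scheduler.accept(computeDelay(attempt), task);
        } else {
            // Record the error instead of failing (and thereby removing) the task.
            fatalException = e;
        }
    }

    Exception getFatalException() {
        return fatalException;
    }

    boolean isStopped() {
        return fatalException != null;
    }
}
```

Keeping the exception on the task object is what lets a stats-style call surface it later; failing the task outright would leave only the node's log as evidence.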

dnhatn · 2018-10-23T18:22:51Z

Thanks @martijnvg for reviewing.

dnhatn added the >non-issue and :Distributed Indexing/CCR labels Oct 22, 2018

dnhatn requested review from martijnvg and jasontedor October 22, 2018 20:16

martijnvg reviewed Oct 23, 2018

dnhatn added 2 commits October 23, 2018 08:34

reuse unwrap result

194f74a

Merge branch 'master' into ccr-retry

15fd448

dnhatn requested a review from martijnvg October 23, 2018 12:37

martijnvg approved these changes Oct 23, 2018

newline

1f13509

dnhatn merged commit e242fd2 into elastic:master Oct 23, 2018

dnhatn deleted the ccr-retry branch October 23, 2018 18:23

dnhatn added the backport pending label Oct 23, 2018

dnhatn removed the backport pending label Oct 24, 2018
