Skip to content

CCR: Add TransportService closed to retryable errors #34722

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 23, 2018

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Oct 22, 2018

Both testFollowIndexAndCloseNode and testFailOverOnFollower failed
because they responded to the FollowTask a TransportService closed
exception which is currently considered as a fatal error. This behavior
is not desirable since a closing node can throw that exception, and we
should retry in this case.

This change adds TransportService closed error to the list of retryable
errors.

Closes #34694

This approach is quite ugly - I am open to suggestions.

Both testFollowIndexAndCloseNode and testFailOverOnFollower failed
because they responded to the FollowTask a TransportService closed
exception which is currently considered as a fatal error. This behavior
is not desirable since a closing node can throw that exception, and we
should retry in this case.

This change adds TransportService closed error to the list of retryable
errors.
@dnhatn dnhatn added >non-issue :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Oct 22, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Copy link
Member Author

dnhatn commented Oct 22, 2018

CI failed because we hit IndexNotFoundException (not retryable) when a node is being closed.

1> [2018-10-22T22:40:25,275][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower0] shard follow task encounter non-retryable error
  1> org.elasticsearch.index.IndexNotFoundException: no such index
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:182) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI failed because we hit IndexNotFoundException (not retryable) when a node is being closed.

It looks like it failed because of NodeDisconnectedException instead:

1> [2018-10-22T22:40:29,638][INFO ][o.e.t.InternalTestCluster] [testFollowIndexAndCloseNode] Closing random non master node [follower0] current master [follower1] 
  1> [2018-10-22T22:40:29,645][INFO ][o.e.n.Node               ] [testFollowIndexAndCloseNode] stopping ...
  1> [2018-10-22T22:40:29,671][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower0] shard follow task encounter non-retryable error
  1> org.elasticsearch.transport.NodeDisconnectedException: [leaderm0][127.0.0.1:40071][indices:data/read/xpack/ccr/shard_changes] disconnected
  1> [2018-10-22T22:40:29,662][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower1] shard follow task encounter non-retryable error
  1> org.elasticsearch.transport.RemoteTransportException: [follower0][127.0.0.1:45630][indices:data/write/bulk_shard_operations[s]]
  1> Caused by: org.elasticsearch.transport.SendRequestTransportException: [follower0][127.0.0.1:39049][indices:data/write/bulk_shard_operations[s][p]]
  1>    at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:670) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:573) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:561) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.performAction(TransportReplicationAction.java:813) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]

The IndexNotFoundException happens in the testDeleteFollowerIndex() test.

I think we should also retry when NodeDisconnectedException occurs?

(transportError.getMessage() != null && transportError.getMessage().contains("TransportService is closed"))) {
return true;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you add this below then that will work too and it is shorter:

(actual instanceof NodeNotConnectedException && actual.getMessage().contains("TransportService is closed"))

@dnhatn
Copy link
Member Author

dnhatn commented Oct 23, 2018

Thanks @martijnvg for looking into the failure. I've addressed your comment. Could you please have another look?

@dnhatn dnhatn requested a review from martijnvg October 23, 2018 12:37
Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one commit. LGTM

actual instanceof IndexClosedException; // If follow index is closed
actual instanceof IndexClosedException || // If follow index is closed

actual instanceof NodeDisconnectedException || actual instanceof NodeNotConnectedException ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add actual instanceof NodeNotConnectedException || on a newline?

@@ -369,6 +371,7 @@ private void handleFailure(Exception e, AtomicInteger retryCounter, Runnable tas
scheduler.accept(TimeValue.timeValueMillis(delay), task);
} else {
fatalException = ExceptionsHelper.convertToElastic(e);
LOGGER.warn("shard follow task encounter non-retryable error", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg Should we also mark the follow-task as failed in this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, recently we made a change that specifically not marks tasks as failed: #34404

If the task is marked as failed then it is removed and there is no trace of it other than in the log file of the node the task was running. By keeping the task we can read the fatal error from the ccr stats api. If fatalException is set then the task will stop any ongoing operations. The user will need to invoke the pause api in order to get the task removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

@dnhatn
Copy link
Member Author

dnhatn commented Oct 23, 2018

Thanks @martijnvg for reviewing.

@dnhatn dnhatn merged commit e242fd2 into elastic:master Oct 23, 2018
@dnhatn dnhatn deleted the ccr-retry branch October 23, 2018 18:23
dnhatn added a commit that referenced this pull request Oct 24, 2018
Both testFollowIndexAndCloseNode and testFailOverOnFollower failed
because they responded to the FollowTask a TransportService closed
exception which is currently considered as a fatal error. This behavior
is not desirable since a closing node can throw that exception, and we
should retry in that case.

This change adds TransportService closed error to the list of retryable
errors.

Closes #34694
kcm pushed a commit that referenced this pull request Oct 30, 2018
Both testFollowIndexAndCloseNode and testFailOverOnFollower failed
because they responded to the FollowTask a TransportService closed
exception which is currently considered as a fatal error. This behavior
is not desirable since a closing node can throw that exception, and we
should retry in that case.

This change adds TransportService closed error to the list of retryable
errors.

Closes #34694
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >non-issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] IndexFollowingIT test failures
3 participants