Skip to content

CCR: Add TransportService closed to retryable errors #34722

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 23, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,8 @@
import org.elasticsearch.indices.IndexClosedException;
import org.elasticsearch.persistent.AllocatedPersistentTask;
import org.elasticsearch.tasks.TaskId;
import org.elasticsearch.transport.NodeDisconnectedException;
import org.elasticsearch.transport.NodeNotConnectedException;
import org.elasticsearch.xpack.ccr.action.bulk.BulkShardOperationsResponse;
import org.elasticsearch.xpack.core.ccr.ShardFollowNodeTaskStatus;

Expand Down Expand Up @@ -369,6 +371,7 @@ private void handleFailure(Exception e, AtomicInteger retryCounter, Runnable tas
scheduler.accept(TimeValue.timeValueMillis(delay), task);
} else {
fatalException = ExceptionsHelper.convertToElastic(e);
LOGGER.warn("shard follow task encounter non-retryable error", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg Should we also mark the follow-task as failed in this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, recently we made a change that specifically not marks tasks as failed: #34404

If the task is marked as failed then it is removed and there is no trace of it other than in the log file of the node the task was running. By keeping the task we can read the fatal error from the ccr stats api. If fatalException is set then the task will stop any ongoing operations. The user will need to invoke the pause api in order to get the task removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

}
}

Expand Down Expand Up @@ -397,7 +400,10 @@ static boolean shouldRetry(Exception e) {
actual instanceof AlreadyClosedException ||
actual instanceof ElasticsearchSecurityException || // If user does not have sufficient privileges
actual instanceof ClusterBlockException || // If leader index is closed or no elected master
actual instanceof IndexClosedException; // If follow index is closed
actual instanceof IndexClosedException || // If follow index is closed
actual instanceof NodeDisconnectedException ||
actual instanceof NodeNotConnectedException ||
(actual.getMessage() != null && actual.getMessage().contains("TransportService is closed"));
}

// These methods are protected for testing purposes:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -271,7 +271,6 @@ public void afterBulk(long executionId, BulkRequest request, Throwable failure)
assertMaxSeqNoOfUpdatesIsTransferred(resolveLeaderIndex("index1"), resolveFollowerIndex("index2"), numberOfShards);
}

@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/34696")
public void testFollowIndexAndCloseNode() throws Exception {
getFollowerCluster().ensureAtLeastNumDataNodes(3);
String leaderIndexSettings = getIndexSettings(3, 1, singletonMap(IndexSettings.INDEX_SOFT_DELETES_SETTING.getKey(), "true"));
Expand Down Expand Up @@ -619,7 +618,6 @@ public void testUnfollowIndex() throws Exception {
assertThat(followerClient().prepareSearch("index2").get().getHits().getTotalHits(), equalTo(2L));
}

@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/34696")
public void testFailOverOnFollower() throws Exception {
int numberOfReplicas = between(1, 2);
getFollowerCluster().startMasterOnlyNode();
Expand Down