Skip to content

[ML] only retry persistence failures when the failure is intermittent and stop retrying when analytics job is stopping #53725

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

benwtrent
Copy link
Member

This fixes two issues:

  • Results persister would retry actions even if they are not intermittent. An example of an persistent failure is a doc mapping problem.
  • Data frame analytics would continue to retry to persist results even after the job is stopped.

closes #53687

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Copy link
Member Author

@elasticmachine update branch

@@ -37,12 +38,12 @@
private final String analyticsId;
private final DataFrameDataExtractor dataExtractor;
private final ResultsPersisterService resultsPersisterService;
private Supplier<Boolean> shouldRetryPersistence = () -> true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be final and passed in the constructor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the retry persistence logic depends on information stored within AnalyticsResultProcessor we would either have to make a Builder or a Factory class that we then pass to AnalyticsResultProcessor which constructs a new joiner from that passed factory.

This option just seemed simpler and less complex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it. Please also consider the following approach which might be even less complex:

Add isCancelled field and cancel() method to DataFrameRowsJoiner class. In AnalyticsResultProcessor::cancel(), you can also cancel the joiner:

    public void cancel() {
        isCancelled = true;
        dataFrameRowsJoiner.cancel();
    }

Then there is no need to pass the Supplier as you've already passed the actual isCancelled value.

@@ -148,12 +180,28 @@ public SearchResponse searchWithRetry(SearchRequest searchRequest,
} catch (ElasticsearchException e) {
LOGGER.warn("[" + jobId + "] Exception while executing search action", e);
failureMessage = e.getDetailedMessage();
if (isIrrecoverable(e)) {
LOGGER.warn(new ParameterizedMessage("[{}] experienced irrecoverable failure", jobId), e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make this message consistent with others? "...failure that cannot be..."

* @param ex The exception to check
* @return true when the failure will persist no matter how many times we retry.
*/
static boolean isIrrecoverable(Exception ex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "private"?

() -> resultsPersisterService.searchWithRetry(SEARCH_REQUEST, JOB_ID, () -> true, (s) -> {}));
assertThat(e.getMessage(), containsString("experienced failure that cannot be automatically retried"));

verify(client, times(1)).execute(eq(SearchAction.INSTANCE), eq(SEARCH_REQUEST), any());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think times(1) is default and you can drop it.

… of github.com:benwtrent/elasticsearch into feature/ml-data-frame-analytics-retry-on-failure-fixes
@benwtrent benwtrent requested a review from przemekwitek March 18, 2020 16:49
/**
* List of rest statuses that we consider irrecoverable
*/
public static final Set<RestStatus> IRRECOVERABLE_REST_STATUSES = new HashSet<>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since the name is in capitals it should be immutable, i.e. Set.of() or Collections.unmodifiableSet().

Copy link
Contributor

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benwtrent benwtrent merged commit 783df3f into elastic:master Mar 19, 2020
@benwtrent benwtrent deleted the feature/ml-data-frame-analytics-retry-on-failure-fixes branch March 19, 2020 15:16
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Mar 19, 2020
… and stop retrying when analytics job is stopping (elastic#53725)

This fixes two issues:


- Results persister would retry actions even if they are not intermittent. An example of an persistent failure is a doc mapping problem.
- Data frame analytics would continue to retry to persist results even after the job is stopped.

closes elastic#53687
benwtrent added a commit that referenced this pull request Mar 19, 2020
…ittent and stop retrying when analytics job is stopping (#53725) (#53808)

* [ML] only retry persistence failures when the failure is intermittent and stop retrying when analytics job is stopping (#53725)

This fixes two issues:


- Results persister would retry actions even if they are not intermittent. An example of an persistent failure is a doc mapping problem.
- Data frame analytics would continue to retry to persist results even after the job is stopped.

closes #53687
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ML] Writing Results Retries Continue After Analytics Job Stopped
5 participants