-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[ML] only retry persistence failures when the failure is intermittent and stop retrying when analytics job is stopping #53725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pinging @elastic/ml-core (:ml) |
@elasticmachine update branch |
@@ -37,12 +38,12 @@ | |||
private final String analyticsId; | |||
private final DataFrameDataExtractor dataExtractor; | |||
private final ResultsPersisterService resultsPersisterService; | |||
private Supplier<Boolean> shouldRetryPersistence = () -> true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be final and passed in the constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the retry persistence logic depends on information stored within AnalyticsResultProcessor
we would either have to make a Builder or a Factory class that we then pass to AnalyticsResultProcessor
which constructs a new joiner from that passed factory.
This option just seemed simpler and less complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it. Please also consider the following approach which might be even less complex:
Add isCancelled
field and cancel()
method to DataFrameRowsJoiner
class. In AnalyticsResultProcessor::cancel()
, you can also cancel the joiner:
public void cancel() {
isCancelled = true;
dataFrameRowsJoiner.cancel();
}
Then there is no need to pass the Supplier
as you've already passed the actual isCancelled
value.
@@ -148,12 +180,28 @@ public SearchResponse searchWithRetry(SearchRequest searchRequest, | |||
} catch (ElasticsearchException e) { | |||
LOGGER.warn("[" + jobId + "] Exception while executing search action", e); | |||
failureMessage = e.getDetailedMessage(); | |||
if (isIrrecoverable(e)) { | |||
LOGGER.warn(new ParameterizedMessage("[{}] experienced irrecoverable failure", jobId), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make this message consistent with others? "...failure that cannot be..."
* @param ex The exception to check | ||
* @return true when the failure will persist no matter how many times we retry. | ||
*/ | ||
static boolean isIrrecoverable(Exception ex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be "private"?
() -> resultsPersisterService.searchWithRetry(SEARCH_REQUEST, JOB_ID, () -> true, (s) -> {})); | ||
assertThat(e.getMessage(), containsString("experienced failure that cannot be automatically retried")); | ||
|
||
verify(client, times(1)).execute(eq(SearchAction.INSTANCE), eq(SEARCH_REQUEST), any()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think times(1)
is default and you can drop it.
… of github.com:benwtrent/elasticsearch into feature/ml-data-frame-analytics-retry-on-failure-fixes
/** | ||
* List of rest statuses that we consider irrecoverable | ||
*/ | ||
public static final Set<RestStatus> IRRECOVERABLE_REST_STATUSES = new HashSet<>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since the name is in capitals it should be immutable, i.e. Set.of()
or Collections.unmodifiableSet()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
… and stop retrying when analytics job is stopping (elastic#53725) This fixes two issues: - Results persister would retry actions even if they are not intermittent. An example of an persistent failure is a doc mapping problem. - Data frame analytics would continue to retry to persist results even after the job is stopped. closes elastic#53687
…ittent and stop retrying when analytics job is stopping (#53725) (#53808) * [ML] only retry persistence failures when the failure is intermittent and stop retrying when analytics job is stopping (#53725) This fixes two issues: - Results persister would retry actions even if they are not intermittent. An example of an persistent failure is a doc mapping problem. - Data frame analytics would continue to retry to persist results even after the job is stopped. closes #53687
This fixes two issues:
closes #53687