
[ML][Data Frame] Add update transform api endpoint #45154


Merged: 17 commits merged into elastic:master on Aug 7, 2019

Conversation

benwtrent (Member)

This PR adds support for updating continuous data frame transforms.

Update restrictions:

  • Decided not to allow pivot config updates, as these could drastically change the transform.
  • Updates are only allowed on continuous transforms, as these are the only ones where updating makes sense.
  • Cannot change a continuous transform to a batch transform
  • All options (except for description) require a stop/start of the transform before they take effect. I debated requiring the transform to be stopped before _update could be called, but decided against it as it is not strictly necessary.

First commit is the backend change.
Second commit is the HLRC + Docs change.

closes #43438
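
To make the continuous-only restriction concrete, here is a rough sketch of the server-side check. The method name and surrounding transport-action plumbing are illustrative and the status code is a guess, but the getSyncConfig() == null test and the error message mirror the diff fragments quoted in the review below.

// Illustrative sketch only (not the PR's exact code): a transform without a sync
// config is a batch transform, so the update is rejected up front with the message
// added in this PR.
private void validateAndUpdate(DataFrameTransformConfig config, ActionListener<Response> listener) {
    if (config.getSyncConfig() == null) {
        listener.onFailure(new ElasticsearchStatusException(
            DataFrameMessages.getMessage(
                DataFrameMessages.REST_UPDATE_DATA_FRAME_TRANSFORM_BATCH, config.getId()),
            RestStatus.BAD_REQUEST));   // status code is illustrative
        return;
    }
    // ... otherwise apply the changed fields and index the updated config ...
}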

@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@lcawl (Contributor) left a comment


LGTM with a few suggestions. The Java Client and Elasticsearch Reference build successfully.

@benwtrent (Member, Author)

@elasticmachine update branch

@dimitris-athanasiou (Contributor)

All options (except for description) require the transform to stop/start. I debated requiring the transform to be stopped before _update could be called, but decided against it as it is not strictly necessary.

The downside of this approach is that, after the update and before the restart, the stored config does not match what is actually running. This may become very confusing when debugging problems where a user updates a running transform but no one restarts it for a while.

It is also interesting to look at what we have done for anomaly detection (AD) jobs and datafeeds. For jobs we allow updates while the job is running, and indeed some of the changes are applied only after a restart. However, I believe this is the case because, historically, job updates began with fields that are not used in the analysis (e.g. description). I think that tipped us towards the lenient approach.

On the other hand, for datafeeds we do not allow updates unless the datafeed is stopped. A datafeed's config params all determine how the feed operates, meaning a restart is mandatory for an update to take effect. I think that tipped us towards the strict approach.

I am not strongly opposed to the lenient approach. I just think this is a crucial decision, so I am trying to provide as much info and context as possible to help make the right call.

@hendrikmuhs left a comment


some comments

@@ -26,7 +26,8 @@
public static final String REST_PUT_DATA_FRAME_DEST_SINGLE_INDEX = "Destination index [{0}] should refer to a single index";
public static final String REST_PUT_DATA_FRAME_INCONSISTENT_ID =
"Inconsistent id; ''{0}'' specified in the body differs from ''{1}'' specified as a URL argument";

public static final String REST_UPDATE_DATA_FRAME_TRANSFORM_BATCH =
"Transform with id [{0}] cannot be updated. Updating is only supported on continuous transforms";


What's the reason for this limitation?

benwtrent (Member, Author)

I filled in the details in my overall comment response.

String transformId = request.getId();

// GET transform and attempt to update
dataFrameTransformsConfigManager.getTransformConfiguration(request.getId(), ActionListener.wrap(


Ideally, update should use optimistic locking, which means saving the document's version when it is read and sending it along with the index request for the update.
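
For illustration, a minimal sketch of optimistic locking on the config document, assuming the config is read with a GET and then re-indexed; in 7.x this would use the sequence number and primary term rather than the raw version, and names like configIndexName, docId, updatedConfigSource and listener are placeholders, not the PR's actual code.

// Illustrative sketch: remember the doc's seq_no/primary_term at read time and require
// them on the index request, so a concurrent update fails with a version conflict
// instead of being silently overwritten.
GetRequest get = new GetRequest(configIndexName, docId);
client.get(get, ActionListener.wrap(getResponse -> {
    IndexRequest index = new IndexRequest(configIndexName)
        .id(docId)
        .source(updatedConfigSource, XContentType.JSON)   // the updated config, serialized
        .setIfSeqNo(getResponse.getSeqNo())               // optimistic concurrency control:
        .setIfPrimaryTerm(getResponse.getPrimaryTerm());  // reject if the doc changed meanwhile
    client.index(index, ActionListener.wrap(
        listener::onResponse,
        listener::onFailure));   // a VersionConflictEngineException surfaces here
}, listener::onFailure));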

listener.onResponse(new Response(config));
return;
}
if (config.getSyncConfig() == null) {


As said before, I do not see why we limit this to continuous transforms.

assertThat(updatedConfig.getFrequency(), equalTo(frequency));
assertThat(updatedConfig.getSyncConfig(), equalTo(syncConfig));
assertThat(updatedConfig.getDescription(), equalTo(newDescription));
}


it would be good to add tests regarding header handling

@hendrikmuhs

Update restrictions:

* Decided to not allow pivot config updates. This could drastically change the transform.

👍 Makes sense to me at the moment (with other sync options this might change); we should rather have clone support in the UI.

I had a bad feeling about allowing dest to change, but there might be cases where you use aliases or pipelines, so even though you change dest, the data might end up in the same index.

Updating some fields can introduce problems, so we should have some documentation about it. As a follow-up, I see the need to "reset" a data frame transform, meaning running a full bootstrap again.

* Updates only allowed on continuous transforms as these are the only ones where this makes sense

I think this restriction is confusing; you might want to update the description of a batch transform, for example.

* Cannot change a continuous transform to a batch transform

Can we generalize this? E.g. you cannot change the sync method, but you are allowed to change parameters within the sync config.

* All options (except for description) require the transform to stop/start. I debated requiring the transform to be stopped before `_update` could be called, but decided against it as it is not strictly necessary.

I think this is confusing; we should have a better story for that. E.g. we could check for a new config after every checkpoint. (I first thought about a runtime update, but that has many pitfalls; applying a new config after a checkpoint is also easy to explain and consistent.)

First commit is the backend change.
Second commit is the HLRC + Docs change.

closes #43438

The original issue talks about changing the pivot config, which we do not implement in this PR. Nevertheless, I think it makes sense to close the issue and ask the creator to open new issues for the still-missing functionality, even though we might not agree to implement it at the moment.

@benwtrent (Member, Author)

For updates being only for continuous:

The reason this makes sense to me is that batch transforms should NOT be long-lived. It is a better workflow to simply re-create one instead of attempting to update it, especially if we want to move users towards consistency with their data. Also, if we add a "re-run" option, users may be tempted to update + re-run a batch transform. To me, that complicates the API; users should just create a new one.

Generalizing the sync update:

For sure, I can attempt to address it in this PR. But since we only have one type of synchronization, I thought it too complex to add to an already 2,500-line PR.

we could check for a new config after every checkpoint

That makes sense to me. The action should be rather lightweight, as it is literally a single-document GET with no searching involved. We will probably want to fail on ResourceNotFound failures, and log + continue on any other failure.
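
A minimal sketch of that failure-handling policy, reusing the manager call that appears later in the diff; markAsFailed is a hypothetical placeholder for however the indexer actually fails the task.

// Illustrative sketch: reload the config, fail the transform if the config doc is gone,
// and log + keep the current config on any other failure.
transformsConfigManager.getTransformConfiguration(getJobId(), ActionListener.wrap(
    config -> {
        transformConfig = config;
        logger.debug("[{}] successfully reloaded data frame transform config.", getJobId());
    },
    failure -> {
        if (failure instanceof ResourceNotFoundException) {
            // config was deleted out from under the running task: stop with an error
            markAsFailed("Failed to reload data frame transform config: " + failure.getMessage());
        } else {
            // transient failure: keep running with the previously loaded config
            logger.warn("[{}] failed to reload data frame transform config; keeping the current config.", getJobId());
        }
    }));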

@benwtrent (Member, Author)

run elasticsearch-ci/2

@hendrikmuhs left a comment


The use case for updating a batch transform is updating the description, at least until we have re-run functionality; it might still be useful in collaborative environments. Let's chat about it later.

deleteDataFrameTransform(config.getId());
}

public void testContinuousDataFrameTransformUpdate() throws Exception {


👍

@@ -763,7 +773,30 @@ protected void onFinish(ActionListener<Void> listener) {
logger.debug(
"Finished indexing for data frame transform [" + transformTask.getTransformId() + "] checkpoint [" + checkpoint + "]");
auditBulkFailures = true;
listener.onResponse(null);
if (isContinuous()) {


Maybe onStarted would be the better place? Imagine frequency is set to 1h: a checkpoint finishes, the transform is set back to started (idle), you make the update call, and then the scheduler starts the next checkpoint. You would have to wait another round for the update to be picked up.

benwtrent (Member, Author)

@hendrikmuhs onStarted MAY work. My thought was that onStarted is also called on the initial start, meaning the config was just recently gathered. We might as well not even attempt to gather the config in the executor if we are going to grab it in onStarted?

benwtrent (Member, Author)

I am not sure what users will expect if they have a long frequency. Also, with an exceptionally long frequency, we may want to provide the ability for a _run_now type of endpoint.

benwtrent (Member, Author)

@hendrikmuhs I moved the reload to onStarted as you suggested.

if (isContinuous()) {
transformsConfigManager.getTransformConfiguration(getJobId(), ActionListener.wrap(
config -> {
transformConfig = config;


can you add a debug log?
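
Something like the following, extending the quoted success handler (the message wording is just a suggestion):

config -> {
    transformConfig = config;
    // suggested debug log so config refreshes are visible when tracing an update
    logger.debug("[{}] successfully refreshed data frame transform config.", getJobId());
},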

@hendrikmuhs commented on Aug 6, 2019

I just realized a conceptual problem with my suggestion yesterday to reload the config after a checkpoint.

What if you change the dest index? If you stop/start, we would re-create dest. But with reloading we skip creating the destination index as well as deducing its mappings. Should the update API create dest if it does not exist?

@benwtrent (Member, Author)

@hendrikmuhs good point, I thought about this towards the end of yesterday while testing the endpoint out.

I opted to create the destination from deduced mappings if it does not exist AND the transform has started (i.e. the persistent task exists). If the task does not exist, that means it is stopped, and the destination index will be created when the user calls _start.

This should be OK as PUT, _update, and _start are all master node actions.
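
Roughly, the decision described above; getTransformTask, deduceMappingsAndCreateDestIndex, and indexUpdatedConfig are hypothetical names used only to illustrate the flow.

// Illustrative sketch of the decision above: create the destination index at update time
// only if the transform's persistent task exists (i.e. it is started); otherwise leave
// index creation to _start, exactly as for a fresh PUT followed by _start.
if (getTransformTask(config.getId(), clusterState) != null) {
    // started: deduce mappings from the pivot and create dest now, so the reloaded
    // config can write to it at the next checkpoint
    deduceMappingsAndCreateDestIndex(config, ActionListener.wrap(
        created -> indexUpdatedConfig(config, listener),
        listener::onFailure));
} else {
    // stopped: _start will create the destination index as usual
    indexUpdatedConfig(config, listener);
}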

@benwtrent requested a review from hendrikmuhs on August 6, 2019 at 15:35
@hendrikmuhs left a comment


LGTM

Nice addition!

@benwtrent merged commit 1da7c59 into elastic:master on Aug 7, 2019
@benwtrent deleted the feature/ml-df-add-_update-api branch on August 7, 2019 at 12:28
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 7, 2019
This adds the ability to `_update` stored data frame transforms. All mutable fields are applied when the next checkpoint starts, with the exception of `description`.

This PR contains all that is necessary for this addition:
* HLRC
* Docs
* Server side
benwtrent added a commit that referenced this pull request Aug 7, 2019
…5279)

* [ML][Data Frame] Add update transform api endpoint (#45154)

This adds the ability to `_update` stored data frame transforms. All mutable fields are applied when the next checkpoint starts, with the exception of `description`.

This PR contains all that is necessary for this addition:
* HLRC
* Docs
* Server side
Successfully merging this pull request may close these issues.

[ML] Make Updating Data Frame Transforms Easier