
[ML][Data Frame] Add update transform api endpoint #45154


Merged: 17 commits merged into elastic:master on Aug 7, 2019

Conversation

benwtrent (Member)

This PR adds support for updating continuous data frame transforms.

Update restrictions:

  • Decided not to allow pivot config updates, as these could drastically change the transform.
  • Updates are only allowed on continuous transforms, as these are the only ones where updating makes sense.
  • Cannot change a continuous transform to a batch transform
  • All options (except for description) require a stop/start of the transform before they take effect. I debated requiring the transform to be stopped before _update could be called, but decided against it as it is not strictly necessary.

First commit is the backend change.
Second commit is the HLRC + Docs change.

closes #43438
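
To make the continuous-only restriction concrete, here is a rough sketch of the server-side check. The method name and surrounding transport-action plumbing are illustrative and the status code is a guess, but the getSyncConfig() == null test and the error message mirror the diff fragments quoted in the review below.

// Illustrative sketch only (not the PR's exact code): a transform without a sync
// config is a batch transform, so the update is rejected up front with the message
// added in this PR.
private void validateAndUpdate(DataFrameTransformConfig config, ActionListener<Response> listener) {
    if (config.getSyncConfig() == null) {
        listener.onFailure(new ElasticsearchStatusException(
            DataFrameMessages.getMessage(
                DataFrameMessages.REST_UPDATE_DATA_FRAME_TRANSFORM_BATCH, config.getId()),
            RestStatus.BAD_REQUEST));   // status code is illustrative
        return;
    }
    // ... otherwise apply the changed fields and index the updated config ...
}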

@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@lcawl (Contributor) left a comment


LGTM with a few suggestions. The Java Client and Elasticsearch Reference build successfully.

@benwtrent (Member, Author)

@elasticmachine update branch

@dimitris-athanasiou (Contributor)

All options (except for description) require the transform to stop/start. I debated requiring the transform to be stopped before _update could be called, but decided against it as it is not strictly necessary.

The downside of this approach is that, after the update and before the restart, the stored config does not match what is actually running. This may become very confusing when debugging problems where a user updates a running transform but no one restarts it for a while.

It is also interesting to look at what we have done for anomaly detection (AD) jobs and datafeeds. For jobs we allow updates while the job is running, and indeed some of the changes are applied only after a restart. However, I believe this is the case because, historically, job updates began with fields that are not used in the analysis (e.g. description). I think that tipped us towards the lenient approach.

On the other hand, for datafeeds we do not allow updates unless the datafeed is stopped. A datafeed's config params all determine how the feed operates, meaning a restart is mandatory for an update to take effect. I think that tipped us towards the strict approach.

I am not strongly opposed to the lenient approach. I just think this is a crucial decision, so I am trying to provide as much info and context as possible to help make the right call.

@hendrikmuhs left a comment


some comments

@@ -26,7 +26,8 @@
public static final String REST_PUT_DATA_FRAME_DEST_SINGLE_INDEX = "Destination index [{0}] should refer to a single index";
public static final String REST_PUT_DATA_FRAME_INCONSISTENT_ID =
"Inconsistent id; ''{0}'' specified in the body differs from ''{1}'' specified as a URL argument";

public static final String REST_UPDATE_DATA_FRAME_TRANSFORM_BATCH =
"Transform with id [{0}] cannot be updated. Updating is only supported on continuous transforms";


What's the reason for this limitation?

benwtrent (Member, Author)

I filled in the details in my overall comment response.

String transformId = request.getId();

// GET transform and attempt to update
dataFrameTransformsConfigManager.getTransformConfiguration(request.getId(), ActionListener.wrap(


Ideally, update should use optimistic locking, which means saving the document's version when it is read and sending it along with the index request for the update.
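
For illustration, a minimal sketch of optimistic locking on the config document, assuming the config is read with a GET and then re-indexed; in 7.x this would use the sequence number and primary term rather than the raw version, and names like configIndexName, docId, updatedConfigSource and listener are placeholders, not the PR's actual code.

// Illustrative sketch: remember the doc's seq_no/primary_term at read time and require
// them on the index request, so a concurrent update fails with a version conflict
// instead of being silently overwritten.
GetRequest get = new GetRequest(configIndexName, docId);
client.get(get, ActionListener.wrap(getResponse -> {
    IndexRequest index = new IndexRequest(configIndexName)
        .id(docId)
        .source(updatedConfigSource, XContentType.JSON)   // the updated config, serialized
        .setIfSeqNo(getResponse.getSeqNo())               // optimistic concurrency control:
        .setIfPrimaryTerm(getResponse.getPrimaryTerm());  // reject if the doc changed meanwhile
    client.index(index, ActionListener.wrap(
        listener::onResponse,
        listener::onFailure));   // a VersionConflictEngineException surfaces here
}, listener::onFailure));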

listener.onResponse(new Response(config));
return;
}
if (config.getSyncConfig() == null) {


As said before, I do not see why we limit this to continuous transforms.

assertThat(updatedConfig.getFrequency(), equalTo(frequency));
assertThat(updatedConfig.getSyncConfig(), equalTo(syncConfig));
assertThat(updatedConfig.getDescription(), equalTo(newDescription));
}


it would be good to add tests regarding header handling

@hendrikmuhs

Update restrictions:

* Decided to not allow pivot config updates. This could drastically change the transform.

👍 Makes sense to me at the moment (with other sync options this might change); we should rather have clone support in the UI.

I had a bad feeling about allowing dest to change, but there might be cases where you use aliases or pipelines, so even though you change dest, the data might end up in the same index.

Updating some fields can introduce problems, so we should have some documentation about it. As a follow-up, I see the need to "reset" a data frame transform, meaning running a full bootstrap again.

* Updates only allowed on continuous transforms as these are the only ones where this makes sense

I think this restriction is confusing; you might want to update the description of a batch transform, for example.

* Cannot change a continuous transform to a batch transform

Can we generalize this? E.g. you cannot change the sync method, but you are allowed to change parameters within the sync config.

* All options (except for description) require the transform to stop/start. I debated requiring the transform to be stopped before `_update` could be called, but decided against it as it is not strictly necessary.

I think this is confusing; we should have a better story for that. E.g. we could check for a new config after every checkpoint. (I first thought about a runtime update, but that has many pitfalls; applying a new config after a checkpoint is also easy to explain and consistent.)

First commit is the backend change.
Second commit is the HLRC + Docs change.

closes #43438

The original issue talks about changing the pivot config, which we do not implement in this PR. Nevertheless, I think it makes sense to close the issue and ask the creator to open new issues for the still-missing functionality, even though we might not agree to implement it at the moment.

@benwtrent (Member, Author)

For updates being only for continuous:

The reason this makes sense to me is that batch transforms should NOT be long-lived. It is a better workflow to simply re-create one instead of attempting to update it, especially if we want to move users towards consistency with their data. Also, if we add a "re-run" option, users may be tempted to update + re-run a batch transform. To me, that complicates the API; users should just create a new one.

Generalizing the sync update:

For sure, I can attempt to address it in this PR. But since we only have one type of synchronization, I thought it too complex to add to an already 2,500-line PR.

we could check for a new config after every checkpoint

That makes sense to me. The action should be rather lightweight, as it is literally a single-document GET with no searching involved. We will probably want to fail on ResourceNotFound failures, and log + continue on any other failure.
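
A minimal sketch of that failure-handling policy, reusing the manager call that appears later in the diff; markAsFailed is a hypothetical placeholder for however the indexer actually fails the task.

// Illustrative sketch: reload the config, fail the transform if the config doc is gone,
// and log + keep the current config on any other failure.
transformsConfigManager.getTransformConfiguration(getJobId(), ActionListener.wrap(
    config -> {
        transformConfig = config;
        logger.debug("[{}] successfully reloaded data frame transform config.", getJobId());
    },
    failure -> {
        if (failure instanceof ResourceNotFoundException) {
            // config was deleted out from under the running task: stop with an error
            markAsFailed("Failed to reload data frame transform config: " + failure.getMessage());
        } else {
            // transient failure: keep running with the previously loaded config
            logger.warn("[{}] failed to reload data frame transform config; keeping the current config.", getJobId());
        }
    }));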

@benwtrent (Member, Author)

run elasticsearch-ci/2

@hendrikmuhs left a comment


The use case for updating a batch transform is updating the description, at least until we have re-run functionality; it might still be useful in collaborative environments. Let's chat about it later.

deleteDataFrameTransform(config.getId());
}

public void testContinuousDataFrameTransformUpdate() throws Exception {


👍

@@ -763,7 +773,30 @@ protected void onFinish(ActionListener<Void> listener) {
logger.debug(
"Finished indexing for data frame transform [" + transformTask.getTransformId() + "] checkpoint [" + checkpoint + "]");
auditBulkFailures = true;
listener.onResponse(null);
if (isContinuous()) {


Maybe onStarted would be the better place? Imagine frequency is set to 1h: a checkpoint finishes, the transform is set back to started (idle), you make the update call, and then the scheduler starts the next checkpoint. You would have to wait another round for the update to be picked up.

benwtrent (Member, Author)

@hendrikmuhs onStarted MAY work. My thought was that onStarted is also called on the initial start, meaning the config was just recently gathered. We might as well not even attempt to gather the config in the executor if we are going to grab it in onStarted?

benwtrent (Member, Author)

I am not sure what users will expect if they have a long frequency. Also, with an exceptionally long frequency, we may want to provide the ability for a _run_now type of endpoint.

benwtrent (Member, Author)

@hendrikmuhs I moved the reload to onStarted as you suggested.

if (isContinuous()) {
transformsConfigManager.getTransformConfiguration(getJobId(), ActionListener.wrap(
config -> {
transformConfig = config;


can you add a debug log?
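
Something like the following, extending the quoted success handler (the message wording is just a suggestion):

config -> {
    transformConfig = config;
    // suggested debug log so config refreshes are visible when tracing an update
    logger.debug("[{}] successfully refreshed data frame transform config.", getJobId());
},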

@hendrikmuhs commented on Aug 6, 2019

I just realized a conceptual problem with my suggestion yesterday to reload the config after a checkpoint.

What if you change the dest index? If you stop/start, we would re-create dest. But with reloading we skip creating the destination index as well as deducing its mappings. Should the update API create dest if it does not exist?

@benwtrent (Member, Author)

@hendrikmuhs good point, I thought about this towards the end of yesterday while testing the endpoint out.

I opted to create the destination from deduced mappings if it does not exist AND the transform has started (i.e. the persistent task exists). If the task does not exist, that means it is stopped, and the destination index will be created when the user calls _start.

This should be OK as PUT, _update, and _start are all master node actions.
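
Roughly, the decision described above; getTransformTask, deduceMappingsAndCreateDestIndex, and indexUpdatedConfig are hypothetical names used only to illustrate the flow.

// Illustrative sketch of the decision above: create the destination index at update time
// only if the transform's persistent task exists (i.e. it is started); otherwise leave
// index creation to _start, exactly as for a fresh PUT followed by _start.
if (getTransformTask(config.getId(), clusterState) != null) {
    // started: deduce mappings from the pivot and create dest now, so the reloaded
    // config can write to it at the next checkpoint
    deduceMappingsAndCreateDestIndex(config, ActionListener.wrap(
        created -> indexUpdatedConfig(config, listener),
        listener::onFailure));
} else {
    // stopped: _start will create the destination index as usual
    indexUpdatedConfig(config, listener);
}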

@benwtrent requested a review from hendrikmuhs on August 6, 2019 at 15:35
@hendrikmuhs left a comment


LGTM

Nice addition!

@benwtrent merged commit 1da7c59 into elastic:master on Aug 7, 2019
@benwtrent deleted the feature/ml-df-add-_update-api branch on August 7, 2019 at 12:28
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 7, 2019
This adds the ability to `_update` stored data frame transforms. All mutable fields are applied when the next checkpoint starts, with the exception of `description`.

This PR contains all that is necessary for this addition:
* HLRC
* Docs
* Server side
benwtrent added a commit that referenced this pull request Aug 7, 2019
…5279)

* [ML][Data Frame] Add update transform api endpoint (#45154)

This adds the ability to `_update` stored data frame transforms. All mutable fields are applied when the next checkpoint starts, with the exception of `description`.

This PR contains all that is necessary for this addition:
* HLRC
* Docs
* Server side
Successfully merging this pull request may close these issues.

[ML] Make Updating Data Frame Transforms Easier