
[Transform] Support for data stream as transform destination #62712


Closed
mikeh-elastic opened this issue Sep 21, 2020 · 6 comments

Comments

@mikeh-elastic

mikeh-elastic commented Sep 21, 2020

Now that data streams are released, users will expect to be able to run transforms that write to them as a destination.

When I did not create the data stream in advance, but set up the template and let the transform create the destination, it somehow created a regular index even though the template indicated it was supposed to be a data stream.

Creating the data stream manually first via a PUT and setting it as the destination index results in Kibana presenting this error to the user:

{"msg":"[runtime_exception] runtime_exception: Could not create destination index [my-data-stream] for transform [my-data-stream2]","path":"/_transform/my-data-stream2/_start","query":{},"statusCode":500,"response":"{"error":{"root_cause":[{"type":"runtime_exception","reason":"runtime_exception: Could not create destination index [my-data-stream] for transform [my-data-stream2]"}],"type":"runtime_exception","reason":"runtime_exception: Could not create destination index [my-data-stream] for transform [my-data-stream2]","caused_by":{"type":"illegal_state_exception","reason":"index, alias, and data stream names need to be unique, but the following duplicates were found [data stream [my-data-stream] conflicts with index]"}},"status":500}"}

@mikeh-elastic mikeh-elastic added >enhancement :ml/Transform Transform needs:triage Requires assignment of a team area label labels Sep 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

@mikeh-elastic
Author

Once initial support for data streams as a transform destination is complete, there will still be the issue of updating documents after rollover (via ILM or otherwise), which would also need to be solved as per https://www.elastic.co/guide/en/elasticsearch/reference/7.9/use-a-data-stream.html#update-docs-in-a-data-stream-by-query

@hendrikmuhs

hendrikmuhs commented Sep 22, 2020

@mikeh-elastic

Transform supports data streams as a source (this can be improved, see #58504), but a data stream as a destination is problematic by design. Transform does upserts, meaning it overwrites documents, whereas a data stream is append only. The classic transform use case is building an entity-centric index, and a data stream output won't work for that by design.
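To make the upsert versus append-only conflict concrete, here is a minimal toy model in Python. It is only a sketch: `RegularIndex`, `AppendOnlyStream`, `VersionConflict`, and `run_checkpoint` are hypothetical names invented for illustration, not Elasticsearch or transform APIs, and the model assumes one destination document per entity, keyed by the entity name.

```python
class VersionConflict(Exception):
    """Raised when an append-only store is asked to overwrite a document."""


class RegularIndex:
    """Toy model of a regular index: writing an existing ID overwrites it."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc  # upsert: insert or overwrite


class AppendOnlyStream:
    """Toy model of an append-only destination: only new IDs are accepted."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, doc):
        if doc_id in self.docs:
            raise VersionConflict(doc_id)
        self.docs[doc_id] = doc


def run_checkpoint(dest, events):
    """Hypothetical stand-in for one checkpoint: aggregate all events per
    entity and write one document per entity, keyed by the entity name."""
    totals = {}
    for entity, amount in events:
        totals[entity] = totals.get(entity, 0) + amount
    for entity, total in totals.items():
        dest.index(entity, {"entity": entity, "total": total})
```

Running two checkpoints for the same entity against `RegularIndex` simply replaces the earlier document with the new total. The same two checkpoints against `AppendOnlyStream` raise `VersionConflict` on the second write, which is the mismatch described above.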

However, there is one use case that could work: if you have a date_histogram group_by configured on @timestamp, transform could in theory be append only. To make this work, checkpoints have to be aligned with bucket boundaries. This has been requested for other reasons (#61587). I formalized the request and added data stream output as a use case: #62746
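As a rough illustration of why bucket-aligned checkpoints would make the output append only, here is a small Python sketch. It is not how transform is implemented; the function names are hypothetical, timestamps are assumed to be epoch seconds, the interval is assumed fixed, the previous checkpoint is assumed to already sit on a bucket boundary, and late-arriving data is ignored.

```python
def bucket_start(ts, interval):
    """Floor a timestamp (epoch seconds) to the start of its histogram bucket."""
    return ts - (ts % interval)


def aligned_checkpoint(now, interval):
    """Upper bound for a checkpoint that only covers complete buckets."""
    return bucket_start(now, interval)


def buckets_to_write(last_checkpoint, now, interval):
    """Start times of buckets that became complete since the last checkpoint.

    Both bounds sit on bucket boundaries, so each bucket is emitted exactly
    once and is never rewritten by a later checkpoint.
    """
    upper = aligned_checkpoint(now, interval)
    return list(range(last_checkpoint, upper, interval))
```

With a one-hour interval, a checkpoint taken partway through the third hour would write only the first two buckets. The third bucket is written by a later checkpoint once it is complete, so no document ever has to be updated.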

I am aware that updates are possible; however, this seems contradictory to me. I will try to get some clarification on that. It's technically possible to use this (let transform upsert by query), but it seems complex and error prone.

I will change the title to reflect that we only talk about dest, not data streams in general.

@hendrikmuhs hendrikmuhs changed the title [ML] Make transform work with data streams [Transform] Support for data stream as transform destination Sep 22, 2020
@hendrikmuhs

> Now that data streams are released users will expect to be able to run transforms against them.

@mikeh-elastic IMHO this is already possible: you can run a transform against a data stream, you just cannot write the output to a data stream.

In order to avoid confusion for readers of this issue, can you please edit your 1st post?

@hendrikmuhs hendrikmuhs added team-discuss and removed needs:triage Requires assignment of a team area label labels Sep 22, 2020
@mikeh-elastic
Author

I have updated my initial post to clarify data stream as a destination for a transform is the request.

@hendrikmuhs

We discussed this issue in the team.

Data streams are designed to be append only, whereas transform updates data in the destination. Updates require a delete and an insert, which is not compatible with append only; therefore transform, by design, cannot write into a data stream[1].

As a result of the discussion we added a note to the documentation.

[1] We discussed the update by query approach. It would create a lot of complexity, and although it is technically possible, we think data streams are and should be append only (similarly, you could write directly to the backing indices).

3 participants