Incremental update tooling #40002
Comments
Pinging @elastic/es-distributed
Pinging @elastic/es-core-infra
I wonder if ingest processors might be a way to provide such a DSL. Assuming we have implemented a generic "tag" processor which abides by the 5 rules that you've described, we could write the following:
We rejected that idea a while ago because of the way we organise data nodes and ingest nodes separately.
This is a working example of updating a web session with various properties:
This is updating a session with the latest web log event (a user accessed If a) then this is inefficient - there's no batching of multiple events into a single update. The alternative is that clients are kept simple and they pass bundles of raw events which are reduced by the server-side script. In this case the scripted commands are run repeatedly for each of the documents bundled in an array e.g.:
In this scenario the commands are run for each document in the So the summary of possible reduction techniques are:
I'm leaning towards this "Multiple raw events per update" option.
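To make the "multiple raw events per update" idea a bit more concrete, here is a rough Python model of what a server-side reduction over a bundle of events could do. It is not the Painless script from this issue. The event fields `timestamp` and `url` are assumptions for illustration, as is the decision to skip events that are not newer than `lastUpdated` (which assumes event timestamps within a session are distinct).

```python
from datetime import datetime


def apply_events(session, events):
    """Fold a bundle of raw events into one session document, in place.

    Hypothetical field names: 'timestamp' and 'url' on each event;
    'lastUpdated', 'numPageViews', 'entryPage', 'exitPage' on the session.
    """
    # Process oldest first so 'exitPage' ends up as the latest page seen.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        last = session.get("lastUpdated")
        # Skip events already folded in, so replaying a bundle is a no-op.
        if last is not None and event["timestamp"] <= last:
            continue
        session["numPageViews"] = session.get("numPageViews", 0) + 1
        session.setdefault("entryPage", event["url"])
        session["exitPage"] = event["url"]
        session["lastUpdated"] = event["timestamp"]
    return session


events = [
    {"timestamp": datetime(2019, 3, 1, 10, 5), "url": "/cart"},
    {"timestamp": datetime(2019, 3, 1, 10, 0), "url": "/home"},
]
session = apply_events({}, events)
apply_events(session, events)  # replaying the same bundle changes nothing
```

The client stays simple here: it only ships raw events, and all the reduction logic lives in one place.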
Entity attributes that we see as being of interest:
Each of these mutations may be surrounded with a condition, e.g. we only want to update a count "where status != 200".
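A conditional mutation like the "where status != 200" case might be modelled as below. This is only a sketch in Python: the `where` format with `equals` / `not_equals` keys and the `numErrors` field are made up for illustration and are not part of any existing script or API.

```python
def condition_holds(where, event):
    """Evaluate a tiny, hypothetical condition format against one event."""
    if where is None:
        return True  # no condition means the mutation always applies
    actual = event.get(where["field"])
    if "equals" in where:
        return actual == where["equals"]
    if "not_equals" in where:
        return actual != where["not_equals"]
    raise ValueError("unsupported condition: %r" % (where,))


def increment(doc, command, event):
    """Add command['value'] (default 1) to a counter if the condition holds."""
    if condition_holds(command.get("where"), event):
        field = command["field"]
        doc[field] = doc.get(field, 0) + command.get("value", 1)
    return doc


count_errors = {
    "field": "numErrors",
    "value": 1,
    "where": {"field": "status", "not_equals": 200},
}
doc = {}
increment(doc, count_errors, {"status": 404})  # counted
increment(doc, count_errors, {"status": 200})  # skipped by the condition
```

Keeping the condition as data rather than code is what would let it travel in the params of an update request.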
Data modelling issue: representing sessions
When we're not lucky enough to have session IDs, we might want to break an entity's activity stream into discrete chunks of activity that we could think of as sessions. Using IP address as an entity ID, I was able to use an update script to summarise weblogs and detect new sessions for an IP (where there's at least an hour between the current log record and the previously logged activity). The question is how best to represent these sessions in Elasticsearch? The trade-off seems to be optimising for ease of update versus ease of query/analysis.
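For illustration, the one-hour-gap rule could look something like this in Python. The nested `sessions` list with `start` / `end` fields is just one possible representation (the easy-to-update end of the trade-off), not a recommendation, and the sketch assumes events arrive in ascending time order.

```python
from datetime import datetime, timedelta

# A gap of at least this long starts a new session.
SESSION_GAP = timedelta(hours=1)


def record_activity(entity, timestamp):
    """Extend the entity's latest session or open a new one, in place."""
    sessions = entity.setdefault("sessions", [])
    if sessions and timestamp - sessions[-1]["end"] < SESSION_GAP:
        # Still within an hour of the previous activity: same session.
        sessions[-1]["end"] = timestamp
    else:
        # First activity, or at least an hour of silence: new session.
        sessions.append({"start": timestamp, "end": timestamp})
    return entity


ip_entity = {}
for ts in (
    datetime(2019, 3, 1, 10, 0),
    datetime(2019, 3, 1, 10, 20),
    datetime(2019, 3, 1, 12, 0),
):
    record_activity(ip_entity, ts)
```

After these three log records the entity holds two sessions: one from 10:00 to 10:20 and one that starts at 12:00.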
The latest prototype for a generic upsert script is here, with an example of updating websessions with weblog data to maintain session durations and entry and exit pages.
@jdconrad this has both distrib and core/infra label as well as team-discuss label? Are you intending to discuss this issue in the core/infra sync?
@yannick I think that label may have been added too hastily when we went through all of the scripting issues last month. I'll discuss this within my team first.
This issue has some interesting ideas for a higher level semantic language that makes updates easier. However, in almost 2 years it hasn't gained any traction apart from initial prototyping. Additionally, building this as a layer directly on scripting has long term impact on the ability to change the scripting API, and such a feature may work better as its own dedicated API (as mentioned within the discussion here). We discussed this issue today in our Painless sync, and we believe the complexity and cost of such a feature is not worthwhile at this time. Thus we are closing this issue, but if there is any future desire for it we can always reopen.
Problem
Updating documents via script using _update or _update_by_query is useful but hard to do right.
It's useful because small sets of changes can be applied cheaply (no need to drag full docs to the client).
However, it's hard to write the script because:
Here are some examples of scenarios where I've used complex scripts to update an existing document with new information:
Solution
These sorts of operations could be simplified using a higher-level language designed to update documents. Long-term this may be a new endpoint with a new DSL, but much of what is required can be achieved today using a generic Painless script driven by the params part of an update request. The proposal is that we experiment with features using this scripted approach and later turn it into a Java-based endpoint once we have determined a useful set of operations/syntax.
An example update call to tag a document declaratively may look like this:
This uses an arbitrary query to identify high-risk documents then adds the term high-risk to a field called tags.
Behind the scenes the generic incremental_update script is doing a lot of the work when it sees the add parameter:
- Checks the field name is a supplied parameter, returning a suitable error message if not.
- Checks the tags field is present on the doc, creating a new array if not.
- If the tags field is present but currently a single value, it converts it into an array (avoids this gotcha).
- Adds high-risk to the array if missing.
- Does nothing if high-risk is already listed in the array.
Similar savings can be had with these other commands:
- max - record the larger of an existing doc value and a newly observed value
- min - record the smaller of an existing doc value and a newly observed value
- increment - add a new value to an existing field value
- duration - record the difference between two date fields
- add - add a term to a set of values
- remove - remove a term from a set of values
Being idempotent
In the case of an entity-centric index (e.g. web sessions) which is updated by the latest set of changes from an event-centric index, it's important that changes are only applied once. Typically an incremental update is applied as follows:
1. Use a max aggregation to find the "last update" timestamp on the entity index ("websessions").
2. Use the scroll api to query the event-centric store ("weblogs") for the latest events on or before the last update, sorted by entity id (websessionID) and event logged date, ascending.
3. Use the bulk api to bundle batches of update requests to the entity store (one update per websessionID).
If a failure occurs between 2 and 3 we would need to re-run the batch of changes, which may mean replaying some events already applied to the entity store. This is where it becomes important that the update script is idempotent, e.g. does not increment numPageViews again. This can be achieved if the update script checks the current event date and a lastUpdated date maintained on the entity document.
Going forward
This issue is possibly of interest to the dataframes work going on in ML (Dataframes are currently a one-off data fusion using aggs and I doubt aggs can underpin data gathering/fusion required for incremental updates).
We can use this issue to define and extend the set of operations we think would be useful to express in a declarative fashion. An initial example of a generic painless script is here.
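As a starting point for that discussion, here is a rough Python model of how a generic, params-driven script might interpret some of the commands described under Solution. It is not the Painless script linked above, and the `apply_command(doc, op, params)` shape with `field` / `value` params is an assumption for illustration. The duration command is left out to keep it short.

```python
def apply_command(doc, op, params):
    """Apply one declarative command to `doc` (a plain dict) in place."""
    field = params.get("field")
    if not field:
        raise ValueError("'%s' requires a 'field' parameter" % op)
    value = params.get("value")
    current = doc.get(field)

    if op in ("add", "remove"):
        if current is None:
            if op == "remove":
                return doc  # nothing to remove from
            values = []  # field missing: start a new array
        elif isinstance(current, list):
            values = current
        else:
            values = [current]  # single value: promote to an array
        if op == "add" and value not in values:
            values.append(value)
        elif op == "remove" and value in values:
            values.remove(value)
        doc[field] = values
    elif op == "max":
        doc[field] = value if current is None else max(current, value)
    elif op == "min":
        doc[field] = value if current is None else min(current, value)
    elif op == "increment":
        doc[field] = (current or 0) + value
    else:
        raise ValueError("unknown command: %s" % op)
    return doc


doc = {"tags": "new"}
apply_command(doc, "add", {"field": "tags", "value": "high-risk"})
apply_command(doc, "add", {"field": "tags", "value": "high-risk"})  # no-op
# doc["tags"] is now ["new", "high-risk"]
```

Note that add, remove, max and min are naturally safe to replay, whereas increment is not, which is why increment needs a guard such as the lastUpdated check described under "Being idempotent".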