Incremental update tooling #40002

Closed

markharwood opened this issue Mar 13, 2019 · 11 comments
Assignees
markharwood

Labels
:Core/Infra/Scripting - Scripting abstractions, Painless, and Mustache
:Distributed Indexing/CRUD - A catch all label for issues around indexing, updating and getting a doc by id. Not search.
>enhancement
Team:Core/Infra - Meta label for core/infra team
Team:Distributed (Obsolete) - Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@markharwood
Contributor

markharwood commented Mar 13, 2019

Problem

Updating documents via script using _update or _update_by_query is useful but hard to do right.
It's useful because small sets of changes can be applied cheaply (no need to drag full docs to the client).
However, it's hard to write these scripts because:

  1. People need to understand Painless, and scripts can become verbose and hard to debug.
  2. Document properties can be unexpected (missing versus single-value versus array).
  3. Document values and params need conversion (strings need parsing to dates).
  4. Common operations like deriving durations between dates are fiddly (see the sketch after this list).
  5. Document properties that represent collections are all primitive arrays (code is needed to convert them to Maps, Sets or FIFOs).
  6. Writing idempotent scripts requires care (and idempotency is typically a required feature when syncing batches of changes between indices).
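
To illustrate points 3 and 4: even deriving a duration between two stored dates requires parsing boilerplate. A minimal Painless sketch (the sessionStart/lastAccess/duration field names are hypothetical, and ISO-8601 date strings are assumed):

// _source returns dates as strings, so both must be parsed before a
// duration can be derived. Assumes ISO-8601 values like "2019-01-01T12:00:00Z".
ZonedDateTime start = ZonedDateTime.parse(ctx._source.sessionStart);
ZonedDateTime end = ZonedDateTime.parse(ctx._source.lastAccess);
ctx._source.duration = ChronoUnit.MILLIS.between(start, end);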

Here are some examples of scenarios where I've used complex scripts to update an existing document with new information:

  • An entity-centric index may be used to track web sessions. Derived session properties might include: sessionStart, sessionDuration, entryPage, lastAction and numPagesViewed.
  • A classifier rule may add keyword tags to matching documents (or remove prior tagging if the rule has changed).

Solution

These sorts of operations could be simplified using a higher-level language designed to update documents. Long-term this may be a new endpoint with a new DSL, but much of what is required can be achieved today using a generic Painless script driven by the params part of an update request. The proposal is that we experiment with features using this scripted approach and later turn it into a Java-based endpoint once we have determined a useful set of operations/syntax.

An example update call to tag a document declaratively may look like this:

POST test/_doc/_update_by_query
{
  "query":{....},
  "script": {
	"id": "incremental_update",
	"params": {
	  "add": [{ "field" : "tags",  "value" : "high-risk"}]
	}
  }
}	

This uses an arbitrary query to identify high-risk documents, then adds the term high-risk to a field called tags.
Behind the scenes the generic incremental_update script does a lot of the work when it sees the add parameter (a sketch follows the list):

  1. It checks that the field name is a supplied parameter, returning a suitable error message if not.
  2. It checks if the tags field is present on the doc, creating a new array if not.
  3. If the tags field is present but currently a single value, it converts it into an array (avoiding this gotcha).
  4. It adds the term high-risk to the array if missing.
  5. It aborts the update if the term high-risk is already listed in the array.
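
A minimal Painless sketch of those steps (assuming the field name and term arrive as params.field and params.value; this is not the actual prototype script):

// Step 1: fail fast if the caller forgot the field name.
if (params.field == null) {
  throw new IllegalArgumentException("'add' command requires a 'field' parameter");
}
def field = params.field;
if (ctx._source[field] == null) {
  ctx._source[field] = [params.value];           // step 2: create a new array
} else {
  if (!(ctx._source[field] instanceof List)) {
    ctx._source[field] = [ctx._source[field]];   // step 3: wrap a single value
  }
  if (ctx._source[field].contains(params.value)) {
    ctx.op = 'noop';                             // step 5: term already present, abort
  } else {
    ctx._source[field].add(params.value);        // step 4: add the missing term
  }
}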

Similar savings can be had with these other commands (a sketch of one follows the list):

  • max - record the larger of an existing doc value and a newly observed value
  • min - record the smaller of an existing doc value and a newly observed value
  • increment - add a new value to an existing field value
  • duration - record the difference between two date fields
  • add - add a term to a set of values
  • remove - remove a term from a set of values
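
For instance, a hedged sketch of how the generic script might interpret a numeric max command (a single command passed as params.max is assumed here; nothing in this sketch is a shipped API):

// Assumes a {"field": ..., "value": ...} command with a numeric value.
def cmd = params.max;
def current = ctx._source[cmd.field];
if (current == null || cmd.value > current) {
  ctx._source[cmd.field] = cmd.value;   // record the larger value
} else {
  ctx.op = 'noop';                      // existing value already wins
}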

Being idempotent

In the case of an entity-centric index (e.g. web sessions) which is updated with the latest set of changes from an event-centric index, it's important that changes are only applied once. Typically an incremental update is applied as follows:

  1. Use a max aggregation to find the "last update" timestamp on the entity index ("websessions").
  2. Use the scroll API to query the event-centric store ("weblogs") for the latest events on or after the last update, sorted by entity ID (websessionID) and event logged date, ascending.
  3. Use the bulk API to bundle batches of update requests to the entity store (one update per websessionID).

If a failure occurs between steps 2 and 3 we would need to re-run the batch of changes, which may mean replaying some events already applied in the entity store. This is where it becomes important that the update script is idempotent, e.g. it does not increment numPageViews again. This can be achieved if the update script checks the current event date against a lastUpdated date maintained on the entity document.
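
A hedged Painless sketch of that guard (params.eventDate and the lastUpdate/numPageViews fields are assumptions drawn from the discussion above):

// Skip any event at or before the entity's lastUpdate watermark so that a
// replayed batch cannot double-count.
ZonedDateTime eventDate = ZonedDateTime.parse(params.eventDate);
if (ctx._source.lastUpdate != null
    && !eventDate.isAfter(ZonedDateTime.parse(ctx._source.lastUpdate))) {
  ctx.op = 'noop';                      // already applied: do nothing
} else {
  ctx._source.numPageViews = (ctx._source.numPageViews == null ? 0 : ctx._source.numPageViews) + 1;
  ctx._source.lastUpdate = params.eventDate;
}

Note that skipping on equal timestamps trades possible undercounting for never double-counting; a real implementation would need a tiebreaker such as an event ID.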

Going forward

This issue is possibly of interest to the dataframes work going on in ML (dataframes are currently a one-off data fusion using aggs, and I doubt aggs can underpin the data gathering/fusion required for incremental updates).
We can use this issue to define and extend the set of operations we think would be useful to express in a declarative fashion. An initial example of a generic Painless script is here.

@markharwood markharwood added >enhancement :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache labels Mar 13, 2019
@markharwood markharwood self-assigned this Mar 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@ywelsch
Contributor

ywelsch commented Mar 14, 2019

These sorts of operations could be simplified using a higher-level language designed to update documents. Long-term this may be a new endpoint with a new DSL, but much of what is required can be achieved today using a generic Painless script driven by the params part of an update request. The proposal is that we experiment with features using this scripted approach and later turn it into a Java-based endpoint once we have determined a useful set of operations/syntax.

I wonder if ingest processors might be a way to provide such a DSL. Assume we have implemented a generic "tag" processor which abides by the 5 rules that you've described; we could write the following:

PUT _ingest/pipeline/tag-high-risk
{
  "description" : "high risk tagger",
  "processors" : [ {
      "tag" : {
        "field": "tags",
        "value": "high-risk"
      }
  } ]
}
POST test/_update_by_query?pipeline=tag-high-risk
{
  "query":{....}
  ...
}

@markharwood
Contributor Author

I wonder if ingest processors might be a way to provide such a DSL

We rejected that idea a while ago because of the way we organise data nodes and ingest nodes separately.

@markharwood
Contributor Author

This is a working example of updating a web session with various properties:

POST test/_doc/my_web_session1/_update
{
  "upsert": {},
  "scripted_upsert": true,
  "script": {
	"id": "incremental_update",
	"params": {
	  "commands": [
		{"add": {"field": "userAgents","value": "IE10"}},
		{"size": {"field": "userAgents", "resultField": "numBrowsers"}},
		{
		  "min": { "field": "start", "value": "now",
			"onChange": [
			  { "set": {"entryPage": "www.foo.com/checkout" }}
			]
		  }
		},
		{
		  "max": { "field": "lastAccess", "value": "now",
			"onChange": [
			  { "set": { "exitPage": "www.foo.com/checkout"}}
			]
		  }
		},
		{"diff": { "minField": "start", "maxField": "lastAccess","diffField": "duration"}},
		{ "increment": {"field": "pageViews"}}
	  ],
	  "lastUpdateField": "lastUpdate"
	}
  }
}

This is updating a session with the latest web log event (a user accessed www.foo.com/checkout using IE10). Various session properties are recalculated as a result, but there are various things that I find troubling here.
There's an assumption that the calling client is either:
a) passing one event at a time in the update, or
b) has "thinned out" multiple events and applied some client-side logic.

If a), then this is inefficient - there's no batching of multiple events into a single update.
If b), then we are complicating the client and duplicating logic - the client would need to know that www.foo.com/homepage is the value to set for the entryPage field and that the later www.foo.com/checkout is the value to set for the exitPage field. This client-side reduction of events is similar to the server-side script that merges new data into old as part of updating an existing doc.

The alternative is that clients are kept simple and pass bundles of raw events which are reduced by the server-side script. In this case the scripted commands are run repeatedly, once for each of the documents bundled in an array, e.g.:

POST test/_doc/my_web_session1/_update
{
  "upsert": {},
  "scripted_upsert": true,
  "script": {
	"id": "incremental_update",
	"params": {
	  "data":[
		{ "date":"2019-01-01: 12:00", "url":"www.foo.com/homepage"},
		{ "date":"2019-01-01: 12:01", "url":"www.foo.com/checkout"}
		...
		],
	  "commands": [
		{
		  "min": { "field": "sessionStart", "srcField": "date",
			"onChange": [
			  { "set": {"entryPage": "url" }}
			]
		  }
		},
		{
		  "max": { "field": "lastAccess", "srcField": "date",
			"onChange": [
			  { "set": { "exitPage": "url"}}
			]
		  }
		},
		{"diff": { "minField": "sessionStart", "maxField": "lastAccess", "diffField": "duration"}}
	  ],
	  "lastUpdateField": "lastUpdate"
	}
  }
}

In this scenario the commands are run for each document in the data section. The commands no longer contain data values but instead contain field names that reference values in the current document from the data section. A sketch of that loop follows.
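
A hedged Painless sketch of the per-event replay this implies (applyMin is a hypothetical helper, only the min command is dispatched, and onChange handling is omitted):

// Replay every command against every bundled event before the single write.
void applyMin(Map cmd, Map event, Map source) {
  def candidate = event[cmd.srcField];
  if (source[cmd.field] == null || candidate.compareTo(source[cmd.field]) < 0) {
    source[cmd.field] = candidate;   // new minimum observed
  }
}
for (def event : params.data) {
  for (def command : params.commands) {
    if (command.containsKey('min')) {
      applyMin(command.min, event, ctx._source);
    }
    // max, diff, set etc. would be dispatched the same way
  }
}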

So, in summary, the possible reduction techniques are:

Approach | How | Concerns
--- | --- | ---
One event per update | Update script is invoked for each new event | Can be costly in terms of number of updates - no batching
Pre-reduced events per update | Client reduces new events prior to sending to update script | Duplication of client- and server-side reduction logic
Multiple raw events per update | Client sends raw new events in bundles to update script | Requires multiple passes of update script prior to update
No update scripts | Client pulls session document to client, updates with latest events, then reindexes | Requires custom client-side coding

I'm leaning towards the "multiple raw events per update" option.

@markharwood
Contributor Author

markharwood commented Mar 18, 2019

Entity attributes that we see as being of interest:

Attribute type | Update approach | Example use case
--- | --- | ---
Last known state | Set keyword field with status of most recent event | Security credentials, number of still-running processes
Last sighting date | Set date field with timestamp of most recent event | Expiring web sessions
Last sighted value | Set numeric field with value of most recent logged event | Internet of things: summarise current status
Activity summary | Add to a set of terms (need to cap the size of the set, so use FIFO, LIFO or PQ) | Music recommendations based on collections of listened-to artists
Max/min/avg/sum | Update based on a new logged event's numeric value | Web session: sum of all bytes downloaded
Counts | Increment with each logged event or new unique field value | Session hijack: web sessions that use >1 UserAgent
Time window counts | Like ordinary counts but scoped to a time period and reset to zero when it elapses | Hacker blocking: find IP addresses that use >10 account IDs per day
Ratios | Derive from two numeric values on the entity | Scraper bot detection: countHTMLdownloads vs countJavascriptDownloads ratio
Diffs | Calculate the change between observed values | Calculate web session durations or find space/time travellers

Each of these mutations may be surrounded with a condition, e.g. we only want to update a count "where status != 200". A sketch combining the two follows.
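
As a concrete illustration, a hedged Painless sketch of the "time window counts" row wrapped in such a condition (all field and param names are hypothetical):

// Count failing requests per day, resetting when the window rolls over.
if (params.event.status != 200) {
  String eventDay = params.event.date.substring(0, 10);   // e.g. "2019-01-01"
  if (!eventDay.equals(ctx._source.countWindow)) {
    ctx._source.countWindow = eventDay;                   // new window: reset
    ctx._source.dailyErrorCount = 0;
  }
  ctx._source.dailyErrorCount += 1;
}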

@markharwood
Contributor Author

Data modelling issue: representing sessions

When we're not lucky enough to have session IDs, we might want to break an entity's activity stream into discrete chunks of activity that we could think of as sessions. Using IP address as an entity ID, I was able to use an update script to summarise weblogs and detect new sessions for an IP (where there's at least an hour between the current log record and the previous logged activity). The question is how best to represent these sessions in Elasticsearch. A sketch of the gap detection follows the table.

The trade-off seems to be optimising for ease of update versus ease of query/analysis:

Approach | How | Pros | Cons
--- | --- | --- | ---
Date array in entity doc | An update script can detect the start of a new session and drop the date into an array of "sessionStart" dates | Update scripts offer local patching of data | Analysis options are limited - we can show the multiple dates on a date histogram but they can't be used to "zoom in" on multi-valued fields. Also, sessions don't have any other metadata, e.g. duration or number of requests
Session object array in entity doc | An update script can detect the start of a new session and create rich new session objects inside an array called sessions, which is of type nested | Update scripts offer local patching of data. Sessions can have other metadata, e.g. duration or number of requests | Analysis is technically possible using nested aggs but in practice limited by Kibana's lack of support for nested
New session documents | A custom client would have to read the entity's last session record to determine if a new logged event represents a new session and, if so, create a new session document | Output content is easier to analyse in Kibana | Not as efficient to perform updates (prior history must be transferred to a client before any update can be contemplated)
Time-aligned sessions | The "entity" in our entity-centric index becomes a combination of actor ID (e.g. IP address) and a truncated timestamp (e.g. YYYYMMDDHH) | Simplifies updates and analysis | "True" sessions may be split in two if they straddle top-of-the-hour boundaries
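
A hedged Painless sketch of the hour-gap test behind the "date array" row (params.eventDate and the field names are hypothetical, and ISO-8601 date strings are assumed):

// Start a new session when an hour or more has passed since the last event.
ZonedDateTime eventTime = ZonedDateTime.parse(params.eventDate);
boolean newSession = ctx._source.lastAccess == null
    || ChronoUnit.HOURS.between(ZonedDateTime.parse(ctx._source.lastAccess), eventTime) >= 1;
if (newSession) {
  if (ctx._source.sessionStarts == null) {
    ctx._source.sessionStarts = [];        // first session for this entity
  }
  ctx._source.sessionStarts.add(params.eventDate);
}
ctx._source.lastAccess = params.eventDate;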

@markharwood
Contributor Author

The latest prototype for a generic upsert script is here, with an example of updating websessions with weblog data to maintain session durations and entry and exit pages.

@rjernst rjernst added Team:Core/Infra Meta label for core/infra team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels May 4, 2020
@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@jdconrad jdconrad added team-discuss and removed needs:triage Requires assignment of a team area label labels Dec 9, 2020
@ywelsch
Contributor

ywelsch commented Jan 6, 2021

@jdconrad this has both distrib and core/infra label as well as team-discuss label? Are you intending to discuss this issue in the core/infra sync?

@jdconrad
Contributor

jdconrad commented Jan 6, 2021

@yannick I think that label may have been added too hastily when we went through all of the scripting issues last month. I'll discuss this within my team first.

@rjernst
Member

rjernst commented Jan 7, 2021

This issue has some interesting ideas for a higher level semantic language that makes updates easier. However, in almost 2 years it hasn't gained any traction apart from initial prototyping. Additionally, building this as a layer directly on scripting has long-term impact on the ability to change the scripting API, and such a feature may work better as its own dedicated API (as mentioned within the discussion here). We discussed this issue today in our Painless sync, and we believe the complexity and cost of such a feature is not worthwhile at this time. Thus we are closing this issue, but if there is any future desire for it we can always reopen.

@rjernst rjernst closed this as completed Jan 7, 2021