Incremental update tooling #40002

Closed

markharwood opened this issue Mar 13, 2019 · 11 comments
Assignees
markharwood

Labels
:Core/Infra/Scripting - Scripting abstractions, Painless, and Mustache
:Distributed Indexing/CRUD - A catch all label for issues around indexing, updating and getting a doc by id. Not search.
>enhancement
Team:Core/Infra - Meta label for core/infra team
Team:Distributed (Obsolete) - Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@markharwood
Contributor

markharwood commented Mar 13, 2019

Problem

Updating documents via script using _update or _update_by_query is useful but hard to do right.
It's useful because small sets of changes can be applied cheaply (no need to drag full docs to the client).
However, it's hard to write these scripts because:

  1. People need to understand Painless, and scripts can become verbose and hard to debug.
  2. Document properties can be unexpected (missing versus single-value versus array).
  3. Document values and params need conversion (strings need parsing to dates).
  4. Common operations like deriving durations between dates are fiddly (see the sketch after this list).
  5. Document properties that represent collections are all primitive arrays (code is needed to convert them to Maps, Sets or FIFOs).
  6. Writing idempotent scripts requires care (and idempotency is typically a required feature when syncing batches of changes between indices).
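
To illustrate points 3 and 4: even deriving a duration between two stored dates requires parsing boilerplate. A minimal Painless sketch (the sessionStart/lastAccess/duration field names are hypothetical, and ISO-8601 date strings are assumed):

// _source returns dates as strings, so both must be parsed before a
// duration can be derived. Assumes ISO-8601 values like "2019-01-01T12:00:00Z".
ZonedDateTime start = ZonedDateTime.parse(ctx._source.sessionStart);
ZonedDateTime end = ZonedDateTime.parse(ctx._source.lastAccess);
ctx._source.duration = ChronoUnit.MILLIS.between(start, end);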

Here are some examples of scenarios where I've used complex scripts to update an existing document with new information:

  • An entity-centric index may be used to track web sessions. Derived session properties might include: sessionStart, sessionDuration, entryPage, lastAction and numPagesViewed.
  • A classifier rule may add keyword tags to matching documents (or remove prior tagging if the rule has changed).

Solution

These sorts of operations could be simplified using a higher-level language designed to update documents. Long-term this may be a new endpoint with a new DSL, but much of what is required can be achieved today using a generic Painless script driven by the params part of an update request. The proposal is that we experiment with features using this scripted approach and later turn it into a Java-based endpoint once we have determined a useful set of operations/syntax.

An example update call to tag a document declaratively may look like this:

POST test/_doc/_update_by_query
{
  "query":{....},
  "script": {
	"id": "incremental_update",
	"params": {
	  "add": [{ "field" : "tags",  "value" : "high-risk"}]
	}
  }
}	

This uses an arbitrary query to identify high-risk documents, then adds the term high-risk to a field called tags.
Behind the scenes the generic incremental_update script does a lot of the work when it sees the add parameter (a sketch follows the list):

  1. It checks that the field name is a supplied parameter, returning a suitable error message if not.
  2. It checks if the tags field is present on the doc, creating a new array if not.
  3. If the tags field is present but currently a single value, it converts it into an array (avoiding this gotcha).
  4. It adds the term high-risk to the array if missing.
  5. It aborts the update if the term high-risk is already listed in the array.
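
A minimal Painless sketch of those steps (assuming the field name and term arrive as params.field and params.value; this is not the actual prototype script):

// Step 1: fail fast if the caller forgot the field name.
if (params.field == null) {
  throw new IllegalArgumentException("'add' command requires a 'field' parameter");
}
def field = params.field;
if (ctx._source[field] == null) {
  ctx._source[field] = [params.value];           // step 2: create a new array
} else {
  if (!(ctx._source[field] instanceof List)) {
    ctx._source[field] = [ctx._source[field]];   // step 3: wrap a single value
  }
  if (ctx._source[field].contains(params.value)) {
    ctx.op = 'noop';                             // step 5: term already present, abort
  } else {
    ctx._source[field].add(params.value);        // step 4: add the missing term
  }
}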

Similar savings can be had with these other commands (a sketch of one follows the list):

  • max - record the larger of an existing doc value and a newly observed value
  • min - record the smaller of an existing doc value and a newly observed value
  • increment - add a new value to an existing field value
  • duration - record the difference between two date fields
  • add - add a term to a set of values
  • remove - remove a term from a set of values
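
For instance, a hedged sketch of how the generic script might interpret a numeric max command (a single command passed as params.max is assumed here; nothing in this sketch is a shipped API):

// Assumes a {"field": ..., "value": ...} command with a numeric value.
def cmd = params.max;
def current = ctx._source[cmd.field];
if (current == null || cmd.value > current) {
  ctx._source[cmd.field] = cmd.value;   // record the larger value
} else {
  ctx.op = 'noop';                      // existing value already wins
}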

Being idempotent

In the case of an entity-centric index (e.g. web sessions) which is updated with the latest set of changes from an event-centric index, it's important that changes are only applied once. Typically an incremental update is applied as follows:

  1. Use a max aggregation to find the "last update" timestamp on the entity index ("websessions").
  2. Use the scroll API to query the event-centric store ("weblogs") for the latest events on or after the last update, sorted by entity ID (websessionID) and event logged date, ascending.
  3. Use the bulk API to bundle batches of update requests to the entity store (one update per websessionID).

If a failure occurs between steps 2 and 3 we would need to re-run the batch of changes, which may mean replaying some events already applied in the entity store. This is where it becomes important that the update script is idempotent, e.g. it does not increment numPageViews again. This can be achieved if the update script checks the current event date against a lastUpdated date maintained on the entity document.
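
A hedged Painless sketch of that guard (params.eventDate and the lastUpdate/numPageViews fields are assumptions drawn from the discussion above):

// Skip any event at or before the entity's lastUpdate watermark so that a
// replayed batch cannot double-count.
ZonedDateTime eventDate = ZonedDateTime.parse(params.eventDate);
if (ctx._source.lastUpdate != null
    && !eventDate.isAfter(ZonedDateTime.parse(ctx._source.lastUpdate))) {
  ctx.op = 'noop';                      // already applied: do nothing
} else {
  ctx._source.numPageViews = (ctx._source.numPageViews == null ? 0 : ctx._source.numPageViews) + 1;
  ctx._source.lastUpdate = params.eventDate;
}

Note that skipping on equal timestamps trades possible undercounting for never double-counting; a real implementation would need a tiebreaker such as an event ID.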

Going forward

This issue is possibly of interest to the dataframes work going on in ML (dataframes are currently a one-off data fusion using aggs, and I doubt aggs can underpin the data gathering/fusion required for incremental updates).
We can use this issue to define and extend the set of operations we think would be useful to express in a declarative fashion. An initial example of a generic Painless script is here.

@markharwood markharwood added >enhancement :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache labels Mar 13, 2019
@markharwood markharwood self-assigned this Mar 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@ywelsch
Contributor

ywelsch commented Mar 14, 2019

These sorts of operations could be simplified using a higher-level language designed to update documents. Long-term this may be a new endpoint with a new DSL, but much of what is required can be achieved today using a generic Painless script driven by the params part of an update request. The proposal is that we experiment with features using this scripted approach and later turn it into a Java-based endpoint once we have determined a useful set of operations/syntax.

I wonder if ingest processors might be a way to provide such a DSL. Assume we have implemented a generic "tag" processor which abides by the 5 rules that you've described; we could write the following:

PUT _ingest/pipeline/tag-high-risk
{
  "description" : "high risk tagger",
  "processors" : [ {
      "tag" : {
        "field": "tags",
        "value": "high-risk"
      }
  } ]
}
POST test/_update_by_query?pipeline=tag-high-risk
{
  "query":{....}
  ...
}

@markharwood
Contributor Author

I wonder if ingest processors might be a way to provide such a DSL

We rejected that idea a while ago because of the way we organise data nodes and ingest nodes separately.

@markharwood
Contributor Author

This is a working example of updating a web session with various properties:

POST test/_doc/my_web_session1/_update
{
  "upsert": {},
  "scripted_upsert": true,
  "script": {
	"id": "incremental_update",
	"params": {
	  "commands": [
		{"add": {"field": "userAgents","value": "IE10"}},
		{"size": {"field": "userAgents", "resultField": "numBrowsers"}},
		{
		  "min": { "field": "start", "value": "now",
			"onChange": [
			  { "set": {"entryPage": "www.foo.com/checkout" }}
			]
		  }
		},
		{
		  "max": { "field": "lastAccess", "value": "now",
			"onChange": [
			  { "set": { "exitPage": "www.foo.com/checkout"}}
			]
		  }
		},
		{"diff": { "minField": "start", "maxField": "lastAccess","diffField": "duration"}},
		{ "increment": {"field": "pageViews"}}
	  ],
	  "lastUpdateField": "lastUpdate"
	}
  }
}

This is updating a session with the latest web log event (a user accessed www.foo.com/checkout using IE10). Various session properties are recalculated as a result, but there are various things that I find troubling here.
There's an assumption that the calling client is either:
a) passing one event at a time in the update, or
b) has "thinned out" multiple events and applied some client-side logic.

If a), then this is inefficient - there's no batching of multiple events into a single update.
If b), then we are complicating the client and duplicating logic - the client would need to know that www.foo.com/homepage is the value to set for the entryPage field and that the later www.foo.com/checkout is the value to set for the exitPage field. This client-side reduction of events is similar to the server-side script that merges new data into old as part of updating an existing doc.

The alternative is that clients are kept simple and pass bundles of raw events which are reduced by the server-side script. In this case the scripted commands are run repeatedly, once for each of the documents bundled in an array, e.g.:

POST test/_doc/my_web_session1/_update
{
  "upsert": {},
  "scripted_upsert": true,
  "script": {
	"id": "incremental_update",
	"params": {
	  "data":[
		{ "date":"2019-01-01: 12:00", "url":"www.foo.com/homepage"},
		{ "date":"2019-01-01: 12:01", "url":"www.foo.com/checkout"}
		...
		],
	  "commands": [
		{
		  "min": { "field": "sessionStart", "srcField": "date",
			"onChange": [
			  { "set": {"entryPage": "url" }}
			]
		  }
		},
		{
		  "max": { "field": "lastAccess", "srcField": "date",
			"onChange": [
			  { "set": { "exitPage": "url"}}
			]
		  }
		},
		{"diff": { "minField": "sessionStart", "maxField": "lastAccess", "diffField": "duration"}}
	  ],
	  "lastUpdateField": "lastUpdate"
	}
  }
}

In this scenario the commands are run for each document in the data section. The commands no longer contain data values but instead contain field names that reference values in the current document from the data section. A sketch of that loop follows.
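
A hedged Painless sketch of the per-event replay this implies (applyMin is a hypothetical helper, only the min command is dispatched, and onChange handling is omitted):

// Replay every command against every bundled event before the single write.
void applyMin(Map cmd, Map event, Map source) {
  def candidate = event[cmd.srcField];
  if (source[cmd.field] == null || candidate.compareTo(source[cmd.field]) < 0) {
    source[cmd.field] = candidate;   // new minimum observed
  }
}
for (def event : params.data) {
  for (def command : params.commands) {
    if (command.containsKey('min')) {
      applyMin(command.min, event, ctx._source);
    }
    // max, diff, set etc. would be dispatched the same way
  }
}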

So, in summary, the possible reduction techniques are:

Approach | How | Concerns
--- | --- | ---
One event per update | Update script is invoked for each new event | Can be costly in terms of number of updates - no batching
Pre-reduced events per update | Client reduces new events prior to sending to update script | Duplication of client- and server-side reduction logic
Multiple raw events per update | Client sends raw new events in bundles to update script | Requires multiple passes of update script prior to update
No update scripts | Client pulls session document to client, updates with latest events, then reindexes | Requires custom client-side coding

I'm leaning towards the "multiple raw events per update" option.

@markharwood
Contributor Author

markharwood commented Mar 18, 2019

Entity attributes that we see as being of interest:

Attribute type | Update approach | Example use case
--- | --- | ---
Last known state | Set keyword field with status of most recent event | Security credentials, number of still-running processes
Last sighting date | Set date field with timestamp of most recent event | Expiring web sessions
Last sighted value | Set numeric field with value of most recent logged event | Internet of things: summarise current status
Activity summary | Add to a set of terms (need to cap the size of the set, so use FIFO, LIFO or PQ) | Music recommendations based on collections of listened-to artists
Max/min/avg/sum | Update based on a new logged event's numeric value | Web session: sum of all bytes downloaded
Counts | Increment with each logged event or new unique field value | Session hijack: web sessions that use >1 UserAgent
Time window counts | Like ordinary counts but scoped to a time period and reset to zero when it elapses | Hacker blocking: find IP addresses that use >10 account IDs per day
Ratios | Derive from two numeric values on the entity | Scraper bot detection: countHTMLdownloads vs countJavascriptDownloads ratio
Diffs | Calculate the change between observed values | Calculate web session durations or find space/time travellers

Each of these mutations may be surrounded with a condition, e.g. we only want to update a count "where status != 200". A sketch combining the two follows.
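
As a concrete illustration, a hedged Painless sketch of the "time window counts" row wrapped in such a condition (all field and param names are hypothetical):

// Count failing requests per day, resetting when the window rolls over.
if (params.event.status != 200) {
  String eventDay = params.event.date.substring(0, 10);   // e.g. "2019-01-01"
  if (!eventDay.equals(ctx._source.countWindow)) {
    ctx._source.countWindow = eventDay;                   // new window: reset
    ctx._source.dailyErrorCount = 0;
  }
  ctx._source.dailyErrorCount += 1;
}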

@markharwood
Contributor Author

Data modelling issue: representing sessions

When we're not lucky enough to have session IDs, we might want to break an entity's activity stream into discrete chunks of activity that we could think of as sessions. Using IP address as an entity ID, I was able to use an update script to summarise weblogs and detect new sessions for an IP (where there's at least an hour between the current log record and the previous logged activity). The question is how best to represent these sessions in Elasticsearch. A sketch of the gap detection follows the table.

The trade-off seems to be optimising for ease of update versus ease of query/analysis:

Approach | How | Pros | Cons
--- | --- | --- | ---
Date array in entity doc | An update script can detect the start of a new session and drop the date into an array of "sessionStart" dates | Update scripts offer local patching of data | Analysis options are limited - we can show the multiple dates on a date histogram but they can't be used to "zoom in" on multi-valued fields. Also, sessions don't have any other metadata, e.g. duration or number of requests
Session object array in entity doc | An update script can detect the start of a new session and create rich new session objects inside an array called sessions, which is of type nested | Update scripts offer local patching of data. Sessions can have other metadata, e.g. duration or number of requests | Analysis is technically possible using nested aggs but in practice limited by Kibana's lack of support for nested
New session documents | A custom client would have to read the entity's last session record to determine if a new logged event represents a new session and, if so, create a new session document | Output content is easier to analyse in Kibana | Not as efficient to perform updates (prior history must be transferred to a client before any update can be contemplated)
Time-aligned sessions | The "entity" in our entity-centric index becomes a combination of actor ID (e.g. IP address) and a truncated timestamp (e.g. YYYYMMDDHH) | Simplifies updates and analysis | "True" sessions may be split in two if they straddle top-of-the-hour boundaries
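
A hedged Painless sketch of the hour-gap test behind the "date array" row (params.eventDate and the field names are hypothetical, and ISO-8601 date strings are assumed):

// Start a new session when an hour or more has passed since the last event.
ZonedDateTime eventTime = ZonedDateTime.parse(params.eventDate);
boolean newSession = ctx._source.lastAccess == null
    || ChronoUnit.HOURS.between(ZonedDateTime.parse(ctx._source.lastAccess), eventTime) >= 1;
if (newSession) {
  if (ctx._source.sessionStarts == null) {
    ctx._source.sessionStarts = [];        // first session for this entity
  }
  ctx._source.sessionStarts.add(params.eventDate);
}
ctx._source.lastAccess = params.eventDate;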

@markharwood
Contributor Author

The latest prototype for a generic upsert script is here, with an example of updating websessions with weblog data to maintain session durations and entry and exit pages.

@rjernst rjernst added Team:Core/Infra Meta label for core/infra team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels May 4, 2020
@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@jdconrad jdconrad added team-discuss and removed needs:triage Requires assignment of a team area label labels Dec 9, 2020
@ywelsch
Contributor

ywelsch commented Jan 6, 2021

@jdconrad this has both distrib and core/infra label as well as team-discuss label? Are you intending to discuss this issue in the core/infra sync?

@jdconrad
Contributor

jdconrad commented Jan 6, 2021

@yannick I think that label may have been added too hastily when we went through all of the scripting issues last month. I'll discuss this within my team first.

@rjernst
Member

rjernst commented Jan 7, 2021

This issue has some interesting ideas for a higher level semantic language that makes updates easier. However, in almost 2 years it hasn't gained any traction apart from initial prototyping. Additionally, building this as a layer directly on scripting has long-term impact on the ability to change the scripting API, and such a feature may work better as its own dedicated API (as mentioned within the discussion here). We discussed this issue today in our Painless sync, and we believe the complexity and cost of such a feature is not worthwhile at this time. Thus we are closing this issue, but if there is any future desire for it we can always reopen.

@rjernst rjernst closed this as completed Jan 7, 2021