Skip to content

Can we somehow make the engine's LiveVersionMap tracking optional? #19787

Closed
@mikemccand

Description

@mikemccand

I'm opening this to discuss possible options:

I've been scrutinizing ES indexing performance on the NYC taxi data set (1.2 B taxi rides, numerics heavy: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).

These documents are small (24 fields, though a bit sparse with ~23% cells missing) and are almost entirely numbers (indexed as points + doc values).

As a "ceiling" for indexing performance I also indexed the same data set using Lucene's "thin wrapper" demo server (http://github.com/mikemccand/luceneserver), indexing the same documents as efficiently as I know how (see indexTaxis.py).

The demo Lucene server has many differences vs. ES: it has no transaction log (does not periodically fsync), uses addDocuments not updateDocument, can index from a more compact documents source (190 GB CSV file, vs 512 GB json file for ES), does not add a costly _uid field (nor _version, _type) , uses a streaming bulk API, etc. I disabled _all and _source in ES, but net/net ES is substantially slower than the demo Lucene server.

So, one big thing I noticed that is maybe a lowish hanging fruit is that ES loses a lot of its indexing buffer to LiveVersionMap: if I give ES 1 GB indexing buffer, and index into only 1 shard, and disable refresh, the version map is taking ~2/3 of that buffer, leaving only ~1/3 for Lucene's IndexWriter:

node0: [2016-08-03 09:39:07,557][DEBUG][index.engine             ] [node0] [taxis][0] use refresh to write indexing buffer (heap size=[313.7mb]), to also clear version map (heap size=[730.3mb])

This also means ES is necessarily doing periodic refresh when I didn't ask it to.

This is quite frustrating because I don't need optimistic concurrency here, nor real-time gets, nor refreshes. However, I fear the version map might be required during recovery, to ensure when playing back indexing operations from the transaction log that they do not incorrectly overwrite newer indexing operations? But then, this use case is also append-only, so maybe during recovery we could safely skip that, if the user turns on this new setting.

The version map makes an entry in a HashMap for each document indexed, and the entry stores non-trivial information, creating at least 4 new objects, holding longs/ints, etc. If we can't make it turn-off-able maybe we should instead try to reduce its per-indexing-op overhead...

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions