Can we somehow make the engine's LiveVersionMap tracking optional?

I'm opening this to discuss possible options:

I've been scrutinizing ES indexing performance on the NYC taxi data set (1.2 B taxi rides, numerics heavy: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).

These documents are small (24 fields, though a bit sparse with ~23% cells missing) and are almost entirely numbers (indexed as points + doc values).

As a "ceiling" for indexing performance I also indexed the same data set using Lucene's "thin wrapper" demo server (http://github.com/mikemccand/luceneserver), indexing the same documents as efficiently as I know how (see `indexTaxis.py`).

The demo Lucene server has many differences vs. ES: it has no transaction log (does not periodically fsync), uses `addDocuments` not `updateDocument`, can index from a more compact documents source (190 GB CSV file, vs 512 GB json file for ES), does not add a costly `_uid` field (nor `_version`, `_type`) , uses a streaming bulk API, etc.  I disabled `_all` and `_source` in ES, but net/net ES is substantially slower than the demo Lucene server.

So, one big thing I noticed that is maybe a lowish hanging fruit is that ES loses a lot of its indexing buffer to `LiveVersionMap`: if I give ES 1 GB indexing buffer, and index into only 1 shard, and disable refresh, the version map is taking ~2/3 of that buffer, leaving only ~1/3 for Lucene's `IndexWriter`:

```
node0: [2016-08-03 09:39:07,557][DEBUG][index.engine             ] [node0] [taxis][0] use refresh to write indexing buffer (heap size=[313.7mb]), to also clear version map (heap size=[730.3mb])
```

This also means ES is necessarily doing periodic refresh when I didn't ask it to.

This is quite frustrating because I don't need optimistic concurrency here, nor real-time gets, nor refreshes.  However, I fear the version map might be required during recovery, to ensure when playing back indexing operations from the transaction log that they do not incorrectly overwrite newer indexing operations?  But then, this use case is also append-only, so maybe during recovery we could safely skip that, if the user turns on this new setting.

The version map makes an entry in a `HashMap` for each document indexed, and the entry stores non-trivial information, creating at least 4 new objects, holding longs/ints, etc.  If we can't make it turn-off-able maybe we should instead try to reduce its per-indexing-op overhead...


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Can we somehow make the engine's LiveVersionMap tracking optional? #19787

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Can we somehow make the engine's LiveVersionMap tracking optional? #19787

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions