Description
I'm opening this to discuss possible options:
I've been scrutinizing ES indexing performance on the NYC taxi data set (1.2 B taxi rides, numerics heavy: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).
These documents are small (24 fields, though a bit sparse with ~23% cells missing) and are almost entirely numbers (indexed as points + doc values).
As a "ceiling" for indexing performance I also indexed the same data set using Lucene's "thin wrapper" demo server (http://github.com/mikemccand/luceneserver), indexing the same documents as efficiently as I know how (see indexTaxis.py
).
The demo Lucene server has many differences vs. ES: it has no transaction log (does not periodically fsync), uses addDocuments
not updateDocument
, can index from a more compact documents source (190 GB CSV file, vs 512 GB json file for ES), does not add a costly _uid
field (nor _version
, _type
) , uses a streaming bulk API, etc. I disabled _all
and _source
in ES, but net/net ES is substantially slower than the demo Lucene server.
So, one big thing I noticed that is maybe a lowish hanging fruit is that ES loses a lot of its indexing buffer to LiveVersionMap
: if I give ES 1 GB indexing buffer, and index into only 1 shard, and disable refresh, the version map is taking ~2/3 of that buffer, leaving only ~1/3 for Lucene's IndexWriter
:
node0: [2016-08-03 09:39:07,557][DEBUG][index.engine ] [node0] [taxis][0] use refresh to write indexing buffer (heap size=[313.7mb]), to also clear version map (heap size=[730.3mb])
This also means ES is necessarily doing periodic refresh when I didn't ask it to.
This is quite frustrating because I don't need optimistic concurrency here, nor real-time gets, nor refreshes. However, I fear the version map might be required during recovery, to ensure when playing back indexing operations from the transaction log that they do not incorrectly overwrite newer indexing operations? But then, this use case is also append-only, so maybe during recovery we could safely skip that, if the user turns on this new setting.
The version map makes an entry in a HashMap
for each document indexed, and the entry stores non-trivial information, creating at least 4 new objects, holding longs/ints, etc. If we can't make it turn-off-able maybe we should instead try to reduce its per-indexing-op overhead...