Append-only indices #18069
Comments
Hi Jason. Thank you for pointing me to the fix. It seems this is a very fresh fix :) Contention is only a small problem compared to version lookups, where you can really see CPU cycles being wasted. Thanks for sharing #9125 and #13857. They both confirm that ES is a general-purpose solution which protects users from unexpected cases. I think in some cases one can ignore the fact that data gets duplicated when a failed bulk is sent to ES again. Of course this would be an important problem for credit card transactions stored in an index, but it is not important at all if one keeps, for example, application monitoring datapoints.
@prog8 which ES version is this? Can you try with 5.0.0 alpha1? E.g. we've made …

A few questions:

- Do your shards have a contained number of segments? Version lookup cost can be linear in the number of segments.
- Are you using ES's IDs? ES assigns IDs in a predictable way (a derivative of Flake IDs), which enables Lucene to sometimes skip whole segments that cannot possibly contain a given ID.
- Do you leave plenty of free RAM for the OS to cache hot pages? Specifically, the terms dict files need to remain hot so the version lookups don't hit disk.

In the past when I've tested this, the hit was much lower than 50% (more like 5%) on the logging use case.

I think if ES used segment-file NRT replication (recently added to Lucene: https://issues.apache.org/jira/browse/LUCENE-5438) instead of document replication, supporting append-only indexing would be much easier, since there is a "single source of truth" (the primary shard) ... and it would mean less CPU on replicas, since they just copy files instead of indexing documents again ... but that would be a very large change ;)
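(For illustration: a minimal sketch of the per-segment pruning that time-ordered IDs make possible. This is not Lucene's actual lookup code and the class/method names are invented, but Terms.getMin()/getMax() and TermsEnum.seekExact() are real Lucene APIs. Because Flake-style IDs are roughly increasing over time, older segments tend to fail the min/max range check for fresh IDs and are skipped without any term-dictionary seek.)

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public final class IdLookupSketch {

    /** Returns true if the given _id exists, skipping segments whose
     *  term range cannot possibly contain it. */
    static boolean exists(IndexReader reader, BytesRef id) throws IOException {
        for (LeafReaderContext leaf : reader.leaves()) {
            Terms terms = leaf.reader().terms("_id");
            if (terms == null) {
                continue; // this segment has no _id field at all
            }
            // Time-prefixed IDs cluster by segment: if the id falls outside
            // this segment's [min, max] term range, skip it with no seek.
            if (id.compareTo(terms.getMin()) < 0 || id.compareTo(terms.getMax()) > 0) {
                continue;
            }
            TermsEnum te = terms.iterator();
            if (te.seekExact(id)) {
                return true; // found: a real lookup would now read the version
            }
        }
        return false;
    }
}
```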
@mikemccand - this was ES 5.0 alpha1 (and 2).
In order to reduce expensive merges we've allowed a larger number of segments. Sounds like that's more expensive for version lookups, but if we change settings to reduce the number of segments then we'll pay the price in segment merges.
Yes.
Hmm, that's usually the wrong tradeoff, but, yes, if you relax merging (allow more segments in the index) then version lookups get slower.
Yeah, we used ES 5.0 alpha2 built directly from the master branch, which means we are really up to date with all bug fixes and improvements.
It means we are in a position where we lose indexing throughput either to slower version lookups (more segments) or to more merges (a more aggressive merge policy).
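(For illustration: the same tradeoff expressed against Lucene's TieredMergePolicy, which Elasticsearch uses under the hood. The values below are made up; the point is that raising segments-per-tier relaxes merging at the cost of more segments to consult per version lookup.)

```java
import org.apache.lucene.index.TieredMergePolicy;

public final class MergeTradeoffSketch {
    public static void main(String[] args) {
        TieredMergePolicy mp = new TieredMergePolicy();
        // More segments allowed per tier -> fewer and cheaper merges ...
        mp.setSegmentsPerTier(50.0); // Lucene's default is 10
        // ... but every version lookup may now touch up to 5x more segments.
        System.out.println("segments per tier: " + mp.getSegmentsPerTier());
        // Attach via IndexWriterConfig.setMergePolicy(mp) when building the writer.
    }
}
```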
I agree this is a very common case; we always use immutable indices (for logging purposes, and because we are using HDFS => Pig => ES) and are looking for any fix that could improve the indexing rate. Could you please share what you did? Many thanks.
Hi @ebuildy. I used a naive and ugly approach, just to see whether disabling versions helps and to be able to quickly show a few numbers. I think you shouldn't follow this path :) What I actually did was simply comment out all lines related to versions in InternalEngine.
Good to know, thank you.
@prog8 @otisg Thanks for all the answers.

How small are these documents? The version lookup cost will be a proportionally higher percentage the smaller the documents are. In the extreme (you index just the …) the lookup dominates.

Does the OS have enough free RAM to keep the terms dict fully hot?

ES used to have an optimization to skip the version lookup when it had auto-generated a new ID; however, this proved to be dangerous, with error cases where a node-to-node retry within ES could result in cross-shard corruption (replica and primary out of sync) because the replica indexed duplicate documents. #9125 has more details. Maybe we need to revisit whether there is a safe way to re-enable this ...

Before that, @s1monw had long ago pursued a branch to optimize for the append-only use case in ES, but that proved to be too complex a "fork" of ES's sources, I think.

I still believe the "right" solution to this problem would be for ES to switch to Lucene's new NRT replication (https://issues.apache.org/jira/browse/LUCENE-5438). Because it replicates at the segment-file level, it is not possible for a replica to become out of sync versus the primary. But that is an enormous change and has its own complex tradeoffs.

One thing ES should do (this was @rmuir's idea) is to use the full binary term space when indexing auto-generated IDs.
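(For illustration: a compact sketch of the optimization described above and why retries made it unsafe. All names here are invented for the sketch; this is not Elasticsearch's actual code.)

```java
// Sketch of the removed skip-version-lookup optimization and its hazard.
final class AppendOnlyOptimizationSketch {

    static final class IndexRequest {
        final String id;
        final boolean autoGeneratedId; // ID minted by the coordinating node
        final boolean isRetry;         // request may already have been delivered

        IndexRequest(String id, boolean autoGeneratedId, boolean isRetry) {
            this.id = id;
            this.autoGeneratedId = autoGeneratedId;
            this.isRetry = isRetry;
        }
    }

    void index(IndexRequest request) {
        if (request.autoGeneratedId && !request.isRetry) {
            // A fresh auto-generated ID cannot exist in the index yet,
            // so the per-segment version lookup can be skipped entirely.
            addDocument(request);
        } else {
            // A retried request may already have been applied on a replica;
            // blindly appending here could leave the replica with a duplicate
            // while the primary holds one copy -- the failure mode in #9125.
            long version = lookupCurrentVersion(request.id);
            updateDocument(request, version);
        }
    }

    // Stand-ins for the real engine internals.
    void addDocument(IndexRequest r) {}
    void updateDocument(IndexRequest r, long version) {}
    long lookupCurrentVersion(String id) { return -1L; }
}
```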
I opened #18154 to index auto-generated IDs using the full binary term space.
@mikemccand thank you for this very extensive response.

Our documents are really small. A single document contains a couple of text fields (3-5) holding short keywords. There are also 4 pure doc-values fields. _all and _source are disabled. This means the "versions problem" is more visible in such a case.

I looked at the size of the .tim files. The total size of all .tim files is >100GB, while the machine has only 8GB RAM (half of it used by the JVM). This leaves little room for OS caches. I am just wondering: if the problem were term dictionaries that are not fully hot, shouldn't we see high CPU wait time and many IOPS? This is not the case. CPU wait time is close to 2% and CPU user time is close to 90%.

Thank you for the information about previous attempts at an append-only ES. Once again, thanks for all the information. This is really helpful.
I think there's nothing more to do on this issue, so will close.
Wow, that's a very large terms dict. But, yes, high CPU utilization means the OS is somehow keeping things hot (not sure how).
Well, there would be downsides to it as well, e.g. higher NRT refresh latency, and merged segments need to be moved on the wire too (we could maybe fix that, but that's also hairy), so it's more network traffic within the cluster. But it would also give precise searcher versions across the primary and all replicas, so you are searching the exact point-in-time view regardless of which replica you use. I don't know of anyone exploring doing this for ES now ... it would be a massive change.
This is correct, it is not being explored at this time.
@mikemccand, @jasontedor Thanks for the information. I believe that when LUCENE-5438 is taken into consideration for ES, I will find the GitHub issue about it :)
BTW - this is also addressed in 5.0 with #20211, and there are more ideas on the table.
Hi,
Elasticsearch currently doesn't have a parameter which switches indices into append-only mode. This could be interesting from a performance point of view.
We are dealing with a problem of limited ingestion rate in ES. We tried to focus on a couple of hot spots to resolve potential bottlenecks. One of them was DocValues merges for sparse fields; then, after we changed our approach, the synchronous translog became the bottleneck. It even seems that the contention problem is not negligible: #18053. These are only a few of the places where we tried to find optimizations. While analyzing profiles in Java Mission Control, it was hard to ignore another major problem: version lookups. See the image below.
The profiler shows that version lookups take up to 25% of CPU time. This is a lot. Of course, one could say I am showing a very specific case because I am using tiny documents in my tests. I think this is not exactly true. The Elasticsearch stack is heavily focused on log aggregation and recently also on metrics (Logstash, Kibana, Marvel, Beats, ...). In most systems designed for metric or log aggregation, indices are append-only. Of course ES is a general-purpose distributed search and analytics engine, so document versions are really important ... but in many cases an append-only mode could be sufficient.
One of the major internal strengths of Lucene is that segments are append-only. Why not use this simple but powerful property at a higher level? Not all ES use cases require the concurrent-update functionality in which version control really matters. What if Elasticsearch indices could be marked as append-only when they are created? Indices already have properties which cannot be changed after the index is created (e.g. number of shards). This could be another parameter which can be applied only on index creation.
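(For illustration: a hypothetical sketch of how such a creation-time flag might gate the version lookup. The setting name "index.append_only" and all class/method names here are invented; no such setting exists in Elasticsearch.)

```java
import java.util.Map;

// Hypothetical: "index.append_only" is NOT a real Elasticsearch setting.
final class AppendOnlyEngineSketch {

    private final boolean appendOnly;

    AppendOnlyEngineSketch(Map<String, String> indexSettings) {
        // Read once at index creation and never changed afterwards,
        // just like index.number_of_shards.
        this.appendOnly = Boolean.parseBoolean(
                indexSettings.getOrDefault("index.append_only", "false"));
    }

    long resolveVersion(String id) {
        if (appendOnly) {
            // Every document is guaranteed new: skip the lookup entirely.
            return 1L;
        }
        return lookupVersionInSegments(id); // the costly per-segment walk
    }

    private long lookupVersionInSegments(String id) {
        return -1L; // stand-in for the real terms-dict lookup
    }
}
```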
The contention problem mentioned in #18053 would also become less painful with append-only indices. The version lookup sits inside a synchronized block in InternalEngine#innerIndex. This means threads stay in the synchronized section longer than needed (in cases where versions are not needed).
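(For illustration: a simplified view of that locking pattern, paraphrased rather than excerpted from the 5.0-era InternalEngine; names and the stripe count are invented.)

```java
import java.io.IOException;

// Simplified sketch of version lookup under a per-uid stripe lock.
final class InnerIndexSketch {

    private final Object[] dirtyLocks = new Object[64];

    InnerIndexSketch() {
        for (int i = 0; i < dirtyLocks.length; i++) {
            dirtyLocks[i] = new Object();
        }
    }

    // Lock striping by uid hash, similar in spirit to InternalEngine's dirtyLock.
    private Object dirtyLock(String uid) {
        return dirtyLocks[Math.floorMod(uid.hashCode(), dirtyLocks.length)];
    }

    void innerIndex(String uid, byte[] source) throws IOException {
        synchronized (dirtyLock(uid)) {
            // The version lookup runs while the stripe lock is held, so other
            // writers hashing to the same stripe queue behind a potentially
            // disk-touching operation. For an append-only index the answer is
            // always "not found", making that lock time pure overhead.
            long currentVersion = loadCurrentVersion(uid);
            addToLucene(uid, source, currentVersion);
        }
    }

    private long loadCurrentVersion(String uid) throws IOException {
        return -1L; // stand-in for the per-segment terms-dict lookup
    }

    private void addToLucene(String uid, byte[] source, long version) {
        // stand-in for the actual Lucene indexing call
    }
}
```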
During our tests we disabled version control in Elasticsearch, which brought a significant improvement in indexing speed. Look at the chart below. Indexing speed on the same indices (without removing old data - just swapping in the patched ES build) increased by 50%, purely from removing version control.
My test environment is not very powerful: only 2 machines with 4 CPU cores each. Indices have only one replica.
https://apps.sematext.com/spm-reports/s/ucSD8cuTRr

I would really appreciate any feedback. Am I missing something obvious? Thanks