Append-only indices #18069
Comments
Hi Jason. Thank you for pointing me to the fix. It seems this is a very fresh fix :) Contention is only a small problem compared to version lookups, where you can really see CPU cycles being wasted. Thanks for sharing #9125 and #13857. They both confirm that ES is a general-purpose solution which protects users from unexpected cases. I think in some cases one can ignore the fact that data gets duplicated when a failed bulk is sent to ES again. Of course this would be an important problem for credit card transactions stored in an index, but it is not important at all if one keeps, for example, application monitoring datapoints.
@prog8 which ES version is this? Can you try with 5.0.0 alpha1? E.g. we've made …

A few questions:

- Do your shards have a contained number of segments? Version lookup cost can be linear in the number of segments.
- Are you using ES's IDs? ES assigns IDs in a predictable way (a derivative of Flake IDs), which enables Lucene to sometimes skip whole segments that cannot possibly contain a given ID.
- Do you leave plenty of free RAM for the OS to cache hot pages? Specifically, the terms dict files need to remain hot so the version lookups don't hit disk.

In the past when I've tested this, the hit was much lower than 50% (more like 5%) on the logging use case.

I think if ES used segment-file NRT replication (recently added to Lucene: https://issues.apache.org/jira/browse/LUCENE-5438) instead of document replication, supporting append-only indexing would be much easier, since there is a "single source of truth" (the primary shard) ... and it would mean less CPU on replicas, since they just copy files instead of indexing documents again ... but that would be a very large change ;)
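(For illustration: a minimal sketch of the per-segment pruning that time-ordered IDs make possible. This is not Lucene's actual lookup code and the class/method names are invented, but Terms.getMin()/getMax() and TermsEnum.seekExact() are real Lucene APIs. Because Flake-style IDs are roughly increasing over time, older segments tend to fail the min/max range check for fresh IDs and are skipped without any term-dictionary seek.)

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public final class IdLookupSketch {

    /** Returns true if the given _id exists, skipping segments whose
     *  term range cannot possibly contain it. */
    static boolean exists(IndexReader reader, BytesRef id) throws IOException {
        for (LeafReaderContext leaf : reader.leaves()) {
            Terms terms = leaf.reader().terms("_id");
            if (terms == null) {
                continue; // this segment has no _id field at all
            }
            // Time-prefixed IDs cluster by segment: if the id falls outside
            // this segment's [min, max] term range, skip it with no seek.
            if (id.compareTo(terms.getMin()) < 0 || id.compareTo(terms.getMax()) > 0) {
                continue;
            }
            TermsEnum te = terms.iterator();
            if (te.seekExact(id)) {
                return true; // found: a real lookup would now read the version
            }
        }
        return false;
    }
}
```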
@mikemccand - this was ES 5.0 alpha1 (and 2).
In order to reduce expensive merges we've allowed a larger number of segments. Sounds like that's more expensive for version lookups, but if we change settings to reduce the number of segments then we'll pay the price in segment merges.
Yes.
Hmm, that's usually the wrong tradeoff, but, yes, if you relax merging (allow more segments in the index) then version lookups get slower.
Yeah, we used ES 5.0 alpha2 built directly from the master branch, which means we are really up to date with all bug fixes and improvements.
It means we are in a position where we lose indexing throughput either to slower version lookups (more segments) or to more merges (a more aggressive merge policy).
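(For illustration: the same tradeoff expressed against Lucene's TieredMergePolicy, which Elasticsearch uses under the hood. The values below are made up; the point is that raising segments-per-tier relaxes merging at the cost of more segments to consult per version lookup.)

```java
import org.apache.lucene.index.TieredMergePolicy;

public final class MergeTradeoffSketch {
    public static void main(String[] args) {
        TieredMergePolicy mp = new TieredMergePolicy();
        // More segments allowed per tier -> fewer and cheaper merges ...
        mp.setSegmentsPerTier(50.0); // Lucene's default is 10
        // ... but every version lookup may now touch up to 5x more segments.
        System.out.println("segments per tier: " + mp.getSegmentsPerTier());
        // Attach via IndexWriterConfig.setMergePolicy(mp) when building the writer.
    }
}
```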
I agree this is a very common case; we always use immutable indices (for logging purposes, and because we are using HDFS => Pig => ES) and are looking for any fix that could improve the indexing rate. Could you please share what you did? Many thanks.
Hi @ebuildy. I used a naive and ugly approach, just to see whether disabling versions helps and to be able to quickly show a few numbers. I think you shouldn't follow this path :) What I actually did was simply comment out all lines related to versions in InternalEngine.
Good to know, thank you.
@prog8 @otisg Thanks for all the answers.

How small are these documents? The version lookup cost will be a proportionally higher percentage the smaller the documents are. In the extreme (you index just the …) the lookup dominates.

Does the OS have enough free RAM to keep the terms dict fully hot?

ES used to have an optimization to skip the version lookup when it had auto-generated a new ID; however, this proved to be dangerous, with error cases where a node-to-node retry within ES could result in cross-shard corruption (replica and primary out of sync) because the replica indexed duplicate documents. #9125 has more details. Maybe we need to revisit whether there is a safe way to re-enable this ...

Before that, @s1monw had long ago pursued a branch to optimize for the append-only use case in ES, but that proved to be too complex a "fork" of ES's sources, I think.

I still believe the "right" solution to this problem would be for ES to switch to Lucene's new NRT replication (https://issues.apache.org/jira/browse/LUCENE-5438). Because it replicates at the segment-file level, it is not possible for a replica to become out of sync versus the primary. But that is an enormous change and has its own complex tradeoffs.

One thing ES should do (this was @rmuir's idea) is to use the full binary term space when indexing auto-generated IDs.
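(For illustration: a compact sketch of the optimization described above and why retries made it unsafe. All names here are invented for the sketch; this is not Elasticsearch's actual code.)

```java
// Sketch of the removed skip-version-lookup optimization and its hazard.
final class AppendOnlyOptimizationSketch {

    static final class IndexRequest {
        final String id;
        final boolean autoGeneratedId; // ID minted by the coordinating node
        final boolean isRetry;         // request may already have been delivered

        IndexRequest(String id, boolean autoGeneratedId, boolean isRetry) {
            this.id = id;
            this.autoGeneratedId = autoGeneratedId;
            this.isRetry = isRetry;
        }
    }

    void index(IndexRequest request) {
        if (request.autoGeneratedId && !request.isRetry) {
            // A fresh auto-generated ID cannot exist in the index yet,
            // so the per-segment version lookup can be skipped entirely.
            addDocument(request);
        } else {
            // A retried request may already have been applied on a replica;
            // blindly appending here could leave the replica with a duplicate
            // while the primary holds one copy -- the failure mode in #9125.
            long version = lookupCurrentVersion(request.id);
            updateDocument(request, version);
        }
    }

    // Stand-ins for the real engine internals.
    void addDocument(IndexRequest r) {}
    void updateDocument(IndexRequest r, long version) {}
    long lookupCurrentVersion(String id) { return -1L; }
}
```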
I opened #18154 to index auto-generated IDs using the full binary term space.
@mikemccand thank you for this very extensive response.

Our documents are really small. A single document contains a couple of text fields (3-5) holding short keywords. There are also 4 pure doc-values fields. _all and _source are disabled. This means the "versions problem" is more visible in such a case.

I looked at the size of the .tim files. The total size of all .tim files is >100GB, while the machine has only 8GB RAM (half of it used by the JVM). This leaves little room for OS caches. I am just wondering: if the problem were term dictionaries that are not fully hot, shouldn't we see high CPU wait time and many IOPS? This is not the case. CPU wait time is close to 2% and CPU user time is close to 90%.

Thank you for the information about previous attempts at an append-only ES. Once again, thanks for all the information. This is really helpful.
I think there's nothing more to do on this issue, so will close.
Wow, that's a very large terms dict. But, yes, high CPU utilization means the OS is somehow keeping things hot (not sure how).
Well, there would be downsides to it as well, e.g. higher NRT refresh latency, and merged segments need to be moved on the wire too (we could maybe fix that, but that's also hairy), so it's more network traffic within the cluster. But it would also give precise searcher versions across the primary and all replicas, so you are searching the exact point-in-time view regardless of which replica you use. I don't know of anyone exploring doing this for ES now ... it would be a massive change.
This is correct, it is not being explored at this time.
@mikemccand, @jasontedor Thanks for the information. I believe that when LUCENE-5438 is taken into consideration for ES, I will find the GitHub issue about it :)
BTW - this is also addressed in 5.0 with #20211, and there are more ideas on the table.
Hi,
Elasticsearch currently doesn't have a parameter which switches indices into append-only mode. This could be interesting from a performance point of view.
We are dealing with a problem of limited ingestion rate in ES. We tried to focus on a couple of hot spots to resolve potential bottlenecks. One of them was DocValues merges for sparse fields; then, after we changed our approach, the synchronous translog became the bottleneck. It even seems that the contention problem is not negligible: #18053. These are only a few of the places where we tried to find optimizations. While analyzing profiles in Java Mission Control, it was hard to ignore another major problem: version lookups. See the image below.
The profiler shows that version lookups take up to 25% of CPU time. This is a lot. Of course, one could say I am showing a very specific case because I am using tiny documents in my tests. I think this is not exactly true. The Elasticsearch stack is heavily focused on log aggregation and recently also on metrics (Logstash, Kibana, Marvel, Beats, ...). In most systems designed for metric or log aggregation, indices are append-only. Of course ES is a general-purpose distributed search and analytics engine, so document versions are really important ... but in many cases an append-only mode could be sufficient.
One of the major internal strengths of Lucene is that segments are append-only. Why not use this simple but powerful property at a higher level? Not all ES use cases require the concurrent-update functionality in which version control really matters. What if Elasticsearch indices could be marked as append-only when they are created? Indices already have properties which cannot be changed after the index is created (e.g. number of shards). This could be another parameter which can be applied only on index creation.
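(For illustration: a hypothetical sketch of how such a creation-time flag might gate the version lookup. The setting name "index.append_only" and all class/method names here are invented; no such setting exists in Elasticsearch.)

```java
import java.util.Map;

// Hypothetical: "index.append_only" is NOT a real Elasticsearch setting.
final class AppendOnlyEngineSketch {

    private final boolean appendOnly;

    AppendOnlyEngineSketch(Map<String, String> indexSettings) {
        // Read once at index creation and never changed afterwards,
        // just like index.number_of_shards.
        this.appendOnly = Boolean.parseBoolean(
                indexSettings.getOrDefault("index.append_only", "false"));
    }

    long resolveVersion(String id) {
        if (appendOnly) {
            // Every document is guaranteed new: skip the lookup entirely.
            return 1L;
        }
        return lookupVersionInSegments(id); // the costly per-segment walk
    }

    private long lookupVersionInSegments(String id) {
        return -1L; // stand-in for the real terms-dict lookup
    }
}
```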
The contention problem mentioned in #18053 would also become less painful with append-only indices. The version lookup sits inside a synchronized block in InternalEngine#innerIndex. This means threads stay in the synchronized section longer than needed (in cases where versions are not needed).
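(For illustration: a simplified view of that locking pattern, paraphrased rather than excerpted from the 5.0-era InternalEngine; names and the stripe count are invented.)

```java
import java.io.IOException;

// Simplified sketch of version lookup under a per-uid stripe lock.
final class InnerIndexSketch {

    private final Object[] dirtyLocks = new Object[64];

    InnerIndexSketch() {
        for (int i = 0; i < dirtyLocks.length; i++) {
            dirtyLocks[i] = new Object();
        }
    }

    // Lock striping by uid hash, similar in spirit to InternalEngine's dirtyLock.
    private Object dirtyLock(String uid) {
        return dirtyLocks[Math.floorMod(uid.hashCode(), dirtyLocks.length)];
    }

    void innerIndex(String uid, byte[] source) throws IOException {
        synchronized (dirtyLock(uid)) {
            // The version lookup runs while the stripe lock is held, so other
            // writers hashing to the same stripe queue behind a potentially
            // disk-touching operation. For an append-only index the answer is
            // always "not found", making that lock time pure overhead.
            long currentVersion = loadCurrentVersion(uid);
            addToLucene(uid, source, currentVersion);
        }
    }

    private long loadCurrentVersion(String uid) throws IOException {
        return -1L; // stand-in for the per-segment terms-dict lookup
    }

    private void addToLucene(String uid, byte[] source, long version) {
        // stand-in for the actual Lucene indexing call
    }
}
```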
During our tests we disabled version control in Elasticsearch, which brought a significant improvement in indexing speed. Look at the chart below. Indexing speed on the same indices (without removing old data - just swapping in the patched ES build) increased by 50%, purely from removing version control.
My test environment is not very powerful: only 2 machines with 4 CPU cores each. Indices have only one replica.
https://apps.sematext.com/spm-reports/s/ucSD8cuTRr

I would really appreciate any feedback. Am I missing something obvious? Thanks