-
Notifications
You must be signed in to change notification settings - Fork 25.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First step optimizing tsdb doc values codec merging. #125403
base: main
Are you sure you want to change the base?
Conversation
The doc values codec iterates a few times over the doc value instance that needs to be written to disk. In case when merging and index sorting is enabled, this is much more expensive, as each time the doc values instance is iterated an expensive doc id sorting is performed (in order to get the doc ids in order of index sorting). There are several reasons why the doc value instance is iterated multiple times: * To compute stats (num values, number of docs with value) required for writing values to disk. * To write bitset that indicate which documents have a value. (indexed disi, jump table) * To write the actual values to disk. * To write the addresses to disk (in case docs have multiple values) This applies for numeric doc values, but also for the ordinals of sorted (set) doc values. This PR addresses solving the first reason why doc value instance needs to be iterated. This is done only when in case of merging and when the segments to be merged with are also of type es87 doc values, codec version is the same and there are no deletes.
fixed sorted set dv added unit test with index sorting
52f3084
to
65d97e5
Compare
The attached micro benchmark to tests the tsdb doc value codec with force merge suggests the following:
|
This PR adds a more code than I thought I needed, but the good thing is that the 'format 'code the writes to disk didn't need to be changed. |
Running elastic/logs track (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some comments. Thanks @martijnvg.
sumNumDocsWithField += entry.numDocsWithField; | ||
} | ||
|
||
// Documents marked as deleted should be rare. Maybe in the case of noop operation? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check this before getting docValues?
@Override | ||
public void mergeNumericField(FieldInfo mergeFieldInfo, MergeState mergeState) throws IOException { | ||
var result = compatibleWithOptimizedMerge(enableOptimizedMerge, mergeFieldInfo, mergeState, (docValuesProducer) -> { | ||
var numeric = docValuesProducer.getNumeric(mergeFieldInfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we query the NumericEntry directly?
if (docValuesProducer instanceof ES87TSDBDocValuesProducer producer && producer.version == VERSION_CURRENT) {
var entry = producer.numerics.get(mergeFieldInfo.name);
return new DocValuesConsumerUtil.FieldEntry(entry.docsWithFieldOffset, entry.numValues, -1);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I had initially, however the type of producer
isn't ES87TSDBDocValuesProducer
, but is PerFieldDocValuesFormat.FieldsReader
. So I ended up with the current workaround, not happy about it, but I don't see another way.
}; | ||
} | ||
|
||
static NumericDocValues mergeNumericValues(List<NumericDocValuesSub> subs, boolean indexIsSorted) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we copy these from DocValuesConsumer? Is there any chance we can avoid copying this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I don't see another way here. All the logic that we need here is private to DocValuesConsumer
only.
Running tsdb track without this change as baseline and with this change as contender:
Unfortunately the improvement is less visible here. Looks like ~3% less time spent on merging. |
Thanks Nhat for the review! |
…ctly in compatibleWithOptimizedMerge(...) method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks much better. Thanks, Martijn. Would you mind taking a look at the test failures? They seem related.
@@ -1262,6 +1262,15 @@ public long longValue() throws IOException { | |||
} | |||
} | |||
|
|||
abstract static class BaseNumericDocValues extends NumericDocValues { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we no longer need these base classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point we no longer need these base classes. I will remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed: 39dc98f
They are related and I think this will address the bwc failures: 5a2da25 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks Martijn!
Unfortunately it isn't possible to run the release tests due to:
I did confirm that locally the unit tests pass with release build (meaning the feature flag is disabled):
|
The doc values codec iterates a few times over the doc value instance that needs to be written to disk. In case when merging and index sorting is enabled, this is much more expensive, as each time the doc values instance is iterated an expensive doc id sorting is performed (in order to get the doc ids in order of index sorting).
There are several reasons why the doc value instance is iterated multiple times:
This applies for numeric doc values, but also for the ordinals of sorted (set) doc values.
This PR addresses solving the first reason why doc value instance needs to be iterated. This is done only when in case of merging and when the segments to be merged with are also of type es87 doc values, codec version is the same and there are no deletes.