value_count Aggregation optimization #54854


Merged · 2 commits merged into elastic:master on Apr 10, 2020

Conversation

xjtushilei (Contributor)

We found some problems during testing.

Data: 200 million docs, 1 shard, 0 replicas.

| Hits | avg | sum | value_count |
| --- | --- | --- | --- |
| 20k | 38ms | 33ms | 63ms |
| 200k | 127ms | 125ms | 334ms |
| 2 million | 789ms | 729ms | 3.176s |
| 20 million | 4.2s | 3.239s | 22.787s |
| 200 million (100%) | 21s | 22s | 154.917s |

The performance of `avg`, `sum`, and the other metric aggregations is very close, but the performance of `value_count` has always been poor, not even on the same order of magnitude. Intuitively, `value_count` and `sum` are similar operations and should cost about the same, so we dug into the `value_count` aggregation.

The counting principle in Elasticsearch is to traverse the relevant field of every document: if the field holds an ordinary value, the count is incremented by 1; if it holds an array of n values, the count is incremented by n. The problem lies in traversing each document and materializing the field, which turns data on disk into Java objects. We summarize the current problems in Elasticsearch as follows (see the sketch after this list):

- Numbers are cast to strings, and the large number of short-lived strings causes GC pressure
- After a number is converted to a string, sorting and other unnecessary operations are performed on it
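
To make the counting path concrete, here is a minimal, self-contained sketch (not Elasticsearch source; the names and the per-document array data model are illustrative) of counting with and without the string conversion described above:

```java
// Minimal sketch, not Elasticsearch source: value_count conceptually adds 1
// per value, so n for an n-element array. The "before" path converts every
// value to a String first, which is where the overhead and GC pressure come
// from; the "after" path counts values directly.
public class ValueCountSketch {

    // Pre-optimization behavior: every value goes through Long.toString.
    static long countViaStrings(long[][] docs) {
        long count = 0;
        for (long[] values : docs) {             // traverse every document
            for (long v : values) {
                String s = Long.toString(v);     // costly conversion (getChars) + allocation
                count += s.length() > 0 ? 1 : 0; // the string content is never actually used
            }
        }
        return count;
    }

    // Post-optimization behavior: just add the number of values per document.
    static long countDirect(long[][] docs) {
        long count = 0;
        for (long[] values : docs) {
            count += values.length;              // +1 for a single value, +n for an array
        }
        return count;
    }

    public static void main(String[] args) {
        long[][] docs = { { 42 }, { 1, 2, 3 }, {} };
        System.out.println(countViaStrings(docs)); // 4
        System.out.println(countDirect(docs));     // 4
    }
}
```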

Here is the proof of the type conversion overhead.

```java
// Java long-to-String source code (java.lang.Long); getChars is very time-consuming.
public static String toString(long i) {
    int size = stringSize(i);
    if (COMPACT_STRINGS) {
        byte[] buf = new byte[size];
        getChars(i, size, buf);
        return new String(buf, LATIN1);
    } else {
        byte[] buf = new byte[size * 2];
        StringUTF16.getChars(i, size, buf);
        return new String(buf, UTF16);
    }
}
```
| Conversion | average | min | max | sum |
| --- | --- | --- | --- | --- |
| double → long | 32.2ns | 28ns | 0.024ms | 3.22s |
| long → double | 31.9ns | 28ns | 0.036ms | 3.19s |
| long → String | 163.8ns | 93ns | 1921ms | 16.3s |
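
For reference, a rough way to reproduce this kind of measurement. This is an assumed methodology (a plain timing loop; a harness such as JMH would give more trustworthy per-op numbers):

```java
// Rough timing sketch (assumed methodology, not the harness behind the
// table above). Compares long -> double against long -> String conversion.
public class ConversionBench {
    public static void main(String[] args) {
        final int n = 10_000_000;
        long sink = 0; // accumulate results so the JIT cannot eliminate the loops

        long t0 = System.nanoTime();
        for (long i = 0; i < n; i++) {
            sink += (long) (double) i;          // long -> double -> long: cheap, no allocation
        }
        long t1 = System.nanoTime();
        for (long i = 0; i < n; i++) {
            sink += Long.toString(i).length();  // long -> String: allocates a new String each time
        }
        long t2 = System.nanoTime();

        System.out.printf("long->double: %.1f ns/op%n", (t1 - t0) / (double) n);
        System.out.printf("long->String: %.1f ns/op%n", (t2 - t1) / (double) n);
        System.out.println(sink);
    }
}
```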

The profiling heat map in #36752 shows that the `toString` time is particularly serious.

Optimization

Our optimization is actually very simple: handle different types separately instead of uniformly converting everything to strings. We added type detection in `ValueCountAggregator` and special-cased the numeric and geo_point types so that their values are never converted to strings. Because far fewer strings (and string constants) are created, the improvement is very noticeable, as the sketch below illustrates.
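
A simplified sketch of the change (modeled on the `getLeafCollector` diff discussed below; the surrounding class, field declarations, and the geo_point branch are elided):

```java
// Simplified sketch of the ValueCountAggregator change: dispatch on the
// values-source type so numeric fields are counted straight off their
// SortedNumericDocValues instead of going through bytesValues()/Strings.
if (valuesSource instanceof ValuesSource.Numeric) {
    final SortedNumericDocValues values = ((ValuesSource.Numeric) valuesSource).longValues(ctx);
    return new LeafBucketCollectorBase(sub, values) {
        @Override
        public void collect(int doc, long bucket) throws IOException {
            counts = bigArrays.grow(counts, bucket + 1);
            if (values.advanceExact(doc)) {
                counts.increment(bucket, values.docValueCount()); // +n per doc, no Strings
            }
        }
    };
}
// ... an analogous branch handles geo_point; keyword and other types
// still fall through to the original bytesValues()-based collector.
```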

Results

(The last six columns show `value_count` before/after timings per field type.)

| Hits | avg | sum | double (before) | double (after) | keyword (before) | keyword (after) | geo_point (before) | geo_point (after) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20k | 38ms | 33ms | 63ms | 26ms | 30ms | 30ms | 38ms | 15ms |
| 200k | 127ms | 125ms | 334ms | 78ms | 116ms | 99ms | 278ms | 31ms |
| 2 million | 789ms | 729ms | 3.176s | 439ms | 348ms | 386ms | 3.365s | 178ms |
| 20 million | 4.2s | 3.239s | 22.787s | 2.7s | 2.5s | 2.6s | 25.192s | 1.278s |
| 200 million (100%) | 21s | 22s | 154.917s | 18.99s | 19s | 20s | 168.971s | 9.093s |
- The results are more in line with common sense: `value_count` now costs about the same as `avg`, `sum`, etc., or even less. Previously, `value_count` was much slower than `avg` and `sum`, not even on the same order of magnitude at large data volumes.
- For numeric types such as `double` and `long`, performance improves by about 8 to 9 times; for the `geo_point` type, by about 18 to 20 times.

cla-checker-service (bot) commented Apr 7, 2020

💚 CLA has been signed

xjtushilei changed the title from "count Aggregation optimization" to "value_count Aggregation optimization" on Apr 7, 2020
elasticmachine (Collaborator)

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

nik9000 (Member) commented Apr 7, 2020

Nice!

@elasticmachine, ok to test.

@xjtushilei, could you sign the CLA so we can get this in and work on it?

```
@@ -66,6 +66,34 @@ public LeafBucketCollector getLeafCollector(LeafReaderContext ctx,
        return LeafBucketCollector.NO_OP_COLLECTOR;
    }
    final BigArrays bigArrays = context.bigArrays();

    if (valuesSource instanceof ValuesSource.Numeric) {
```
nik9000 (Member)

I think it'd be nice to have a "value counting" member in ValuesSource but I'd be happy to merge this as is and open up a follow up myself to do that. Or you can, if you want @xjtushilei.

nik9000 (Member)

As much as it is annoying to put something just for one agg in ValuesSource, I think it'd be nice to have a something in there just so it is more obvious that we do these sorts of things to the values.

Another option is to use the values source refactor to plug in different count implementations. That'd probably be more in line with the direction we're going now.

nik9000 (Member)

Either way, I'm happy to take this as is and do the twisting around myself in a follow up change and ping you for review.

xjtushilei (Contributor, Author)

> I think it'd be nice to have a "value counting" member in ValuesSource but I'd be happy to merge this as is and open up a follow up myself to do that. Or you can, if you want @xjtushilei.

yes, you can do it.

nik9000 (Member) commented Apr 7, 2020

@xjtushilei, looks like Jenkins hit a compile error. Are you ok to fix it?

xjtushilei (Contributor, Author) commented Apr 8, 2020

> Nice!
>
> @elasticmachine, ok to test.
>
> @xjtushilei, could you sign the CLA so we can get this in and work on it?

I have signed the CLA with the email "[email protected]"; that is the email on my git commit records. Maybe it is not my GitHub account email?

Today, I have added this email to my GitHub account.

xjtushilei (Contributor, Author)

> @xjtushilei, looks like Jenkins hit a compile error. Are you ok to fix it?

I tested it on the 6.x branch with no compilation errors; I have not yet compiled the latest master branch.

When I tried on the master branch, `./gradlew idea` was too slow. Could you fix it?

nik9000 (Member) commented Apr 8, 2020

> When I tried on the master branch, `./gradlew idea` was too slow. Could you fix it?

I think that `./gradlew idea` has been turned off in master. I'm probably not the right person to help because I use Eclipse, but CONTRIBUTING.md says to import it as a gradle project. If you don't want to go through with all that, I'd be happy to finish the PR up for you, but if you want to keep contributing it is probably worth it to get gradle resolved.

> Today, I have added this email to my GitHub account.

One of those things did the trick. I never remember all of the rules for the CLA system. Thanks for making those changes! It is all good on my side now.

nik9000 (Member) commented Apr 9, 2020

@xjtushilei, I've pushed a fix for the failing test. That test actually shows that this is a breaking change, but in a super esoteric way: if you run with a script, the input will change from a String to the appropriate type. In the case of the issue, it comes back as some kind of number. I'll still backport it to 7.8.0 and add a breaking change note. I don't believe it is likely to break folks, so I'm ok making a breaking change in a minor for this.

xjtushilei (Contributor, Author)

> @xjtushilei, I've pushed a fix for the failing test. That test actually shows that this is a breaking change, but in a super esoteric way: if you run with a script, the input will change from a String to the appropriate type. In the case of the issue, it comes back as some kind of number. I'll still backport it to 7.8.0 and add a breaking change note. I don't believe it is likely to break folks, so I'm ok making a breaking change in a minor for this.

That sounds cool.

nik9000 merged commit 8e8ce96 into elastic:master Apr 10, 2020
nik9000 (Member) commented Apr 10, 2020

OK! Jenkins has approved this so I've merged and am backporting. I took some liberties turning your PR description into a commit message. You should see a backport PR sometime in the next hour or so.

nik9000 pushed a commit to nik9000/elasticsearch that referenced this pull request Apr 10, 2020
nik9000 added a commit that referenced this pull request Apr 10, 2020
jakelandis removed the v8.0.0 label Jul 26, 2021