value_count Aggregation optimization (backport of #54854) #55076

nik9000 · 2020-04-10T16:30:57Z

We found a performance problem during testing.

Data: 200 million docs, 1 shard, 0 replicas

hits | avg | sum | value_count |
----------- | ------- | ------- | ----------- |
20,000 | .038s | .033s | .063s |
200,000 | .127s | .125s | .334s |
2,000,000 | .789s | .729s | 3.176s |
20,000,000 | 4.200s | 3.239s | 22.787s |
200,000,000 | 21.000s | 22.000s | 154.917s |

The performance of `avg`, `sum`, and similar aggregations is very close,
but `value_count` has always been much slower: about 7 times slower than
`sum` at 20,000,000 and 200,000,000 hits. Intuitively, `value_count` and
`sum` are similar operations and should take about the same time, so we
looked into how the `value_count` aggregation works.

In Elasticsearch, counting works by iterating over the field values of
each document. If the field holds a single value, the count increases
by 1; if it holds an array, the count increases by n. The cost lies in
loading the field of every document, which turns the on-disk data into
Java objects. We see two problems with the current implementation:

- Numbers are converted to strings, which costs time and causes GC
  pressure from the large number of strings.
- After the numbers are converted to strings, sorting and other
  unnecessary operations are performed.
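The two problems above can be illustrated with a minimal sketch. It is not Elasticsearch code: the class `ValueCountPrinciple` and its methods are hypothetical, and each inner array simply stands in for the values one document has for the counted field. Both methods return the same count, but the first one allocates and sorts strings it never needs.

```java
// Hypothetical sketch, not Elasticsearch code. Each inner array holds the
// values that one document has for the counted field.
final class ValueCountPrinciple {

    // Models the slow path: every value becomes a String and the Strings are
    // sorted, although only their number is needed.
    static long countViaStrings(long[][] docs) {
        long count = 0;
        for (long[] values : docs) {
            String[] asStrings = new String[values.length];
            for (int i = 0; i < values.length; i++) {
                asStrings[i] = Long.toString(values[i]); // one allocation per value
            }
            java.util.Arrays.sort(asStrings); // unnecessary for counting
            count += asStrings.length;
        }
        return count;
    }

    // Models the direct path: only the number of values per document matters.
    static long countDirect(long[][] docs) {
        long count = 0;
        for (long[] values : docs) {
            count += values.length; // 0 if missing, 1 if single, n for an array
        }
        return count;
    }
}
```

For a document set such as `{ {7}, {1, 2, 3}, {} }`, both methods count 4 values; only the amount of work differs.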

Here is the proof of type conversion overhead.

```
// JDK source of Long.toString(long); getChars is very time-consuming.
public static String toString(long i) {
    int size = stringSize(i);
    if (COMPACT_STRINGS) {
        byte[] buf = new byte[size];
        getChars(i, size, buf);
        return new String(buf, LATIN1);
    } else {
        byte[] buf = new byte[size * 2];
        StringUTF16.getChars(i, size, buf);
        return new String(buf, UTF16);
    }
}
```

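test type | average | min | max | sum
------------ | ------- | ---- | ----------- | -------
double->long | 32.2ns | 28ns | 0.024ms | 3.22s
long->double | 31.9ns | 28ns | 0.036ms | 3.19s
long->String | 163.8ns | 93ns | 1921 ms | 16.3s

The long->String conversion is particularly expensive: on average it
takes about five times as long as the numeric conversions.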
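The PR does not show how these timings were collected. A rough way to observe the same effect is a naive timing loop like the sketch below. The class `ConversionTimingSketch` and its methods are assumptions made for illustration, and a loop like this is far less reliable than a proper benchmark harness such as JMH, so absolute numbers will differ from the table.

```java
// Hypothetical timing sketch; not the harness used for the table above.
final class ConversionTimingSketch {

    // Accumulating into a field keeps the JIT from discarding the
    // conversions as dead code.
    static long sink;

    // Returns elapsed nanoseconds for `iterations` long->String conversions.
    static long timeLongToString(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += Long.toString(1_000_000_007L * i).length();
        }
        return System.nanoTime() - start;
    }

    // Returns elapsed nanoseconds for `iterations` long->double conversions.
    static long timeLongToDouble(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            double d = (double) (1_000_000_007L * i);
            sink += (long) (d / 2.0);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        // Warm-up runs so that both loops are JIT-compiled before measuring.
        timeLongToString(n);
        timeLongToDouble(n);
        System.out.printf("long->String: %.1f ns/op%n", timeLongToString(n) / (double) n);
        System.out.printf("long->double: %.1f ns/op%n", timeLongToDouble(n) / (double) n);
    }
}
```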

Our optimization is actually very simple: handle the different types
separately instead of converting everything to strings and processing
it uniformly. We added type detection to ValueCountAggregator and
special-cased the numeric and geo_point types so that they skip the
type conversion. Because far fewer strings are created, the improvement
is substantial.
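The idea of handling types separately can be sketched as follows. This is only an illustration under stated assumptions, not the actual ValueCountAggregator change: `TypedValueCountSketch`, `FieldKind`, and `DocValues` are hypothetical names, and the real aggregator works against Elasticsearch's own values-source classes.

```java
// Hypothetical sketch of per-type counting; not the real ValueCountAggregator.
final class TypedValueCountSketch {

    enum FieldKind { NUMERIC, GEO_POINT, OTHER }

    // Hypothetical view of the values one document has for a field.
    interface DocValues {
        int valueCount();     // cheap: how many values the document has
        String[] asStrings(); // expensive: materializes every value as a String
    }

    static long count(FieldKind kind, DocValues[] docs) {
        long total = 0;
        for (DocValues doc : docs) {
            switch (kind) {
                case NUMERIC:
                case GEO_POINT:
                    // Special-cased types: count without any conversion.
                    total += doc.valueCount();
                    break;
                default:
                    // Everything else keeps the string-based path.
                    total += doc.asStrings().length;
                    break;
            }
        }
        return total;
    }
}
```

The point of the dispatch is that `asStrings()` is never called for numeric or geo_point fields, so no strings are allocated for them.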

hits | avg | sum | value_count double (before) | value_count double (after) | value_count keyword (before) | value_count keyword (after) | value_count geo_point (before) | value_count geo_point (after) |
----------- | ------- | ------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
20,000 | .038s | .033s | .063s | .026s | .030s | .030s | .038s | .015s |
200,000 | .127s | .125s | .334s | .078s | .116s | .099s | .278s | .031s |
2,000,000 | .789s | .729s | 3.176s | .439s | .348s | .386s | 3.365s | .178s |
20,000,000 | 4.200s | 3.239s | 22.787s | 2.700s | 2.500s | 2.600s | 25.192s | 1.278s |
200,000,000 | 21.000s | 22.000s | 154.917s | 18.990s | 19.000s | 20.000s | 168.971s | 9.093s |

- The results are now more in line with expectations: `value_count`
  takes about as long as `avg` and `sum`, or even less. Previously,
  `value_count` took much longer than `avg` and `sum`, by a factor of
  about 7 when the amount of data was large.
- For numeric types such as `double` and `long`, performance improves
  by about 8 to 9 times at the largest hit counts; for the `geo_point`
  type, it improves by 18 to 20 times.
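The improvement factors can be checked against the table by dividing each "before" time by the matching "after" time. The helper below is a small hypothetical sketch (`SpeedupCheck` is not part of the PR); the inputs are the 20,000,000 and 200,000,000 hit rows from the table above.

```java
// Hypothetical helper for checking the speedup factors from the table.
final class SpeedupCheck {

    // Speedup is the time before the change divided by the time after it.
    static double speedup(double beforeSeconds, double afterSeconds) {
        return beforeSeconds / afterSeconds;
    }

    public static void main(String[] args) {
        System.out.printf("double, 20,000,000 hits:     %.1fx%n", speedup(22.787, 2.700));
        System.out.printf("double, 200,000,000 hits:    %.1fx%n", speedup(154.917, 18.990));
        System.out.printf("geo_point, 20,000,000 hits:  %.1fx%n", speedup(25.192, 1.278));
        System.out.printf("geo_point, 200,000,000 hits: %.1fx%n", speedup(168.971, 9.093));
    }
}
```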

nik9000 · 2020-04-10T17:18:14Z

Hmm - github's merging this makes it look like I wrote it. Lies! Master has it right.

nik9000 added backport v7.8.0 labels Apr 10, 2020

nik9000 changed the title ~~value_count Aggregation optimization (#54854)~~ value_count Aggregation optimization (backport of #54854) Apr 10, 2020

nik9000 merged commit b99a50b into elastic:7.x Apr 10, 2020
