value_count Aggregation optimization (backport of #54854) #55076


Merged

nik9000 merged 1 commit into elastic:7.x on Apr 10, 2020
Conversation

@nik9000 nik9000 (Member) commented Apr 10, 2020

We found some problems during the test.

Data: 200 million docs, 1 shard, 0 replicas

    hits    |   avg   |   sum   | value_count |
----------- | ------- | ------- | ----------- |
     20,000 |   .038s |   .033s |       .063s |
    200,000 |   .127s |   .125s |       .334s |
  2,000,000 |   .789s |   .729s |      3.176s |
 20,000,000 |  4.200s |  3.239s |     22.787s |
200,000,000 | 21.000s | 22.000s |    154.917s |

The performance of `avg`, `sum`, and other statistics aggregations is very
close, but the performance of `value_count` has always been poor, to the
point of not even being on the same order of magnitude. Intuitively,
`value_count` and `sum` are similar operations and should take roughly the
same time, so we dug into the `value_count` aggregation.

The counting principle in Elasticsearch is to traverse the relevant field of
each document: if the field holds a single value, the count is increased by
1; if it holds an array of n values, the count is increased by n. The cost
lies in reading each document's field off disk and turning it into a Java
object (a minimal sketch of the counting loop follows the list below). We
summarize the current problems as:

- The overhead of casting numbers to strings, and the GC pressure caused by
  the large number of short-lived strings this creates
- After a number is converted to a string, sorting and other unnecessary
  operations are performed on it
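
The counting loop itself is cheap when it runs directly on doc values; the
problems above come from the extra string conversion layered on top of it. A
minimal, self-contained sketch of such a loop using Lucene's
`SortedNumericDocValues` API (illustrative only, not the actual Elasticsearch
aggregator code; the field name is a placeholder):

```java
import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

class ValueCountSketch {
    // Count every value of a numeric field in one segment: +1 per single
    // value, +n per array, with no per-value object materialization.
    static long countValues(LeafReader reader, String field) throws IOException {
        SortedNumericDocValues values = DocValues.getSortedNumeric(reader, field);
        long count = 0;
        for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) {
            count += values.docValueCount();
        }
        return count;
    }
}
```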

Here is a demonstration of the type conversion overhead.

```java
// Java's long-to-String source code; getChars is very time-consuming.
public static String toString(long i) {
    int size = stringSize(i);
    if (COMPACT_STRINGS) {
        byte[] buf = new byte[size];
        getChars(i, size, buf);
        return new String(buf, LATIN1);
    } else {
        byte[] buf = new byte[size * 2];
        StringUTF16.getChars(i, size, buf);
        return new String(buf, UTF16);
    }
}
```

  test type  | average |  min |     max | sum
------------ | ------- | ---- | ------- | -----
double->long |  32.2ns | 28ns | 0.024ms | 3.22s
long->double |  31.9ns | 28ns | 0.036ms | 3.19s
long->String | 163.8ns | 93ns |  1921ms | 16.3s

The overhead of the long->String conversion is particularly serious.
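
The gap is easy to reproduce. A rough, self-contained micro-benchmark along
these lines (not the original test harness; absolute numbers vary by JVM and
hardware, and a careful measurement would use JMH):

```java
import java.util.concurrent.ThreadLocalRandom;

public class ConversionBench {
    public static void main(String[] args) {
        long[] data = ThreadLocalRandom.current().longs(10_000_000).toArray();

        // long -> double: a single widening conversion per value.
        long t0 = System.nanoTime();
        double dSink = 0;
        for (long v : data) {
            dSink += (double) v;
        }
        long t1 = System.nanoTime();

        // long -> String: allocates a new String (and its backing array)
        // for every value, which also drives up GC pressure.
        int sSink = 0;
        for (long v : data) {
            sSink += Long.toString(v).length();
        }
        long t2 = System.nanoTime();

        System.out.printf("long->double: %d ms (sink=%s)%n", (t1 - t0) / 1_000_000, dSink);
        System.out.printf("long->String: %d ms (sink=%d)%n", (t2 - t1) / 1_000_000, sSink);
    }
}
```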

Our optimization is actually very simple: handle each type separately instead
of uniformly converting everything to a string. We added type identification
in ValueCountAggregator and special-cased the numeric and geo_point types so
they skip the string conversion entirely. Because far fewer strings are
created, the improvement is very noticeable; a sketch of the dispatch idea
follows, and then the before/after numbers.
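
As an illustrative, self-contained sketch of that dispatch (hypothetical
names throughout; the real code wires this into `ValueCountAggregator` and
Lucene's doc-values interfaces):

```java
import java.io.IOException;

// Hypothetical stand-in for the real doc-values interfaces: each method
// returns how many values a document holds for the field.
interface DocValuesView {
    long numericCount(int doc) throws IOException;   // read longs/doubles directly
    long geoPointCount(int doc) throws IOException;  // read encoded points directly
    long bytesCount(int doc) throws IOException;     // string path, still used for keyword
}

final class PerTypeCounting {
    enum SourceType { NUMERIC, GEO_POINT, BYTES }

    interface PerDocCounter {
        long count(int doc) throws IOException;
    }

    // Pick a counting strategy once, outside the hot loop, so numeric and
    // geo_point fields never pass through a string representation per value.
    static PerDocCounter forType(SourceType type, DocValuesView values) {
        switch (type) {
            case NUMERIC:   return values::numericCount;
            case GEO_POINT: return values::geoPointCount;
            default:        return values::bytesCount;
        }
    }
}
```

The before/after numbers: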

    hits    |   avg   |   sum   | value_count | value_count | value_count | value_count | value_count | value_count |
            |         |         |    double   |    double   |   keyword   |   keyword   |  geo_point  |  geo_point  |
            |         |         |   before    |    after    |   before    |    after    |   before    |    after    |
----------- | ------- | ------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
     20,000 |   .038s |   .033s |       .063s |       .026s |       .030s |       .030s |       .038s |       .015s |
    200,000 |   .127s |   .125s |       .334s |       .078s |       .116s |       .099s |       .278s |       .031s |
  2,000,000 |   .789s |   .729s |      3.176s |       .439s |       .348s |       .386s |      3.365s |       .178s |
 20,000,000 |  4.200s |  3.239s |     22.787s |      2.700s |      2.500s |      2.600s |     25.192s |      1.278s |
200,000,000 | 21.000s | 22.000s |    154.917s |     18.990s |     19.000s |     20.000s |    168.971s |      9.093s |

- The results are more in line with common sense: `value_count` is now about
  the same as `avg`, `sum`, etc., or even faster. Previously, `value_count`
  was much slower than `avg` and `sum`, not even on the same order of
  magnitude once the amount of data was large.
- For numeric types such as `double` and `long`, performance improves by
  about 8 to 9 times; for the `geo_point` type, by about 18 to 20 times.
@nik9000 nik9000 changed the title value_count Aggregation optimization (#54854) value_count Aggregation optimization (backport of #54854) Apr 10, 2020
@nik9000 nik9000 merged commit b99a50b into elastic:7.x Apr 10, 2020
@nik9000 nik9000 (Member, Author) commented Apr 10, 2020

Hmm - github's merging this makes it look like I wrote it. Lies! Master has it right.
