value_count Aggregation optimization (elastic#54854)

xjtushilei · nik9000 · commit b1afca1cdf27 · 2020-04-10T12:28:48.000-04:00
We found some problems during the test.

Data: 200Million docs, 1 shard, 0 replica

    hits    |   avg   |   sum   | value_count |
----------- | ------- | ------- | ----------- |
     20,000 |   .038s |   .033s |       .063s |
    200,000 |   .127s |   .125s |       .334s |
  2,000,000 |   .789s |   .729s |      3.176s |
 20,000,000 |  4.200s |  3.239s |     22.787s |
200,000,000 | 21.000s | 22.000s |    154.917s |

The performance of `avg`, `sum` and other is very close when performing
statistics, but the performance of `value_count` has always been poor,
even not on an order of magnitude. Based on some common-sense knowledge,
we think that `value_count` and sum are similar operations, and the time
consumed should be the same. Therefore, we have discussed the agg
of `value_count`.

The principle of counting in es is to traverse the field of each
document. If the field is an ordinary value, the count value is
increased by 1. If it is an array type, the count value is increased
by n. However, the problem lies in traversing each document and taking
out the field, which changes from disk to an object in the Java
language. We summarize its current problems with Elasticsearch as:

- Number cast to string overhead, and GC problems caused by a large
  number of strings
- After the number type is converted to string, sorting and other
  unnecessary operations are performed

Here is the proof of type conversion overhead.

```
// Java long to string source code, getChars is very time-consuming.
public static String toString(long i) {
        int size = stringSize(i);
        if (COMPACT_STRINGS) {
            byte[] buf = new byte[size];
            getChars(i, size, buf);
            return new String(buf, LATIN1);
        } else {
            byte[] buf = new byte[size * 2];
            StringUTF16.getChars(i, size, buf);
            return new String(buf, UTF16);
        }
}
```

  test type  | average |  min |     max     |   sum
------------ | ------- | ---- | ----------- | -------
double-&gt;long |  32.2ns | 28ns |     0.024ms |  3.22s
long-&gt;double |  31.9ns | 28ns |     0.036ms |  3.19s
long-&gt;String | 163.8ns | 93ns |  1921    ms | 16.3s

particularly serious.

Our optimization code is actually very simple. It is to manage different
types separately, instead of uniformly converting to string unified
processing. We added type identification in ValueCountAggregator, and
made special treatment for number and geopoint types to cancel their
type conversion. Because the string type is reduced and the string
constant is reduced, the improvement effect is very obvious.

    hits    |   avg   |   sum   | value_count | value_count | value_count | value_count | value_count | value_count |
            |         |         |    double   |    double   |   keyword   |   keyword   |  geo_point  |  geo_point  |
            |         |         |   before    |    after    |   before    |    after    |   before    |    after    |
----------- | ------- | ------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
     20,000 |     38s |   .033s |       .063s |       .026s |       .030s |       .030s |       .038s |       .015s |
    200,000 |    127s |   .125s |       .334s |       .078s |       .116s |       .099s |       .278s |       .031s |
  2,000,000 |    789s |   .729s |      3.176s |       .439s |       .348s |       .386s |      3.365s |       .178s |
 20,000,000 |  4.200s |  3.239s |     22.787s |      2.700s |      2.500s |      2.600s |     25.192s |      1.278s |
200,000,000 | 21.000s | 22.000s |    154.917s |     18.990s |     19.000s |     20.000s |    168.971s |      9.093s |

- The results are more in line with common sense. `value_count` is about
  the same as `avg`, `sum`, etc., or even lower than these. Previously,
  `value_count` was much larger than avg and sum, and it was not even an
  order of magnitude when the amount of data was large.
- When calculating numeric types such as `double` and `long`, the
  performance is improved by about 8 to 9 times; when calculating the
  `geo_point` type, the performance is improved by 18 to 20 times.
diff --git a/docs/reference/release-notes.asciidoc b/docs/reference/release-notes.asciidoc
@@ -6,6 +6,7 @@
 
 This section summarizes the changes in each release.
 
+* <<release-notes-7.8.0>>
 * <<release-notes-7.7.0>>
 * <<release-notes-7.6.2>>
 * <<release-notes-7.6.1>>
@@ -32,6 +33,7 @@ This section summarizes the changes in each release.
 
 --
 
+include::release-notes/7.8.asciidoc[]
 include::release-notes/7.7.asciidoc[]
 include::release-notes/7.6.asciidoc[]
 include::release-notes/7.5.asciidoc[]
diff --git a/docs/reference/release-notes/7.8.asciidoc b/docs/reference/release-notes/7.8.asciidoc
@@ -0,0 +1,14 @@
+[[release-notes-7.8.0]]
+== {es} version 7.8.0
+
+coming[7.8.0]
+
+[[breaking-7.8.0]]
+[float]
+=== Breaking changes
+
+Search::
+* Scripts used in `value_count` will now receive a number if they are counting
+  a numeric field and a `GeoPoint` if they are counting a `geo_point` fields.
+  They used to always receive the `String` representation of those values.
+  {pull}54854[#54854]
diff --git a/server/src/main/java/org/elasticsearch/search/aggregations/metrics/ValueCountAggregator.java b/server/src/main/java/org/elasticsearch/search/aggregations/metrics/ValueCountAggregator.java
@@ -19,9 +19,11 @@
 package org.elasticsearch.search.aggregations.metrics;
 
 import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.SortedNumericDocValues;
 import org.elasticsearch.common.lease.Releasables;
 import org.elasticsearch.common.util.BigArrays;
 import org.elasticsearch.common.util.LongArray;
+import org.elasticsearch.index.fielddata.MultiGeoPointValues;
 import org.elasticsearch.index.fielddata.SortedBinaryDocValues;
 import org.elasticsearch.search.aggregations.Aggregator;
 import org.elasticsearch.search.aggregations.InternalAggregation;
@@ -62,6 +64,34 @@ public LeafBucketCollector getLeafCollector(LeafReaderContext ctx,
             return LeafBucketCollector.NO_OP_COLLECTOR;
         }
         final BigArrays bigArrays = context.bigArrays();
+
+        if (valuesSource instanceof ValuesSource.Numeric) {
+            final SortedNumericDocValues values = ((ValuesSource.Numeric)valuesSource).longValues(ctx);
+            return new LeafBucketCollectorBase(sub, values) {
+
+                @Override
+                public void collect(int doc, long bucket) throws IOException {
+                    counts = bigArrays.grow(counts, bucket + 1);
+                    if (values.advanceExact(doc)) {
+                        counts.increment(bucket, values.docValueCount());
+                    }
+                }
+            };
+        }
+        if (valuesSource instanceof ValuesSource.Bytes.GeoPoint) {
+            MultiGeoPointValues values = ((ValuesSource.GeoPoint)valuesSource).geoPointValues(ctx);
+            return new LeafBucketCollectorBase(sub, null) {
+
+                @Override
+                public void collect(int doc, long bucket) throws IOException {
+                    counts = bigArrays.grow(counts, bucket + 1);
+                    if (values.advanceExact(doc)) {
+                        counts.increment(bucket, values.docValueCount());
+                    }
+                }
+            };
+        }
+        // The following is default collector. Including the keyword FieldType
         final SortedBinaryDocValues values = valuesSource.bytesValues(ctx);
         return new LeafBucketCollectorBase(sub, values) {
 
diff --git a/server/src/test/java/org/elasticsearch/search/aggregations/metrics/ValueCountAggregatorTests.java b/server/src/test/java/org/elasticsearch/search/aggregations/metrics/ValueCountAggregatorTests.java
@@ -20,10 +20,14 @@
 package org.elasticsearch.search.aggregations.metrics;
 
 import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.DoubleDocValuesField;
 import org.apache.lucene.document.IntPoint;
+import org.apache.lucene.document.LatLonDocValuesField;
 import org.apache.lucene.document.NumericDocValuesField;
 import org.apache.lucene.document.SortedDocValuesField;
 import org.apache.lucene.document.SortedNumericDocValuesField;
+import org.apache.lucene.document.SortedSetDocValuesField;
 import org.apache.lucene.index.DirectoryReader;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.RandomIndexWriter;
@@ -35,6 +39,7 @@
 import org.apache.lucene.util.BytesRef;
 import org.elasticsearch.common.CheckedConsumer;
 import org.elasticsearch.common.geo.GeoPoint;
+import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.index.mapper.BooleanFieldMapper;
 import org.elasticsearch.index.mapper.DateFieldMapper;
 import org.elasticsearch.index.mapper.GeoPointFieldMapper;
@@ -44,22 +49,103 @@
 import org.elasticsearch.index.mapper.NumberFieldMapper;
 import org.elasticsearch.index.mapper.RangeFieldMapper;
 import org.elasticsearch.index.mapper.RangeType;
+import org.elasticsearch.script.MockScriptEngine;
+import org.elasticsearch.script.Script;
+import org.elasticsearch.script.ScriptEngine;
+import org.elasticsearch.script.ScriptModule;
+import org.elasticsearch.script.ScriptService;
+import org.elasticsearch.script.ScriptType;
+import org.elasticsearch.search.aggregations.AggregationBuilder;
 import org.elasticsearch.search.aggregations.AggregatorTestCase;
 import org.elasticsearch.search.aggregations.support.AggregationInspectionHelper;
+import org.elasticsearch.search.aggregations.support.CoreValuesSourceType;
 import org.elasticsearch.search.aggregations.support.ValueType;
+import org.elasticsearch.search.aggregations.support.ValuesSourceType;
 
 import java.io.IOException;
 import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
 import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
 import java.util.Set;
 import java.util.function.Consumer;
+import java.util.function.Function;
 
 import static java.util.Collections.singleton;
 
 public class ValueCountAggregatorTests extends AggregatorTestCase {
 
     private static final String FIELD_NAME = "field";
 
+    private static final String STRING_VALUE_SCRIPT = "string_value";
+    private static final String NUMBER_VALUE_SCRIPT = "number_value";
+    private static final String SINGLE_SCRIPT = "single";
+
+    @Override
+    protected AggregationBuilder createAggBuilderForTypeTest(MappedFieldType fieldType, String fieldName) {
+        return new ValueCountAggregationBuilder("foo", null).field(fieldName);
+    }
+
+    @Override
+    protected List<ValuesSourceType> getSupportedValuesSourceTypes() {
+        return Arrays.asList(
+            CoreValuesSourceType.NUMERIC,
+            CoreValuesSourceType.BYTES,
+            CoreValuesSourceType.GEOPOINT,
+            CoreValuesSourceType.RANGE
+        );
+    }
+
+    @Override
+    protected ScriptService getMockScriptService() {
+        Map<String, Function<Map<String, Object>, Object>> scripts = new HashMap<>();
+
+        scripts.put(STRING_VALUE_SCRIPT, vars -> (Double.valueOf((String) vars.get("_value")) + 1));
+        scripts.put(NUMBER_VALUE_SCRIPT, vars -> (((Number) vars.get("_value")).doubleValue() + 1));
+        scripts.put(SINGLE_SCRIPT, vars -> 1);
+
+        MockScriptEngine scriptEngine = new MockScriptEngine(MockScriptEngine.NAME,
+            scripts,
+            Collections.emptyMap());
+        Map<String, ScriptEngine> engines = Collections.singletonMap(scriptEngine.getType(), scriptEngine);
+
+        return new ScriptService(Settings.EMPTY, engines, ScriptModule.CORE_CONTEXTS);
+    }
+
+
+    public void testGeoField() throws IOException {
+        testCase(new MatchAllDocsQuery(), ValueType.GEOPOINT, iw -> {
+            for (int i = 0; i < 10; i++) {
+                Document document = new Document();
+                document.add(new LatLonDocValuesField("field", 10, 10));
+                iw.addDocument(document);
+            }
+        }, count -> assertEquals(10L, count.getValue()));
+    }
+
+    public void testDoubleField() throws IOException {
+        testCase(new MatchAllDocsQuery(), ValueType.DOUBLE, iw -> {
+            for (int i = 0; i < 15; i++) {
+                Document document = new Document();
+                document.add(new DoubleDocValuesField(FIELD_NAME, 23D));
+                iw.addDocument(document);
+            }
+        }, count -> assertEquals(15L, count.getValue()));
+    }
+
+    public void testKeyWordField() throws IOException {
+        testCase(new MatchAllDocsQuery(), ValueType.STRING, iw -> {
+            for (int i = 0; i < 20; i++) {
+                Document document = new Document();
+                document.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("stringValue")));
+                document.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("string11Value")));
+                iw.addDocument(document);
+            }
+        }, count -> assertEquals(40L, count.getValue()));
+    }
+
     public void testNoDocs() throws IOException {
         for (ValueType valueType : ValueType.values()) {
             testCase(new MatchAllDocsQuery(), valueType, iw -> {
@@ -189,6 +275,105 @@ public void testRangeFieldValues() throws IOException {
         }, fieldType);
     }
 
+    public void testValueScriptNumber() throws IOException {
+        ValueCountAggregationBuilder aggregationBuilder = new ValueCountAggregationBuilder("name", null)
+            .field(FIELD_NAME)
+            .script(new Script(ScriptType.INLINE, MockScriptEngine.NAME, NUMBER_VALUE_SCRIPT, Collections.emptyMap()));
+
+        MappedFieldType fieldType = createMappedFieldType(ValueType.NUMERIC);
+        fieldType.setName(FIELD_NAME);
+        fieldType.setHasDocValues(true);
+
+        testCase(aggregationBuilder, new MatchAllDocsQuery(), iw -> {
+            iw.addDocument(singleton(new NumericDocValuesField(FIELD_NAME, 7)));
+            iw.addDocument(singleton(new NumericDocValuesField(FIELD_NAME, 8)));
+            iw.addDocument(singleton(new NumericDocValuesField(FIELD_NAME, 9)));
+        }, card -> {
+            assertEquals(3, card.getValue(), 0);
+            assertTrue(AggregationInspectionHelper.hasValue(card));
+        }, fieldType);
+    }
+
+    public void testSingleScriptNumber() throws IOException {
+        ValueCountAggregationBuilder aggregationBuilder = new ValueCountAggregationBuilder("name", null)
+            .field(FIELD_NAME);
+
+        MappedFieldType fieldType = createMappedFieldType(ValueType.NUMERIC);
+        fieldType.setName(FIELD_NAME);
+        fieldType.setHasDocValues(true);
+
+        testCase(aggregationBuilder, new MatchAllDocsQuery(), iw -> {
+            Document doc = new Document();
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 7));
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 7));
+            iw.addDocument(doc);
+
+            doc = new Document();
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 8));
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 8));
+            iw.addDocument(doc);
+
+            doc = new Document();
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 1));
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 1));
+            iw.addDocument(doc);
+        }, card -> {
+            // note: this is 6, even though the script returns a single value.  ValueCount does not de-dedupe
+            assertEquals(6, card.getValue(), 0);
+            assertTrue(AggregationInspectionHelper.hasValue(card));
+        }, fieldType);
+    }
+
+    public void testValueScriptString() throws IOException {
+        ValueCountAggregationBuilder aggregationBuilder = new ValueCountAggregationBuilder("name", null)
+            .field(FIELD_NAME)
+            .script(new Script(ScriptType.INLINE, MockScriptEngine.NAME, STRING_VALUE_SCRIPT, Collections.emptyMap()));
+
+        MappedFieldType fieldType = createMappedFieldType(ValueType.STRING);
+        fieldType.setName(FIELD_NAME);
+        fieldType.setHasDocValues(true);
+
+        testCase(aggregationBuilder, new MatchAllDocsQuery(), iw -> {
+            iw.addDocument(singleton(new SortedDocValuesField(FIELD_NAME, new BytesRef("1"))));
+            iw.addDocument(singleton(new SortedDocValuesField(FIELD_NAME, new BytesRef("2"))));
+            iw.addDocument(singleton(new SortedDocValuesField(FIELD_NAME, new BytesRef("3"))));
+        }, card -> {
+            assertEquals(3, card.getValue(), 0);
+            assertTrue(AggregationInspectionHelper.hasValue(card));
+        }, fieldType);
+    }
+
+    public void testSingleScriptString() throws IOException {
+        ValueCountAggregationBuilder aggregationBuilder = new ValueCountAggregationBuilder("name", null)
+            .field(FIELD_NAME);
+
+        MappedFieldType fieldType = createMappedFieldType(ValueType.STRING);
+        fieldType.setName(FIELD_NAME);
+        fieldType.setHasDocValues(true);
+
+        testCase(aggregationBuilder, new MatchAllDocsQuery(), iw -> {
+            Document doc = new Document();
+            // Note: unlike numerics, lucene de-dupes strings so we increment here
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("1")));
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("2")));
+            iw.addDocument(doc);
+
+            doc = new Document();
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("3")));
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("4")));
+            iw.addDocument(doc);
+
+            doc = new Document();
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("5")));
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("6")));
+            iw.addDocument(doc);
+        }, card -> {
+            // note: this is 6, even though the script returns a single value.  ValueCount does not de-dedupe
+            assertEquals(6, card.getValue(), 0);
+            assertTrue(AggregationInspectionHelper.hasValue(card));
+        }, fieldType);
+    }
+
     private void testCase(Query query,
                           ValueType valueType,
                           CheckedConsumer<RandomIndexWriter, IOException> indexer,