Commit b1afca1

xjtushilei authored and nik9000 committed
value_count Aggregation optimization (elastic#54854)
We found some problems during testing.

Data: 200 million docs, 1 shard, 0 replicas

| hits        | avg     | sum     | value_count |
| ----------- | ------- | ------- | ----------- |
| 20,000      | .038s   | .033s   | .063s       |
| 200,000     | .127s   | .125s   | .334s       |
| 2,000,000   | .789s   | .729s   | 3.176s      |
| 20,000,000  | 4.200s  | 3.239s  | 22.787s     |
| 200,000,000 | 21.000s | 22.000s | 154.917s    |

`avg`, `sum` and the other metrics perform very similarly, but `value_count` is consistently much slower: several times slower once the hit count reaches the millions. Since `value_count` and `sum` are similar operations, we expected them to take about the same time, so we investigated the `value_count` aggregation.

Counting in ES works by traversing the field of each document. If the field holds a single value, the count is increased by 1. If it holds an array, the count is increased by n. The problem lies in how each document's field is read: the value is loaded from disk and materialized as a Java object.

We summarize the current problems in Elasticsearch as:

- Numbers are cast to strings, which adds conversion overhead and causes GC pressure from the large number of strings.
- After a number is converted to a string, sorting and other unnecessary operations are performed.

Here is the evidence for the type conversion overhead.

```
// Java long to string source code, getChars is very time-consuming.
public static String toString(long i) {
    int size = stringSize(i);
    if (COMPACT_STRINGS) {
        byte[] buf = new byte[size];
        getChars(i, size, buf);
        return new String(buf, LATIN1);
    } else {
        byte[] buf = new byte[size * 2];
        StringUTF16.getChars(i, size, buf);
        return new String(buf, UTF16);
    }
}
```

| test type    | average | min  | max     | sum   |
| ------------ | ------- | ---- | ------- | ----- |
| double->long | 32.2ns  | 28ns | 0.024ms | 3.22s |
| long->double | 31.9ns  | 28ns | 0.036ms | 3.19s |
| long->String | 163.8ns | 93ns | 1921 ms | 16.3s |

The `long->String` overhead is particularly serious.
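
The conversion cost described above can be approximated with a naive timing loop. This is only a rough sketch and not the benchmark that produced the table: the class name `ConversionCostSketch`, the input size, and the use of `System.nanoTime()` are all assumptions, and a loop like this is sensitive to JIT warm-up and GC noise, so the printed numbers are indicative at best.

```java
public class ConversionCostSketch {

    /** Converts every value to double and folds the results so the loop cannot be optimized away. */
    static double sumAsDouble(long[] values) {
        double sink = 0;
        for (long v : values) {
            sink += (double) v;
        }
        return sink;
    }

    /** Converts every value to a String (one allocation per value) and folds the string lengths. */
    static long sumStringLengths(long[] values) {
        long sink = 0;
        for (long v : values) {
            sink += Long.toString(v).length();
        }
        return sink;
    }

    public static void main(String[] args) {
        // Hypothetical input: one million arbitrary long values.
        long[] values = new long[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i * 31L;
        }

        long t0 = System.nanoTime();
        double d = sumAsDouble(values);
        long t1 = System.nanoTime();
        long s = sumStringLengths(values);
        long t2 = System.nanoTime();

        System.out.printf("long->double: %.1f ns/op (sink=%.0f)%n", (t1 - t0) / (double) values.length, d);
        System.out.printf("long->String: %.1f ns/op (sink=%d)%n", (t2 - t1) / (double) values.length, s);
    }
}
```

For trustworthy measurements a harness such as JMH would be the better choice; the loop above only illustrates which two conversions are being compared.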
Our optimization is actually very simple: handle the different types separately instead of converting everything to strings and processing them uniformly. We added type checks in `ValueCountAggregator` and special-cased the numeric and `geo_point` types so that they skip the type conversion. Because far fewer strings are created, the improvement is significant.

The last six columns are `value_count` timings per field type, before and after the change.

| hits        | avg     | sum     | double before | double after | keyword before | keyword after | geo_point before | geo_point after |
| ----------- | ------- | ------- | ------------- | ------------ | -------------- | ------------- | ---------------- | --------------- |
| 20,000      | .038s   | .033s   | .063s         | .026s        | .030s          | .030s         | .038s            | .015s           |
| 200,000     | .127s   | .125s   | .334s         | .078s        | .116s          | .099s         | .278s            | .031s           |
| 2,000,000   | .789s   | .729s   | 3.176s        | .439s        | .348s          | .386s         | 3.365s           | .178s           |
| 20,000,000  | 4.200s  | 3.239s  | 22.787s       | 2.700s       | 2.500s         | 2.600s        | 25.192s          | 1.278s          |
| 200,000,000 | 21.000s | 22.000s | 154.917s      | 18.990s      | 19.000s        | 20.000s       | 168.971s         | 9.093s          |

- The results are more in line with expectations. `value_count` now takes about the same time as `avg`, `sum`, etc., or even less. Previously, `value_count` took much longer than `avg` and `sum`, and the gap widened as the amount of data grew.
- For numeric types such as `double` and `long`, performance improves by about 8 to 9 times; for the `geo_point` type, performance improves by 18 to 20 times.
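
The idea behind the change can be sketched in plain Java. The example below is a simplified illustration, not the Elasticsearch code: the class `TypedValueCountSketch` and the `long[][]` model of "one array of values per document" are assumptions made for the sketch. It contrasts counting by rendering every value as a string with counting by reading only the number of values per document.

```java
public class TypedValueCountSketch {

    /** Simplified old path: every numeric value is rendered as a String before it is counted. */
    static long countViaStrings(long[][] docs) {
        long count = 0;
        for (long[] values : docs) {
            for (long v : values) {
                String rendered = Long.toString(v); // one allocation per value
                count += rendered.isEmpty() ? 0 : 1; // the text itself is never needed
            }
        }
        return count;
    }

    /** Simplified new path: only the number of values in each document is read. */
    static long countDirect(long[][] docs) {
        long count = 0;
        for (long[] values : docs) {
            count += values.length; // a single value adds 1, an array adds n
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical documents: single value, two values, no value, single value.
        long[][] docs = { {1L}, {2L, 3L}, {}, {4L} };
        System.out.println(countViaStrings(docs));
        System.out.println(countDirect(docs));
    }
}
```

Both methods return the same count; the second simply avoids creating a string for each value, which is where the commit message locates the overhead.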
1 parent da976d2 commit b1afca1


4 files changed: +231 -0 lines changed


docs/reference/release-notes.asciidoc

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@
 
 This section summarizes the changes in each release.
 
+* <<release-notes-7.8.0>>
 * <<release-notes-7.7.0>>
 * <<release-notes-7.6.2>>
 * <<release-notes-7.6.1>>
@@ -32,6 +33,7 @@ This section summarizes the changes in each release.
 
 --
 
+include::release-notes/7.8.asciidoc[]
 include::release-notes/7.7.asciidoc[]
 include::release-notes/7.6.asciidoc[]
 include::release-notes/7.5.asciidoc[]

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+[[release-notes-7.8.0]]
+== {es} version 7.8.0
+
+coming[7.8.0]
+
+[[breaking-7.8.0]]
+[float]
+=== Breaking changes
+
+Search::
+* Scripts used in `value_count` will now receive a number if they are counting
+a numeric field and a `GeoPoint` if they are counting a `geo_point` fields.
+They used to always receive the `String` representation of those values.
+{pull}54854[#54854]
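
To illustrate the breaking change described in this release note, here is a small sketch of a value script that tolerates both the old `String` input and the new numeric input. It is written as a plain Java lambda over a `_value` variable, in the style of the mock scripts used in the tests of this commit; the class `ValueScriptSketch`, the `PLUS_ONE` lambda, and the `run` helper are hypothetical, and a real Painless script would look different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ValueScriptSketch {

    /** Hypothetical script body: adds 1 to `_value` whether it arrives as a Number or as a String. */
    static final Function<Map<String, Object>, Object> PLUS_ONE = vars -> {
        Object value = vars.get("_value");
        if (value instanceof Number) {
            // New behavior: numeric fields hand the script a number.
            return ((Number) value).doubleValue() + 1;
        }
        // Old behavior: the script received the String representation.
        return Double.parseDouble(String.valueOf(value)) + 1;
    };

    /** Helper that builds the variable map and applies the script. */
    static Object run(Object value) {
        Map<String, Object> vars = new HashMap<>();
        vars.put("_value", value);
        return PLUS_ONE.apply(vars);
    }

    public static void main(String[] args) {
        System.out.println(run(7L));  // numeric input
        System.out.println(run("7")); // string input
    }
}
```

A script that blindly casts `_value` to `String` would fail with a `ClassCastException` once it receives a number, which is why the change is listed as breaking.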

server/src/main/java/org/elasticsearch/search/aggregations/metrics/ValueCountAggregator.java

Lines changed: 30 additions & 0 deletions
@@ -19,9 +19,11 @@
 package org.elasticsearch.search.aggregations.metrics;
 
 import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.SortedNumericDocValues;
 import org.elasticsearch.common.lease.Releasables;
 import org.elasticsearch.common.util.BigArrays;
 import org.elasticsearch.common.util.LongArray;
+import org.elasticsearch.index.fielddata.MultiGeoPointValues;
 import org.elasticsearch.index.fielddata.SortedBinaryDocValues;
 import org.elasticsearch.search.aggregations.Aggregator;
 import org.elasticsearch.search.aggregations.InternalAggregation;
@@ -62,6 +64,34 @@ public LeafBucketCollector getLeafCollector(LeafReaderContext ctx,
             return LeafBucketCollector.NO_OP_COLLECTOR;
         }
         final BigArrays bigArrays = context.bigArrays();
+
+        if (valuesSource instanceof ValuesSource.Numeric) {
+            final SortedNumericDocValues values = ((ValuesSource.Numeric)valuesSource).longValues(ctx);
+            return new LeafBucketCollectorBase(sub, values) {
+
+                @Override
+                public void collect(int doc, long bucket) throws IOException {
+                    counts = bigArrays.grow(counts, bucket + 1);
+                    if (values.advanceExact(doc)) {
+                        counts.increment(bucket, values.docValueCount());
+                    }
+                }
+            };
+        }
+        if (valuesSource instanceof ValuesSource.Bytes.GeoPoint) {
+            MultiGeoPointValues values = ((ValuesSource.GeoPoint)valuesSource).geoPointValues(ctx);
+            return new LeafBucketCollectorBase(sub, null) {
+
+                @Override
+                public void collect(int doc, long bucket) throws IOException {
+                    counts = bigArrays.grow(counts, bucket + 1);
+                    if (values.advanceExact(doc)) {
+                        counts.increment(bucket, values.docValueCount());
+                    }
+                }
+            };
+        }
+        // The following is default collector. Including the keyword FieldType
         final SortedBinaryDocValues values = valuesSource.bytesValues(ctx);
         return new LeafBucketCollectorBase(sub, values) {
 
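
The grow-then-increment pattern used by both new collectors can be sketched without any Elasticsearch classes. In the sketch below, a plain `long[]` stands in for the `LongArray` managed by `BigArrays`; the class `BucketCountsSketch` and its growth policy are assumptions made for illustration and do not reflect how `BigArrays` actually allocates memory.

```java
import java.util.Arrays;

public class BucketCountsSketch {

    // Stand-in for the per-bucket LongArray; starts with room for a single bucket.
    private long[] counts = new long[1];

    /** Makes room for the bucket if needed, then adds the number of values found in the document. */
    void collect(long bucket, int docValueCount) {
        if (bucket >= counts.length) {
            // Hypothetical growth policy: at least double, and always large enough for the bucket.
            int newSize = Math.max(counts.length * 2, (int) bucket + 1);
            counts = Arrays.copyOf(counts, newSize);
        }
        counts[(int) bucket] += docValueCount;
    }

    /** Returns the count for a bucket, or 0 if the bucket was never collected. */
    long get(long bucket) {
        return bucket < counts.length ? counts[(int) bucket] : 0L;
    }

    public static void main(String[] args) {
        BucketCountsSketch sketch = new BucketCountsSketch();
        sketch.collect(0, 1); // document with a single value
        sketch.collect(3, 2); // document with two values in a different bucket
        sketch.collect(3, 1);
        System.out.println(sketch.get(0));
        System.out.println(sketch.get(3));
    }
}
```

The point of the pattern is that the collector only ever needs the per-document value count, never the values themselves.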

server/src/test/java/org/elasticsearch/search/aggregations/metrics/ValueCountAggregatorTests.java

Lines changed: 185 additions & 0 deletions
@@ -20,10 +20,14 @@
 package org.elasticsearch.search.aggregations.metrics;
 
 import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.DoubleDocValuesField;
 import org.apache.lucene.document.IntPoint;
+import org.apache.lucene.document.LatLonDocValuesField;
 import org.apache.lucene.document.NumericDocValuesField;
 import org.apache.lucene.document.SortedDocValuesField;
 import org.apache.lucene.document.SortedNumericDocValuesField;
+import org.apache.lucene.document.SortedSetDocValuesField;
 import org.apache.lucene.index.DirectoryReader;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.RandomIndexWriter;
@@ -35,6 +39,7 @@
 import org.apache.lucene.util.BytesRef;
 import org.elasticsearch.common.CheckedConsumer;
 import org.elasticsearch.common.geo.GeoPoint;
+import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.index.mapper.BooleanFieldMapper;
 import org.elasticsearch.index.mapper.DateFieldMapper;
 import org.elasticsearch.index.mapper.GeoPointFieldMapper;
@@ -44,22 +49,103 @@
 import org.elasticsearch.index.mapper.NumberFieldMapper;
 import org.elasticsearch.index.mapper.RangeFieldMapper;
 import org.elasticsearch.index.mapper.RangeType;
+import org.elasticsearch.script.MockScriptEngine;
+import org.elasticsearch.script.Script;
+import org.elasticsearch.script.ScriptEngine;
+import org.elasticsearch.script.ScriptModule;
+import org.elasticsearch.script.ScriptService;
+import org.elasticsearch.script.ScriptType;
+import org.elasticsearch.search.aggregations.AggregationBuilder;
 import org.elasticsearch.search.aggregations.AggregatorTestCase;
 import org.elasticsearch.search.aggregations.support.AggregationInspectionHelper;
+import org.elasticsearch.search.aggregations.support.CoreValuesSourceType;
 import org.elasticsearch.search.aggregations.support.ValueType;
+import org.elasticsearch.search.aggregations.support.ValuesSourceType;
 
 import java.io.IOException;
 import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
 import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
 import java.util.Set;
 import java.util.function.Consumer;
+import java.util.function.Function;
 
 import static java.util.Collections.singleton;
 
 public class ValueCountAggregatorTests extends AggregatorTestCase {
 
     private static final String FIELD_NAME = "field";
 
+    private static final String STRING_VALUE_SCRIPT = "string_value";
+    private static final String NUMBER_VALUE_SCRIPT = "number_value";
+    private static final String SINGLE_SCRIPT = "single";
+
+    @Override
+    protected AggregationBuilder createAggBuilderForTypeTest(MappedFieldType fieldType, String fieldName) {
+        return new ValueCountAggregationBuilder("foo", null).field(fieldName);
+    }
+
+    @Override
+    protected List<ValuesSourceType> getSupportedValuesSourceTypes() {
+        return Arrays.asList(
+            CoreValuesSourceType.NUMERIC,
+            CoreValuesSourceType.BYTES,
+            CoreValuesSourceType.GEOPOINT,
+            CoreValuesSourceType.RANGE
+        );
+    }
+
+    @Override
+    protected ScriptService getMockScriptService() {
+        Map<String, Function<Map<String, Object>, Object>> scripts = new HashMap<>();
+
+        scripts.put(STRING_VALUE_SCRIPT, vars -> (Double.valueOf((String) vars.get("_value")) + 1));
+        scripts.put(NUMBER_VALUE_SCRIPT, vars -> (((Number) vars.get("_value")).doubleValue() + 1));
+        scripts.put(SINGLE_SCRIPT, vars -> 1);
+
+        MockScriptEngine scriptEngine = new MockScriptEngine(MockScriptEngine.NAME,
+            scripts,
+            Collections.emptyMap());
+        Map<String, ScriptEngine> engines = Collections.singletonMap(scriptEngine.getType(), scriptEngine);
+
+        return new ScriptService(Settings.EMPTY, engines, ScriptModule.CORE_CONTEXTS);
+    }
+
+
+    public void testGeoField() throws IOException {
+        testCase(new MatchAllDocsQuery(), ValueType.GEOPOINT, iw -> {
+            for (int i = 0; i < 10; i++) {
+                Document document = new Document();
+                document.add(new LatLonDocValuesField("field", 10, 10));
+                iw.addDocument(document);
+            }
+        }, count -> assertEquals(10L, count.getValue()));
+    }
+
+    public void testDoubleField() throws IOException {
+        testCase(new MatchAllDocsQuery(), ValueType.DOUBLE, iw -> {
+            for (int i = 0; i < 15; i++) {
+                Document document = new Document();
+                document.add(new DoubleDocValuesField(FIELD_NAME, 23D));
+                iw.addDocument(document);
+            }
+        }, count -> assertEquals(15L, count.getValue()));
+    }
+
+    public void testKeyWordField() throws IOException {
+        testCase(new MatchAllDocsQuery(), ValueType.STRING, iw -> {
+            for (int i = 0; i < 20; i++) {
+                Document document = new Document();
+                document.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("stringValue")));
+                document.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("string11Value")));
+                iw.addDocument(document);
+            }
+        }, count -> assertEquals(40L, count.getValue()));
+    }
+
     public void testNoDocs() throws IOException {
         for (ValueType valueType : ValueType.values()) {
             testCase(new MatchAllDocsQuery(), valueType, iw -> {
@@ -189,6 +275,105 @@ public void testRangeFieldValues() throws IOException {
         }, fieldType);
     }
 
+    public void testValueScriptNumber() throws IOException {
+        ValueCountAggregationBuilder aggregationBuilder = new ValueCountAggregationBuilder("name", null)
+            .field(FIELD_NAME)
+            .script(new Script(ScriptType.INLINE, MockScriptEngine.NAME, NUMBER_VALUE_SCRIPT, Collections.emptyMap()));
+
+        MappedFieldType fieldType = createMappedFieldType(ValueType.NUMERIC);
+        fieldType.setName(FIELD_NAME);
+        fieldType.setHasDocValues(true);
+
+        testCase(aggregationBuilder, new MatchAllDocsQuery(), iw -> {
+            iw.addDocument(singleton(new NumericDocValuesField(FIELD_NAME, 7)));
+            iw.addDocument(singleton(new NumericDocValuesField(FIELD_NAME, 8)));
+            iw.addDocument(singleton(new NumericDocValuesField(FIELD_NAME, 9)));
+        }, card -> {
+            assertEquals(3, card.getValue(), 0);
+            assertTrue(AggregationInspectionHelper.hasValue(card));
+        }, fieldType);
+    }
+
+    public void testSingleScriptNumber() throws IOException {
+        ValueCountAggregationBuilder aggregationBuilder = new ValueCountAggregationBuilder("name", null)
+            .field(FIELD_NAME);
+
+        MappedFieldType fieldType = createMappedFieldType(ValueType.NUMERIC);
+        fieldType.setName(FIELD_NAME);
+        fieldType.setHasDocValues(true);
+
+        testCase(aggregationBuilder, new MatchAllDocsQuery(), iw -> {
+            Document doc = new Document();
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 7));
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 7));
+            iw.addDocument(doc);
+
+            doc = new Document();
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 8));
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 8));
+            iw.addDocument(doc);
+
+            doc = new Document();
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 1));
+            doc.add(new SortedNumericDocValuesField(FIELD_NAME, 1));
+            iw.addDocument(doc);
+        }, card -> {
+            // note: this is 6, even though the script returns a single value. ValueCount does not de-dedupe
+            assertEquals(6, card.getValue(), 0);
+            assertTrue(AggregationInspectionHelper.hasValue(card));
+        }, fieldType);
+    }
+
+    public void testValueScriptString() throws IOException {
+        ValueCountAggregationBuilder aggregationBuilder = new ValueCountAggregationBuilder("name", null)
+            .field(FIELD_NAME)
+            .script(new Script(ScriptType.INLINE, MockScriptEngine.NAME, STRING_VALUE_SCRIPT, Collections.emptyMap()));
+
+        MappedFieldType fieldType = createMappedFieldType(ValueType.STRING);
+        fieldType.setName(FIELD_NAME);
+        fieldType.setHasDocValues(true);
+
+        testCase(aggregationBuilder, new MatchAllDocsQuery(), iw -> {
+            iw.addDocument(singleton(new SortedDocValuesField(FIELD_NAME, new BytesRef("1"))));
+            iw.addDocument(singleton(new SortedDocValuesField(FIELD_NAME, new BytesRef("2"))));
+            iw.addDocument(singleton(new SortedDocValuesField(FIELD_NAME, new BytesRef("3"))));
+        }, card -> {
+            assertEquals(3, card.getValue(), 0);
+            assertTrue(AggregationInspectionHelper.hasValue(card));
+        }, fieldType);
+    }
+
+    public void testSingleScriptString() throws IOException {
+        ValueCountAggregationBuilder aggregationBuilder = new ValueCountAggregationBuilder("name", null)
+            .field(FIELD_NAME);
+
+        MappedFieldType fieldType = createMappedFieldType(ValueType.STRING);
+        fieldType.setName(FIELD_NAME);
+        fieldType.setHasDocValues(true);
+
+        testCase(aggregationBuilder, new MatchAllDocsQuery(), iw -> {
+            Document doc = new Document();
+            // Note: unlike numerics, lucene de-dupes strings so we increment here
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("1")));
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("2")));
+            iw.addDocument(doc);
+
+            doc = new Document();
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("3")));
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("4")));
+            iw.addDocument(doc);
+
+            doc = new Document();
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("5")));
+            doc.add(new SortedSetDocValuesField(FIELD_NAME, new BytesRef("6")));
+            iw.addDocument(doc);
+        }, card -> {
+            // note: this is 6, even though the script returns a single value. ValueCount does not de-dedupe
+            assertEquals(6, card.getValue(), 0);
+            assertTrue(AggregationInspectionHelper.hasValue(card));
+        }, fieldType);
+    }
+
     private void testCase(Query query,
                           ValueType valueType,
                           CheckedConsumer<RandomIndexWriter, IOException> indexer,
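
The test comments above note that Lucene de-dupes string values within a document but keeps duplicate numeric values. The following sketch models that difference with standard collections only; the class `DedupeCountSketch` and both methods are hypothetical, with a `TreeSet` standing in for set-style doc values and a sorted list standing in for sorted numeric doc values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

public class DedupeCountSketch {

    /** Set-style storage: duplicate values inside one document collapse into one. */
    static int countSetValues(List<String> fieldValues) {
        return new TreeSet<>(fieldValues).size();
    }

    /** Sorted-numeric-style storage: values are sorted but duplicates are kept. */
    static int countNumericValues(List<Long> fieldValues) {
        List<Long> sorted = new ArrayList<>(fieldValues);
        Collections.sort(sorted);
        return sorted.size();
    }

    public static void main(String[] args) {
        List<Long> numeric = new ArrayList<>();
        numeric.add(7L);
        numeric.add(7L);

        List<String> strings = new ArrayList<>();
        strings.add("1");
        strings.add("1");

        System.out.println(countNumericValues(numeric)); // duplicates counted
        System.out.println(countSetValues(strings));     // duplicates collapsed
    }
}
```

This is why `testSingleScriptNumber` can add the same number twice per document and still expect 6, while `testSingleScriptString` has to use distinct strings to reach the same count.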
