
Commit e99d287

Add Variable Width Histogram Aggregation (#42035)
Implements a new histogram aggregation called `variable_width_histogram` which dynamically determines bucket intervals based on document groupings. These groups are determined by running a one-pass clustering algorithm on each shard and then reducing each shard's clusters using an agglomerative clustering algorithm. This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The algorithm was lightly inspired by [this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches a small number of documents to sample the data and determine initial clusters. Subsequent documents are then placed into one of these clusters, or into a new one if they are an outlier. This algorithm is described in more detail in the aggregation's docs.

At reduce time, a [hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304) continually merges the closest buckets from all shards (based on their centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's min value is used as its key in the histogram. Furthermore, buckets are merged based on their centroids and not their bounds, so it is possible that adjacent buckets will overlap after reduction. Because each bucket's key is its min, this overlap is not shown in the final histogram. However, when such overlap occurs, we set the key of the bucket with the larger centroid to the midpoint between its minimum and the smaller bucket's maximum: `min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to increase the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. Resolving #50863 in the future would let us solve this issue.

It doesn't make sense for this aggregation to support the `min_doc_count` parameter, since clusters are determined dynamically. The `order` parameter is not supported here, to keep this large PR from becoming too complex.
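As a rough illustration of the reduce-time merging described above, here is a minimal, self-contained sketch. It is an assumption-laden illustration only: the Bucket class, the method names, and the doc-count-weighted centroid update are choices of this sketch, not the PR's actual types (see InternalVariableWidthHistogram in the diff for the real implementation).

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: merges the two buckets with the nearest centroids until the
// target bucket count is reached. Names and structure are hypothetical.
class CentroidMergeSketch {
    static class Bucket {
        double centroid, min, max;
        long docCount;
        Bucket(double centroid, double min, double max, long docCount) {
            this.centroid = centroid; this.min = min; this.max = max; this.docCount = docCount;
        }
    }

    static List<Bucket> reduce(List<Bucket> buckets, int targetBuckets) {
        buckets.sort(Comparator.comparingDouble(b -> b.centroid));
        List<Bucket> merged = new ArrayList<>(buckets);
        while (merged.size() > targetBuckets) {
            // Since the list stays sorted by centroid, the globally closest pair is adjacent.
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < merged.size(); i++) {
                double gap = merged.get(i + 1).centroid - merged.get(i).centroid;
                if (gap < bestGap) { bestGap = gap; best = i; }
            }
            Bucket a = merged.get(best), b = merged.remove(best + 1);
            long total = a.docCount + b.docCount;
            a.centroid = (a.centroid * a.docCount + b.centroid * b.docCount) / total; // weighted mean
            a.min = Math.min(a.min, b.min);
            a.max = Math.max(a.max, b.max);
            a.docCount = total;
        }
        return merged;
    }
}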
1 parent 48f4a8d commit e99d287

File tree

21 files changed: +3115 −1 lines changed


client/rest-high-level/src/main/java/org/elasticsearch/client/RestHighLevelClient.java

+4

@@ -111,6 +111,8 @@
 import org.elasticsearch.search.aggregations.bucket.histogram.ParsedAutoDateHistogram;
 import org.elasticsearch.search.aggregations.bucket.histogram.ParsedDateHistogram;
 import org.elasticsearch.search.aggregations.bucket.histogram.ParsedHistogram;
+import org.elasticsearch.search.aggregations.bucket.histogram.ParsedVariableWidthHistogram;
+import org.elasticsearch.search.aggregations.bucket.histogram.VariableWidthHistogramAggregationBuilder;
 import org.elasticsearch.search.aggregations.bucket.missing.MissingAggregationBuilder;
 import org.elasticsearch.search.aggregations.bucket.missing.ParsedMissing;
 import org.elasticsearch.search.aggregations.bucket.nested.NestedAggregationBuilder;
@@ -1929,6 +1931,8 @@ static List<NamedXContentRegistry.Entry> getDefaultNamedXContents() {
     map.put(HistogramAggregationBuilder.NAME, (p, c) -> ParsedHistogram.fromXContent(p, (String) c));
     map.put(DateHistogramAggregationBuilder.NAME, (p, c) -> ParsedDateHistogram.fromXContent(p, (String) c));
     map.put(AutoDateHistogramAggregationBuilder.NAME, (p, c) -> ParsedAutoDateHistogram.fromXContent(p, (String) c));
+    map.put(VariableWidthHistogramAggregationBuilder.NAME,
+        (p, c) -> ParsedVariableWidthHistogram.fromXContent(p, (String) c));
     map.put(StringTerms.NAME, (p, c) -> ParsedStringTerms.fromXContent(p, (String) c));
     map.put(LongTerms.NAME, (p, c) -> ParsedLongTerms.fromXContent(p, (String) c));
     map.put(DoubleTerms.NAME, (p, c) -> ParsedDoubleTerms.fromXContent(p, (String) c));

@@ -0,0 +1,91 @@

[[search-aggregations-bucket-variablewidthhistogram-aggregation]]
=== Variable Width Histogram Aggregation

This is a multi-bucket aggregation similar to <<search-aggregations-bucket-histogram-aggregation>>.
However, the width of each bucket is not specified. Rather, a target number of buckets is provided and bucket intervals
are dynamically determined based on the document distribution. This is done using a simple one-pass document clustering algorithm
that aims to obtain low distances between bucket centroids. Unlike other multi-bucket aggregations, the intervals will not
necessarily have a uniform width.

TIP: The number of buckets returned will always be less than or equal to the target number.

Requesting a target of 2 buckets:

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "variable_width_histogram" : {
                "field" : "price",
                "buckets" : 2
            }
        }
    }
}
--------------------------------------------------
// TEST[setup:sales]

Response:

[source,console-result]
--------------------------------------------------
{
    ...
    "aggregations": {
        "prices" : {
            "buckets": [
                {
                    "min": 10.0,
                    "key": 30.0,
                    "max": 50.0,
                    "doc_count": 2
                },
                {
                    "min": 150.0,
                    "key": 185.0,
                    "max": 200.0,
                    "doc_count": 5
                }
            ]
        }
    }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

==== Clustering Algorithm
Each shard fetches the first `initial_buffer` documents and stores them in memory. Once the buffer is full, these documents
are sorted and linearly separated into `3/4 * shard_size` buckets.
Next, each remaining document is either collected into the nearest bucket, or placed into a new bucket if it is distant
from all the existing ones. At most `shard_size` total buckets are created.

In the reduce step, the coordinating node sorts the buckets from all shards by their centroids. Then, the two buckets
with the nearest centroids are repeatedly merged until the target number of buckets is achieved.
This merging procedure is a form of https://en.wikipedia.org/wiki/Hierarchical_clustering[agglomerative hierarchical clustering].

TIP: A shard can return fewer than `shard_size` buckets, but it cannot return more.

==== Shard size
The `shard_size` parameter specifies the number of buckets that the coordinating node will request from each shard.
A higher `shard_size` leads each shard to produce smaller buckets. This reduces the likelihood of buckets overlapping
after the reduction step. Increasing the `shard_size` will improve the accuracy of the histogram, but it will
also make it more expensive to compute the final result, because bigger priority queues will have to be managed on a
shard level and the data transfers between the nodes and the client will be larger.

TIP: The `buckets`, `shard_size`, and `initial_buffer` parameters are optional. By default, `buckets = 10`, `shard_size = 500` and `initial_buffer = min(50 * shard_size, 50000)`.

==== Initial Buffer
The `initial_buffer` parameter can be used to specify the number of individual documents that will be stored in memory
on a shard before the initial bucketing algorithm is run. Bucket distribution is determined using this sample
of `initial_buffer` documents. So, although a higher `initial_buffer` will use more memory, it will lead to more representative
clusters.

==== Bucket bounds are approximate
During the reduce step, the coordinating node continuously merges the two buckets with the nearest centroids. If two buckets have
overlapping bounds but distant centroids, then it is possible that they will not be merged. Because of this, after
reduction the maximum value in some interval (`max`) might be greater than the minimum value in the subsequent
bucket (`min`). To reduce the impact of this error, when such an overlap occurs, the bound between these intervals is adjusted to be `(max + min) / 2`.

TIP: Bucket bounds are very sensitive to outliers.
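To make the shard-level algorithm described in the docs above concrete, here is a hedged, self-contained sketch. Everything in it is an assumption for illustration — the class names, the even split of the sorted buffer, and especially the threshold-based notion of "distant" are not the commit's actual code, which uses its own criterion and BigArrays-backed storage.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy sketch of the one-pass clustering: seed clusters from a sorted buffer,
// then assign each remaining value to the nearest centroid or open a new
// cluster when the value is "distant" (here: a simple threshold, an assumption).
class OnePassClustererSketch {
    static class Cluster {
        double centroid, min, max;
        long docCount;
        Cluster(double v) { centroid = min = max = v; docCount = 1; }
        void add(double v) {
            docCount++;
            centroid += (v - centroid) / docCount; // incremental mean
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
    }

    final int shardSize;
    final double newClusterThreshold; // hypothetical "distant" cutoff
    final List<Cluster> clusters = new ArrayList<>();

    OnePassClustererSketch(int shardSize, double newClusterThreshold) {
        this.shardSize = shardSize;
        this.newClusterThreshold = newClusterThreshold;
    }

    // Seed clusters from the first initial_buffer values: sort, then split into 3/4 * shard_size groups.
    void seed(List<Double> buffer) {
        buffer.sort(Comparator.naturalOrder());
        int numInitial = Math.max(1, 3 * shardSize / 4);
        int perCluster = Math.max(1, buffer.size() / numInitial);
        for (int i = 0; i < buffer.size(); i++) {
            if (i % perCluster == 0 && clusters.size() < numInitial) {
                clusters.add(new Cluster(buffer.get(i)));
            } else {
                clusters.get(clusters.size() - 1).add(buffer.get(i));
            }
        }
    }

    // Collect one subsequent value: nearest cluster, or a new one (capped at shard_size total).
    void collect(double v) {
        Cluster nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (Cluster c : clusters) {
            double d = Math.abs(c.centroid - v);
            if (d < best) { best = d; nearest = c; }
        }
        if (nearest == null || (best > newClusterThreshold && clusters.size() < shardSize)) {
            clusters.add(new Cluster(v)); // outlier: open a new bucket
        } else {
            nearest.add(v);
        }
    }
}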

@@ -0,0 +1,50 @@

setup:
  - do:
      indices.create:
        index: test
        body:
          settings:
            number_of_replicas: 0
          mappings:
            properties:
              number:
                type: integer
  - do:
      bulk:
        refresh: true
        index: test
        body:
          - '{"index": {}}'
          - '{"number": -3}'
          - '{"index": {}}'
          - '{"number": -2}'
          - '{"index": {}}'
          - '{"number": 1}'
          - '{"index": {}}'
          - '{"number": 4}'
          - '{"index": {}}'
          - '{"number": 5}'

---
"basic":
  - skip:
      version: " - 7.99.99"
      reason: added in 8.0.0 (to be backported to 7.9.0)
  - do:
      search:
        body:
          size: 0
          aggs:
            histo:
              variable_width_histogram:
                field: number
                buckets: 3
  - match: { hits.total.value: 5 }
  - length: { aggregations.histo.buckets: 3 }
  - match: { aggregations.histo.buckets.0.key: -2.5 }
  - match: { aggregations.histo.buckets.0.doc_count: 2 }
  - match: { aggregations.histo.buckets.1.key: 1.0 }
  - match: { aggregations.histo.buckets.1.doc_count: 1 }
  - match: { aggregations.histo.buckets.2.key: 4.5 }
  - match: { aggregations.histo.buckets.2.doc_count: 2 }

server/src/main/java/org/elasticsearch/common/util/BigArrays.java

+30

@@ -691,6 +691,35 @@ public DoubleArray grow(DoubleArray array, long minSize) {
         return resize(array, newSize);
     }

+    public static class DoubleBinarySearcher extends BinarySearcher {
+
+        DoubleArray array;
+        double searchFor;
+
+        public DoubleBinarySearcher(DoubleArray array) {
+            this.array = array;
+            this.searchFor = Integer.MIN_VALUE;
+        }
+
+        @Override
+        protected int compare(int index) {
+            // Prevent use of BinarySearcher.search() and force the use of DoubleBinarySearcher.search()
+            assert this.searchFor != Integer.MIN_VALUE;
+
+            return Double.compare(array.get(index), searchFor);
+        }
+
+        @Override
+        protected double distance(int index) {
+            return Math.abs(array.get(index) - searchFor);
+        }
+
+        public int search(int from, int to, double searchFor) {
+            this.searchFor = searchFor;
+            return super.search(from, to);
+        }
+    }
+
     /**
      * Allocate a new {@link FloatArray}.
      * @param size the initial length of the array
@@ -782,3 +811,4 @@ public <T> ObjectArray<T> grow(ObjectArray<T> array, long minSize) {
         return resize(array, newSize);
     }
 }

server/src/main/java/org/elasticsearch/common/util/BinarySearcher.java

+117

@@ -0,0 +1,117 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.common.util;

/**
 * Performs binary search on an arbitrary data structure.
 *
 * To do a search, create a subclass and implement custom {@link #compare(int)} and {@link #distance(int)} methods.
 *
 * {@link BinarySearcher} knows nothing about the value being searched for or the underlying data structure.
 * These things should be determined by the subclass in its overridden methods.
 *
 * Refer to {@link BigArrays.DoubleBinarySearcher} for an example.
 *
 * NOTE: this class is not thread safe
 */
public abstract class BinarySearcher {

    /**
     * @return a negative integer, zero, or a positive integer if the array's value at <code>index</code> is less than,
     * equal to, or greater than the value being searched for.
     */
    protected abstract int compare(int index);

    /**
     * @return the magnitude of the distance between the element at <code>index</code> and the value being searched for.
     * It will usually be <code>Math.abs(array[index] - searchValue)</code>.
     */
    protected abstract double distance(int index);

    /**
     * @return the index whose underlying value is closest to the value being searched for.
     */
    private int getClosestIndex(int index1, int index2) {
        if (distance(index1) < distance(index2)) {
            return index1;
        } else {
            return index2;
        }
    }

    /**
     * Uses a binary search to determine the index of the element within the index range {from, ... , to} that is
     * closest to the search value.
     *
     * Unlike most binary search implementations, the value being searched for is not an argument to the search method.
     * Rather, this value should be stored by the subclass along with the underlying array.
     *
     * @return the index of the closest element.
     *
     * Requires: The underlying array should be sorted.
     **/
    public int search(int from, int to) {
        while (from < to) {
            int mid = (from + to) >>> 1;
            int compareResult = compare(mid);

            if (compareResult == 0) {
                // arr[mid] == value
                return mid;
            } else if (compareResult < 0) {
                // arr[mid] < val

                if (mid < to) {
                    // Check if val is between (mid, mid + 1) before setting left = mid + 1
                    // (mid < to) ensures that mid + 1 is not out of bounds
                    int compareValAfterMid = compare(mid + 1);
                    if (compareValAfterMid > 0) {
                        return getClosestIndex(mid, mid + 1);
                    }
                } else if (mid == to) {
                    // val > arr[mid] and there are no more elements above mid, so mid is the closest
                    return mid;
                }

                from = mid + 1;
            } else {
                // arr[mid] > val

                if (mid > from) {
                    // Check if val is between (mid - 1, mid)
                    // (mid > from) ensures that mid - 1 is not out of bounds
                    int compareValBeforeMid = compare(mid - 1);
                    if (compareValBeforeMid < 0) {
                        // val is between indices (mid - 1), mid
                        return getClosestIndex(mid, mid - 1);
                    }
                } else if (mid == 0) {
                    // val < arr[mid] and there are no more candidates below mid, so mid is the closest
                    return mid;
                }

                to = mid - 1;
            }
        }

        return from;
    }

}
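A quick usage sketch of the searcher pair above. The setup is illustrative (it assumes a test-style, non-recycling BigArrays instance); note that `to` is inclusive, per the Javadoc's {from, ... , to} range.

// Illustrative only: build a sorted DoubleArray and find the index closest to 5.0.
BigArrays bigArrays = BigArrays.NON_RECYCLING_INSTANCE; // assumption: test-style instance
DoubleArray sorted = bigArrays.newDoubleArray(5, false);
double[] values = new double[] {1.0, 3.0, 4.5, 9.0, 12.0};
for (int i = 0; i < values.length; i++) {
    sorted.set(i, values[i]);
}
BigArrays.DoubleBinarySearcher searcher = new BigArrays.DoubleBinarySearcher(sorted);
int closest = searcher.search(0, 4, 5.0); // returns 2, since |4.5 - 5.0| < |9.0 - 5.0|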

server/src/main/java/org/elasticsearch/search/SearchModule.java

+7

@@ -114,9 +114,11 @@
 import org.elasticsearch.search.aggregations.bucket.global.GlobalAggregationBuilder;
 import org.elasticsearch.search.aggregations.bucket.global.InternalGlobal;
 import org.elasticsearch.search.aggregations.bucket.histogram.AutoDateHistogramAggregationBuilder;
+import org.elasticsearch.search.aggregations.bucket.histogram.VariableWidthHistogramAggregationBuilder;
 import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramAggregationBuilder;
 import org.elasticsearch.search.aggregations.bucket.histogram.HistogramAggregationBuilder;
 import org.elasticsearch.search.aggregations.bucket.histogram.InternalAutoDateHistogram;
+import org.elasticsearch.search.aggregations.bucket.histogram.InternalVariableWidthHistogram;
 import org.elasticsearch.search.aggregations.bucket.histogram.InternalDateHistogram;
 import org.elasticsearch.search.aggregations.bucket.histogram.InternalHistogram;
 import org.elasticsearch.search.aggregations.bucket.missing.InternalMissing;
@@ -432,6 +434,11 @@ private ValuesSourceRegistry registerAggregations(List<SearchPlugin> plugins) {
         AutoDateHistogramAggregationBuilder.PARSER)
         .addResultReader(InternalAutoDateHistogram::new)
         .setAggregatorRegistrar(AutoDateHistogramAggregationBuilder::registerAggregators), builder);
+    registerAggregation(new AggregationSpec(VariableWidthHistogramAggregationBuilder.NAME,
+        VariableWidthHistogramAggregationBuilder::new,
+        VariableWidthHistogramAggregationBuilder.PARSER)
+        .addResultReader(InternalVariableWidthHistogram::new)
+        .setAggregatorRegistrar(VariableWidthHistogramAggregationBuilder::registerAggregators), builder);
     registerAggregation(new AggregationSpec(GeoDistanceAggregationBuilder.NAME, GeoDistanceAggregationBuilder::new,
         GeoDistanceAggregationBuilder::parse)
         .addResultReader(InternalGeoDistance::new)

server/src/main/java/org/elasticsearch/search/aggregations/Aggregation.java

+4

@@ -68,5 +68,9 @@ final class CommonFields extends ParseField.CommonFields {
         public static final ParseField FROM_AS_STRING = new ParseField("from_as_string");
         public static final ParseField TO = new ParseField("to");
         public static final ParseField TO_AS_STRING = new ParseField("to_as_string");
+        public static final ParseField MIN = new ParseField("min");
+        public static final ParseField MIN_AS_STRING = new ParseField("min_as_string");
+        public static final ParseField MAX = new ParseField("max");
+        public static final ParseField MAX_AS_STRING = new ParseField("max_as_string");
     }
 }

server/src/main/java/org/elasticsearch/search/aggregations/bucket/BucketsAggregator.java

+6

@@ -101,6 +101,12 @@ public final void collectExistingBucket(LeafBucketCollector subCollector, int do
         subCollector.collect(doc, bucketOrd);
     }

+    /**
+     * This only tidies up doc counts. Call {@link MergingBucketsDeferringCollector#mergeBuckets(long[])} to merge the actual
+     * ordinals and doc ID deltas.
+     *
+     * Refer to that method for documentation about the merge map.
+     */
     public final void mergeBuckets(long[] mergeMap, long newNumBuckets) {
         try (IntArray oldDocCounts = docCounts) {
             docCounts = bigArrays.newIntArray(newNumBuckets, true);
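The merge map mentioned in the new Javadoc can be illustrated as follows. The concrete values are invented for the example; the shape — an array mapping each old bucket ordinal to its new ordinal — is what `mergeBuckets` consumes.

// Hypothetical: five buckets (ordinals 0..4); the clustering decided to merge
// ordinals 1 and 2 together, and ordinals 3 and 4 together.
long[] mergeMap = new long[] {0, 1, 1, 2, 2}; // mergeMap[oldOrd] = newOrd
long newNumBuckets = 3;
// mergeBuckets(mergeMap, newNumBuckets) then re-buckets the doc counts so that,
// e.g., newDocCounts[1] == oldDocCounts[1] + oldDocCounts[2].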
