This PR allows the Lucene source operator (that runs the search) to be parallelized, either by slicing on the document id space, or by slicing on the segment space.
It also comes with a bunch of benchmarks to show the effects of running in various configurations.
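To make the doc-id-space slicing concrete, here is a minimal, self-contained sketch (the class and method names are hypothetical, not the PR's actual code): it partitions a reader's doc-id range `[0, maxDoc)` into roughly equal contiguous slices, one per parallel driver.

```java
import java.util.ArrayList;
import java.util.List;

public class DocIdSlicer {
    // A slice is a half-open range [min, max) of document ids.
    public record Slice(int min, int max) {}

    // Partition [0, maxDoc) into numSlices contiguous ranges whose sizes
    // differ by at most one document.
    public static List<Slice> slice(int maxDoc, int numSlices) {
        List<Slice> slices = new ArrayList<>();
        int base = maxDoc / numSlices;
        int remainder = maxDoc % numSlices;
        int start = 0;
        for (int i = 0; i < numSlices; i++) {
            int size = base + (i < remainder ? 1 : 0);
            slices.add(new Slice(start, start + size));
            start += size;
        }
        return slices;
    }

    public static void main(String[] args) {
        // 10 docs over 3 slices -> [0,4), [4,7), [7,10)
        System.out.println(slice(10, 3));
    }
}
```

Segment-level slicing works analogously, except the unit of partitioning is a whole segment (leaf reader) rather than an arbitrary doc-id range, which is why it only helps when the index has multiple segments.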
The experiment I looked at runs the calculation of an average on a long field. To allow for parallelization, the avg operator comes in two flavors, enabling a map/reduce pattern: the first one (map) takes raw input (the numbers) and emits a sum and a count at the end; the second one (reduce) takes sum/count pairs, sums them up, and emits the avg at the end.
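The two flavors can be sketched as follows (an illustrative, self-contained example; the names are hypothetical and not the PR's actual operator classes): the map phase folds raw longs into a partial (sum, count), and the reduce phase combines the partials from all parallel drivers into the final average.

```java
public class AvgSketch {
    // Partial aggregation state emitted by the map phase.
    public record SumCount(long sum, long count) {}

    // Map phase: fold raw values into a partial (sum, count).
    public static SumCount map(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return new SumCount(sum, values.length);
    }

    // Reduce phase: combine partials from all parallel drivers and
    // emit the final average.
    public static double reduce(java.util.List<SumCount> partials) {
        long sum = 0;
        long count = 0;
        for (SumCount p : partials) {
            sum += p.sum();
            count += p.count();
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        SumCount a = map(new long[] {1, 2, 3});
        SumCount b = map(new long[] {4, 5});
        System.out.println(reduce(java.util.List.of(a, b))); // avg of 1..5
    }
}
```

Because sum and count commute, the partials can be produced by any number of drivers in any order, which is what makes the avg computation embarrassingly parallel once the input is sliced.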
Various configurations are tested:
- testLongAvgSingleThreadedAvg: Running everything single-threaded with a single driver (for baseline performance)
- testLongAvgMultiThreadedAvgWithSingleThreadedSearch: Running the search part single-threaded, but parallelizing the numeric doc value extraction and avg computation
- testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch: Running the search part as well as the avg computation in parallel, using segment-level parallelism
- testLongAvgMultiThreadedAvgWithMultiThreadedSearch: Running the search part as well as the avg computation in parallel, using document-id-space-level parallelism (see also https://issues.apache.org/jira/browse/LUCENE-8675)
To understand the effect of number of segments, we're running the benchmark in two configurations (data force-merged to 1 segment, and data force-merged to 10 segments).
Here are the results (from my MacBook Pro with 8 cores, albeit imprecise due to thermal throttling from the extreme heat in my office today):
```
Benchmark                                                                    (maxNumSegments)  (numDocs)  Mode  Cnt    Score     Error  Units
OperatorBenchmark.testLongAvgSingleThreadedAvg                                              1  100000000  avgt    3  664.127 ±  63.200  ms/op
OperatorBenchmark.testLongAvgSingleThreadedAvg                                             10  100000000  avgt    3  654.669 ±  88.197  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithSingleThreadedSearch                       1  100000000  avgt    3  153.785 ±  69.273  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithSingleThreadedSearch                      10  100000000  avgt    3  161.570 ± 172.318  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch                 1  100000000  avgt    3  687.172 ±  41.166  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch                10  100000000  avgt    3  168.887 ±  81.306  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSearch                        1  100000000  avgt    3  111.377 ±  60.332  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSearch                       10  100000000  avgt    3  111.535 ±  87.793  ms/op
```
Some explanations for the results observed:
- Even when the search part stays single-threaded, it is useful to parallelize the aggregations running on top of it.
- The aggregations in this benchmark are very lightweight, so even with enough cores the single-threaded search can remain the bottleneck (since it is a match-all query, the bottleneck here is the creation of the arrays that store the doc ids).
- Fully parallelizing things (i.e. the search part as well) can make things even faster. Segment-level parallelism obviously only works when you have multiple segments. With only a single segment, you can still parallelize just the aggregation bits, or partition by doc-id space (which interferes with optimizations that leverage segment-level information).
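The "single-threaded search can remain the bottleneck" observation above is Amdahl's law in action. A quick back-of-the-envelope check (the 20% figure below is an illustrative assumption, not measured from the benchmark): if a fraction of the work stays serial, the attainable speedup is capped regardless of core count.

```java
public class Amdahl {
    // Amdahl's law: speedup on n cores when a fraction `serial` of the
    // work cannot be parallelized is 1 / (serial + (1 - serial) / n).
    public static double speedup(double serial, int cores) {
        return 1.0 / (serial + (1.0 - serial) / cores);
    }

    public static void main(String[] args) {
        // Illustrative: if the serial match-all search (doc-id array
        // creation) were ~20% of the single-threaded run, 8 cores could
        // give at most ~3.3x.
        System.out.printf("%.2f%n", speedup(0.2, 8));
    }
}
```

This is consistent with the numbers above: parallelizing only the aggregations already yields roughly a 4x improvement, and removing the serial search part (the fully parallel configurations) pushes it further toward the core count.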
Changed file: x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/action/compute/lucene/LuceneSourceOperator.java