Hadoop Plugin: Use HDFS as gateway storage #189
Comments
Hadoop Plugin: Use HDFS as gateway storage, closed by 28fa384.
costin pushed a commit that referenced this issue on Dec 6, 2022:
This PR allows the Lucene source operator (that runs the search) to be parallelized, either by slicing on the document id space or by slicing on the segment space. It also comes with a bunch of benchmarks to show the effects of running in various configurations.

The experiment I looked at was just running the calculation of an average on a long field. To allow for parallelization, the avg operator comes in two flavors, allowing a map/reduce pattern: the first one (map) takes raw input (the numbers) and emits a sum + a count at the end, and the second one (reduce) takes sum/count pairs, sums them up, and emits the avg at the end. A sketch of this split follows after the results.

Various configurations are tested:

- testLongAvgSingleThreadedAvg: Running everything single-threaded with a single driver (for baseline performance)
- testLongAvgMultiThreadedAvgWithSingleThreadedSearch: Running the search part single-threaded, but parallelizing the numeric doc value extraction and avg computation
- testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch: Running the search part as well as the avg computation in parallel, using segment-level parallelism
- testLongAvgMultiThreadedAvgWithMultiThreadedSearch: Running the search part as well as the avg computation in parallel, using document-id-space-level parallelism (see also https://issues.apache.org/jira/browse/LUCENE-8675)

To understand the effect of the number of segments, the benchmark is run in two configurations (data force-merged to 1 segment, and data force-merged to 10 segments). Here are the results (from my MacBook Pro with 8 cores, albeit imprecise due to the warm temperatures in my office today with the extreme heat):

```
Benchmark                                                                    (maxNumSegments)  (numDocs)  Mode  Cnt    Score     Error  Units
OperatorBenchmark.testLongAvgSingleThreadedAvg                                              1  100000000  avgt    3  664.127 ±  63.200  ms/op
OperatorBenchmark.testLongAvgSingleThreadedAvg                                             10  100000000  avgt    3  654.669 ±  88.197  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithSingleThreadedSearch                       1  100000000  avgt    3  153.785 ±  69.273  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithSingleThreadedSearch                      10  100000000  avgt    3  161.570 ± 172.318  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch                 1  100000000  avgt    3  687.172 ±  41.166  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch                10  100000000  avgt    3  168.887 ±  81.306  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSearch                        1  100000000  avgt    3  111.377 ±  60.332  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSearch                       10  100000000  avgt    3  111.535 ±  87.793  ms/op
```

Some explanations for the results observed:

- Even when keeping the search part single-threaded, it is useful to parallelize the aggregations running on top.
- The aggregations are very lightweight in this benchmark, so even with enough cores the single-threaded search might still be the bottleneck (as it is a match-all query, the bottleneck in this case is the creation of the arrays that store the doc ids).
- Fully parallelizing things (i.e. the search part as well) can make things even faster. For segment-level parallelism this obviously only works when there are multiple segments. If you have only a single segment, you can still parallelize only the aggregation bits, or you can partition by id space (which interferes with optimizations that leverage segment-level information).
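The following is a minimal, self-contained sketch of the map/reduce avg split described above; the class and method names (`AvgMapReduceSketch`, `mapPhase`, `reducePhase`, `SumAndCount`) are illustrative only and are not the operator API from the PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of the two-phase avg described above: the "map" phase
// emits partial (sum, count) pairs per slice, and the "reduce" phase combines
// them into the final average.
public class AvgMapReduceSketch {

    // Partial result emitted by the map phase.
    record SumAndCount(long sum, long count) {}

    // Map phase: consumes one slice of raw long values and emits a sum/count pair.
    static SumAndCount mapPhase(long[] slice) {
        long sum = 0;
        for (long v : slice) {
            sum += v;
        }
        return new SumAndCount(sum, slice.length);
    }

    // Reduce phase: sums up the partial results and emits the avg.
    static double reducePhase(List<SumAndCount> partials) {
        long sum = 0;
        long count = 0;
        for (SumAndCount p : partials) {
            sum += p.sum();
            count += p.count();
        }
        return count == 0 ? Double.NaN : (double) sum / count;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the values of a long field, already split into slices
        // (by doc-id range or by segment, as in the PR description).
        long[][] slices = {
            {1, 2, 3, 4},
            {5, 6, 7},
            {8, 9, 10}
        };

        // Run the map phase for each slice in parallel, then reduce single-threaded.
        ExecutorService executor = Executors.newFixedThreadPool(slices.length);
        try {
            List<Future<SumAndCount>> futures = new ArrayList<>();
            for (long[] slice : slices) {
                futures.add(executor.submit(() -> mapPhase(slice)));
            }
            List<SumAndCount> partials = new ArrayList<>();
            for (Future<SumAndCount> f : futures) {
                partials.add(f.get());
            }
            System.out.println("avg = " + reducePhase(partials)); // prints 5.5
        } finally {
            executor.shutdown();
        }
    }
}
```

Because only small (sum, count) pairs cross thread boundaries, the reduce step stays cheap regardless of how many slices the map phase is split into.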
costin pushed a commit that referenced this issue on Dec 6, 2022:
…198) This PR adds the basic infrastructure for turning a physical plan into a list of drivers that can (locally) execute the given physical plan. It adds the PlanNode class to represent a physical plan (which is just a tree / digraph of PlanNode objects). The PR assumes that this physical plan makes sense (i.e. it does not do any kind of extra validation).

It then implements a LocalExecutionPlanner to turn the given plan into a list of drivers, allowing parallel execution of the given plan (insofar as parallelism has been designed into the plan). It covers all the parallel executions explored as part of #189, showing the flexibility of the planner.
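A rough sketch of the general idea of walking a plan tree and emitting one driver per parallel slice; the `Node`, `Driver`, and `plan` names here are hypothetical and do not reflect the actual PlanNode/LocalExecutionPlanner classes, which also wire operators together rather than just naming them:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a tiny "physical plan" tree and a planner that walks it
// and emits one runnable driver per parallel slice of each node.
public class PlannerSketch {

    // A plan node is a named operation with child nodes (a tree/digraph).
    record Node(String op, int parallelism, List<Node> children) {}

    // A driver is the unit of local, single-threaded execution produced by the planner.
    record Driver(String description) implements Runnable {
        @Override
        public void run() {
            System.out.println("running driver: " + description);
        }
    }

    // Walk the tree bottom-up; a node with parallelism N contributes N drivers.
    static List<Driver> plan(Node node) {
        List<Driver> drivers = new ArrayList<>();
        for (Node child : node.children()) {
            drivers.addAll(plan(child));
        }
        for (int i = 0; i < node.parallelism(); i++) {
            drivers.add(new Driver(node.op() + " [slice " + i + "]"));
        }
        return drivers;
    }

    public static void main(String[] args) {
        // e.g. a parallel source feeding parallel partial avgs, reduced by one driver
        Node source = new Node("lucene-source", 4, List.of());
        Node partialAvg = new Node("avg-map", 4, List.of(source));
        Node finalAvg = new Node("avg-reduce", 1, List.of(partialAvg));

        plan(finalAvg).forEach(Runnable::run);
    }
}
```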
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue on Oct 2, 2023:
With this commit we use pyenv to manage the Python installation for our Vagrant workflow. We also switch to Python 3.6 as our default Python version.
This issue was closed.
A new plugin, called hadoop, to provide support for using HDFS as the gateway storage. Configuration is simple:

- `gateway.type` - set to `hdfs`.
- `gateway.hdfs.uri` - set to something like `hdfs://hostname:port`.
- `gateway.hdfs.path` - the path to store the data under.
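A minimal sketch of what these settings could look like in `elasticsearch.yml`; the hostname, port, and path values below are placeholders, not values from the issue:

```yaml
# Use HDFS as the gateway storage (URI and path values are placeholders)
gateway.type: hdfs
gateway.hdfs.uri: hdfs://namenode-host:9000
gateway.hdfs.path: /elasticsearch/gateway
```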