Hadoop Plugin: Use HDFS as gateway storage #189
Comments
Hadoop Plugin: Use HDFS as gateway storage, closed by 28fa384.
costin pushed a commit that referenced this issue on Dec 6, 2022:
This PR allows the Lucene source operator (that runs the search) to be parallelized, either by slicing on the document id space or by slicing on the segment space. It also comes with a bunch of benchmarks to show the effects of running in various configurations.

The experiment I looked at was just running the calculation of an average on a long field. To allow for parallelization, the avg operator comes in two flavors, allowing a map/reduce pattern: the first one (map) takes raw input (the numbers) and emits a sum + a count at the end, and the second one (reduce) takes sum/count pairs, sums them up, and emits the avg at the end. A sketch of this split follows after the results.

Various configurations are tested:

- testLongAvgSingleThreadedAvg: Running everything single-threaded with a single driver (for baseline performance)
- testLongAvgMultiThreadedAvgWithSingleThreadedSearch: Running the search part single-threaded, but parallelizing the numeric doc value extraction and avg computation
- testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch: Running the search part as well as the avg computation in parallel, using segment-level parallelism
- testLongAvgMultiThreadedAvgWithMultiThreadedSearch: Running the search part as well as the avg computation in parallel, using document-id-space-level parallelism (see also https://issues.apache.org/jira/browse/LUCENE-8675)

To understand the effect of the number of segments, the benchmark is run in two configurations (data force-merged to 1 segment, and data force-merged to 10 segments). Here are the results (from my MacBook Pro with 8 cores, albeit imprecise due to the warm temperatures in my office today with the extreme heat):

```
Benchmark                                                                    (maxNumSegments)  (numDocs)  Mode  Cnt    Score     Error  Units
OperatorBenchmark.testLongAvgSingleThreadedAvg                                              1  100000000  avgt    3  664.127 ±  63.200  ms/op
OperatorBenchmark.testLongAvgSingleThreadedAvg                                             10  100000000  avgt    3  654.669 ±  88.197  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithSingleThreadedSearch                       1  100000000  avgt    3  153.785 ±  69.273  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithSingleThreadedSearch                      10  100000000  avgt    3  161.570 ± 172.318  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch                 1  100000000  avgt    3  687.172 ±  41.166  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch                10  100000000  avgt    3  168.887 ±  81.306  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSearch                        1  100000000  avgt    3  111.377 ±  60.332  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSearch                       10  100000000  avgt    3  111.535 ±  87.793  ms/op
```

Some explanations for the results observed:

- Even when keeping the search part single-threaded, it is useful to parallelize the aggregations running on top.
- The aggregations are very lightweight in this benchmark, so even with enough cores the single-threaded search might still be the bottleneck (as it is a match-all query, the bottleneck in this case is the creation of the arrays that store the doc ids).
- Fully parallelizing things (i.e. the search part as well) can make things even faster. For segment-level parallelism this obviously only works when there are multiple segments. If you have only a single segment, you can still parallelize only the aggregation bits, or you can partition by id space (which interferes with optimizations that leverage segment-level information).
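The following is a minimal, self-contained sketch of the map/reduce avg split described above; the class and method names (`AvgMapReduceSketch`, `mapPhase`, `reducePhase`, `SumAndCount`) are illustrative only and are not the operator API from the PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of the two-phase avg described above: the "map" phase
// emits partial (sum, count) pairs per slice, and the "reduce" phase combines
// them into the final average.
public class AvgMapReduceSketch {

    // Partial result emitted by the map phase.
    record SumAndCount(long sum, long count) {}

    // Map phase: consumes one slice of raw long values and emits a sum/count pair.
    static SumAndCount mapPhase(long[] slice) {
        long sum = 0;
        for (long v : slice) {
            sum += v;
        }
        return new SumAndCount(sum, slice.length);
    }

    // Reduce phase: sums up the partial results and emits the avg.
    static double reducePhase(List<SumAndCount> partials) {
        long sum = 0;
        long count = 0;
        for (SumAndCount p : partials) {
            sum += p.sum();
            count += p.count();
        }
        return count == 0 ? Double.NaN : (double) sum / count;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the values of a long field, already split into slices
        // (by doc-id range or by segment, as in the PR description).
        long[][] slices = {
            {1, 2, 3, 4},
            {5, 6, 7},
            {8, 9, 10}
        };

        // Run the map phase for each slice in parallel, then reduce single-threaded.
        ExecutorService executor = Executors.newFixedThreadPool(slices.length);
        try {
            List<Future<SumAndCount>> futures = new ArrayList<>();
            for (long[] slice : slices) {
                futures.add(executor.submit(() -> mapPhase(slice)));
            }
            List<SumAndCount> partials = new ArrayList<>();
            for (Future<SumAndCount> f : futures) {
                partials.add(f.get());
            }
            System.out.println("avg = " + reducePhase(partials)); // prints 5.5
        } finally {
            executor.shutdown();
        }
    }
}
```

Because only small (sum, count) pairs cross thread boundaries, the reduce step stays cheap regardless of how many slices the map phase is split into.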
costin pushed a commit that referenced this issue on Dec 6, 2022:
…198) This PR adds the basic infrastructure for turning a physical plan into a list of drivers that can (locally) execute the given physical plan. It adds the PlanNode class to represent a physical plan (which is just a tree / digraph of PlanNode objects). The PR assumes that this physical plan makes sense (i.e. it does not do any kind of extra validation).

It then implements a LocalExecutionPlanner to turn the given plan into a list of drivers, allowing parallel execution of the given plan (insofar as parallelism has been designed into the plan). It covers all the parallel executions explored as part of #189, showing the flexibility of the planner.
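A rough sketch of the general idea of walking a plan tree and emitting one driver per parallel slice; the `Node`, `Driver`, and `plan` names here are hypothetical and do not reflect the actual PlanNode/LocalExecutionPlanner classes, which also wire operators together rather than just naming them:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a tiny "physical plan" tree and a planner that walks it
// and emits one runnable driver per parallel slice of each node.
public class PlannerSketch {

    // A plan node is a named operation with child nodes (a tree/digraph).
    record Node(String op, int parallelism, List<Node> children) {}

    // A driver is the unit of local, single-threaded execution produced by the planner.
    record Driver(String description) implements Runnable {
        @Override
        public void run() {
            System.out.println("running driver: " + description);
        }
    }

    // Walk the tree bottom-up; a node with parallelism N contributes N drivers.
    static List<Driver> plan(Node node) {
        List<Driver> drivers = new ArrayList<>();
        for (Node child : node.children()) {
            drivers.addAll(plan(child));
        }
        for (int i = 0; i < node.parallelism(); i++) {
            drivers.add(new Driver(node.op() + " [slice " + i + "]"));
        }
        return drivers;
    }

    public static void main(String[] args) {
        // e.g. a parallel source feeding parallel partial avgs, reduced by one driver
        Node source = new Node("lucene-source", 4, List.of());
        Node partialAvg = new Node("avg-map", 4, List.of(source));
        Node finalAvg = new Node("avg-reduce", 1, List.of(partialAvg));

        plan(finalAvg).forEach(Runnable::run);
    }
}
```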
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue on Oct 2, 2023:
With this commit we use pyenv to manage the Python installation for our Vagrant workflow. We also switch to Python 3.6 as our default Python version.
This issue was closed.
A new plugin, called hadoop, to provide support for using HDFS as the gateway storage. Configuration is simple:

- `gateway.type` - set to `hdfs`.
- `gateway.hdfs.uri` - set to something like `hdfs://hostname:port`.
- `gateway.hdfs.path` - the path to store the data under.
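A minimal sketch of what these settings could look like in `elasticsearch.yml`; the hostname, port, and path values below are placeholders, not values from the issue:

```yaml
# Use HDFS as the gateway storage (URI and path values are placeholders)
gateway.type: hdfs
gateway.hdfs.uri: hdfs://namenode-host:9000
gateway.hdfs.path: /elasticsearch/gateway
```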