Hadoop Plugin: Use HDFS as gateway storage #189

Closed
kimchy opened this issue May 23, 2010 · 1 comment

kimchy commented May 23, 2010

A new plugin, called hadoop, to provide support for using HDFS as the gateway storage. Configuration is simple (an example follows the list):

  • gateway.type - set to hdfs.
  • gateway.hdfs.uri - set to something like hdfs://hostname:port.
  • gateway.hdfs.path - the path to store the data under.
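
For example, a minimal sketch of the resulting settings; the hostname, port, and path values below are placeholders, not defaults:

```
# illustrative values only -- point these at your own HDFS cluster
gateway.type: hdfs
gateway.hdfs.uri: hdfs://namenode:9000
gateway.hdfs.path: /elasticsearch/gateway
```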

kimchy commented May 23, 2010

Hadoop Plugin: Use HDFS as gateway storage, closed by 28fa384.

costin pushed a commit that referenced this issue Dec 6, 2022
This PR allows the Lucene source operator (the one that runs the search) to be parallelized, either by slicing the document-id space or by slicing the segment space.

It also comes with a bunch of benchmarks to show the effects of running in various configurations.

The experiment I looked at was simply computing the average of a long field. To allow for parallelization, the avg operator comes in two flavors, following a map/reduce pattern: the first one (map) takes raw input (the numbers) and emits a sum and a count at the end; the second one (reduce) takes sum/count pairs, sums them up, and emits the average at the end.
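
As a hedged illustration of that two-flavor pattern (a minimal sketch; the class names AvgMap and AvgReduce are invented for this example and are not the PR's operator API):

```java
// Sketch of the map/reduce avg pattern described above; names are
// invented for illustration, not taken from the PR.
final class AvgMap {
    private long sum;
    private long count;

    void add(long value) {      // consume raw numbers
        sum += value;
        count++;
    }

    long[] emit() {             // emit a sum + count pair at the end
        return new long[] { sum, count };
    }
}

final class AvgReduce {
    private long sum;
    private long count;

    void add(long[] sumAndCount) {   // consume sum/count pairs from the map phase
        sum += sumAndCount[0];
        count += sumAndCount[1];
    }

    double emit() {                  // emit the final average
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```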

Various configurations are tested:
- testLongAvgSingleThreadedAvg: Running everything single-threaded with a single driver (for baseline performance)
- testLongAvgMultiThreadedAvgWithSingleThreadedSearch: Running the search part single-threaded, but then parallelize the numeric doc value extraction and avg computation
- testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch: Running the search part as well as the avg computation in parallel, using segment-level parallelism
- testLongAvgMultiThreadedAvgWithMultiThreadedSearch: Running the search part as well as the avg computation in parallel, using document-id-space-level parallelism (see also https://issues.apache.org/jira/browse/LUCENE-8675)

To understand the effect of number of segments, we're running the benchmark in two configurations (data force-merged to 1 segment, and data force-merged to 10 segments).

Here are the results (from my MacBook Pro with 8 cores; somewhat imprecise due to the extreme heat in my office today):

```
Benchmark                                                                    (maxNumSegments)  (numDocs)  Mode  Cnt    Score     Error  Units
OperatorBenchmark.testLongAvgSingleThreadedAvg                                              1  100000000  avgt    3  664.127 ±  63.200  ms/op
OperatorBenchmark.testLongAvgSingleThreadedAvg                                             10  100000000  avgt    3  654.669 ±  88.197  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithSingleThreadedSearch                       1  100000000  avgt    3  153.785 ±  69.273  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithSingleThreadedSearch                      10  100000000  avgt    3  161.570 ± 172.318  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch                 1  100000000  avgt    3  687.172 ±  41.166  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSegmentSearch                10  100000000  avgt    3  168.887 ±  81.306  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSearch                        1  100000000  avgt    3  111.377 ±  60.332  ms/op
OperatorBenchmark.testLongAvgMultiThreadedAvgWithMultiThreadedSearch                       10  100000000  avgt    3  111.535 ±  87.793  ms/op
```

Some explanations for the results observed:
- Even when keeping the search part single-threaded, it's useful to parallelize the aggregations running on top.
- The aggregations are very lightweight in this benchmark, so even with enough cores, the single-threaded search can still be the bottleneck (as it's a match-all query, the bottleneck in this case is the creation of the arrays that store the doc ids).
- Fully parallelizing things (i.e. the search part as well) can make things even faster. For segment-level parallelism, this obviously only works when you have multiple segments. If you only have a single segment, you can still parallelize just the aggregation bits, or you can partition by doc-id space (which will interfere with optimizations that leverage segment-level information); see the sketch below.
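
As a rough illustration of the segment-level variant, here is a minimal sketch assuming a Lucene IndexReader and an ExecutorService; the per-segment work is left as a comment and the class name is invented:

```java
// Minimal sketch of segment-level slicing: one task per segment (leaf).
// With data force-merged to a single segment, reader.leaves() has exactly
// one element, so there is nothing to fan out -- the case where doc-id
// space partitioning is the only way to parallelize the search itself.
import java.util.List;
import java.util.concurrent.ExecutorService;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

final class SegmentSlicer {
    static void forEachSegment(IndexReader reader, ExecutorService executor) {
        List<LeafReaderContext> leaves = reader.leaves();
        for (LeafReaderContext leaf : leaves) {
            executor.submit(() -> {
                // run the search + map-phase aggregation for this segment
            });
        }
    }
}
```
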
costin pushed a commit that referenced this issue Dec 6, 2022
…198)

This PR adds the basic infrastructure for turning a physical plan into a list of drivers that can (locally) execute that plan. It adds the PlanNode class to represent a physical plan (which is just a tree / digraph of PlanNode objects). The PR assumes that the physical plan makes sense (i.e. it does not do any kind of extra validation). It then implements a LocalExecutionPlanner to turn the plan into a list of drivers, allowing parallel execution (insofar as parallelism has been designed into the plan). It covers all the parallel executions explored as part of #189, showing the flexibility of the planner.
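
To make the shape of those pieces concrete, a hedged structural sketch; the interfaces and signatures below are assumptions for illustration, not the PR's actual API:

```java
// Hypothetical sketch of the planning pieces described above; names and
// signatures are assumptions, not the actual classes from the PR.
import java.util.List;

interface PlanNode {
    // A physical plan is just a tree / digraph of PlanNode objects.
    List<PlanNode> children();
}

interface Driver extends Runnable {
    // A unit of work that can (locally) execute part of the plan.
}

interface LocalExecutionPlanner {
    // Turns a physical plan into a list of drivers. The plan is assumed
    // to make sense; no extra validation is performed here.
    List<Driver> plan(PlanNode root);
}
```
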
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Oct 2, 2023
With this commit we use pyenv to manage the Python installation for our
Vagrant workflow. We also switch to Python 3.6 as our default Python
version.
This issue was closed.