Batch query phase shard level requests per data node #112306

Closed
javanna opened this issue Aug 28, 2024 · 3 comments
Labels
>feature, Meta, :Search Foundations/Search (Catch all for Search Foundations), Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch)

Comments

@javanna (Member) commented Aug 28, 2024

The query phase fans out to all shards, sending as many shard-level requests to the relevant data nodes as there are shards involved. Years ago, as part of the many-shards effort, we reworked the can_match phase to group shard-level requests per data node, in order to decrease the number of roundtrips required (including authorization) and the overhead at the transport level. We would like to do the same for the query phase. We want to start small and scope this to the query phase only (no DFS or query after DFS, no scroll), and only when the provided search request contains aggregations, because those are the requests that potentially go through many shards and use a significant amount of memory on the coordinating node.
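To make the per-node grouping concrete, here is a minimal sketch of the idea (not the actual Elasticsearch implementation; the ShardTarget and NodeQueryRequest types are hypothetical stand-ins for the real routing and transport classes):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the real shard routing and per-node request types.
record ShardTarget(String nodeId, String index, int shardId) {}
record NodeQueryRequest(String nodeId, List<ShardTarget> shards) {}

class QueryBatcher {
    /**
     * Groups the shard-level targets of a search by the data node that hosts them,
     * producing one batched request per node instead of one request per shard.
     */
    static List<NodeQueryRequest> batchPerNode(List<ShardTarget> targets) {
        Map<String, List<ShardTarget>> byNode = targets.stream()
                .collect(Collectors.groupingBy(ShardTarget::nodeId));
        return byNode.entrySet().stream()
                .map(e -> new NodeQueryRequest(e.getKey(), e.getValue()))
                .toList();
    }
}
```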

We expect that changing the execution model will provide better stability as well as better resource usage. Currently the coordinating node throttles execution to 5 (configurable) concurrent shard requests per data node. If we group shard-level requests into a single request per data node, each data node gains more context about the portion of the search request it is asked to execute, and can run its shard-level requests at its own pace, depending on its current load and so on. We have seen that the current throttling mechanism can be a bottleneck that prevents maximizing resource usage on data nodes. At the same time, this improvement would drastically reduce the number of network roundtrips for the query phase from a factor of the number of shards to a factor of the number of data nodes involved in the search request: for example, a query over 5,000 shards spread across 50 data nodes would go from 5,000 shard-level requests to 50 node-level requests.
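For context, the existing throttling (the max_concurrent_shard_requests setting, default 5) behaves roughly like the following simplified sketch; this is an illustration of the mechanism, not the actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * Simplified illustration of per-node throttling on the coordinating node:
 * at most maxConcurrentShardRequests shard-level requests are in flight per data node.
 */
class PerNodeThrottle {
    private final int maxConcurrentShardRequests;
    private final Map<String, Semaphore> permitsPerNode = new ConcurrentHashMap<>();

    PerNodeThrottle(int maxConcurrentShardRequests) {
        this.maxConcurrentShardRequests = maxConcurrentShardRequests;
    }

    /** Blocks until the target node has a free slot, then runs the shard-level request. */
    void execute(String nodeId, Runnable shardRequest) throws InterruptedException {
        Semaphore permits = permitsPerNode.computeIfAbsent(
                nodeId, n -> new Semaphore(maxConcurrentShardRequests));
        permits.acquire();
        try {
            shardRequest.run();
        } finally {
            permits.release();
        }
    }
}
```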

@javanna added the :Search Foundations/Search and >feature labels on Aug 28, 2024
@elasticsearchmachine added the Team:Search Foundations label on Aug 28, 2024
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@original-brownbear (Member) commented Jan 20, 2025

List of outstanding tasks (WIP to some degree, but filling in the details over the next few hours):

The main PR that will close this issue is #118490; the goal is to merge that PR.

What is absolutely needed to merge this PR:

Good to have, to reduce risk or to make full use of the work:

The main bottleneck after this change is the coordinating node's networking. We send an enormous amount of data over the wire, and resolving the targeted indices is very slow as well (in fact, for querying O(50K) shards it is by far the slowest step for most non-aggregation searches). There are a number of open pull requests already addressing this by optimizing the logic:

Future ideas/steps to build on top of this:

  • Remove the can_match phase for queries covered by batched execution; it is entirely redundant for them.
  • Align on Introduce async search API and log activation #120024. Currently scheduling on the search pool is rather random; we can do a better job here and save a lot of blocked time and heap by ordering execution better.
  • lz4 or otherwise compress responses to reduce the message size in case data-node-side reduction isn't enough to create a reasonably small response; use cases that don't partial-reduce well still compress extremely well at the byte level (see the sketch after this list).
  • Optimize partial reduce to cover more cases where possible.
  • Reuse Lucene data structures (mostly aggs leaf collectors) across shards to hard-bound their memory use (this would be an extremely impactful memory saver).
  • Make use of this logic for CCS as well (we might do that in the initial PR; let's discuss) and remove minimize_roundtrips (that's a future step).
  • Batch the fetch phase as well (but also paginate in case of large results), along with other possible optimizations around fetch:
    • run aggregations (or at least their reduction) concurrently with fetch
    • return the fetch response directly in more cases than the single-shard scenario we currently optimize; a simple next step here would be to leverage the new batching to fetch right away when a query only targets a single data node
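On the compression idea above, here is a minimal sketch of what byte-level compression of a serialized per-node response could look like, assuming the lz4-java library (illustration only; the actual transport integration would differ):

```java
import java.nio.charset.StandardCharsets;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

/**
 * Illustration: compress a serialized per-node query response before sending it
 * over the wire, and decompress it again on the coordinating node.
 */
class ResponseCompression {
    public static void main(String[] args) {
        // Pretend this is a large serialized per-node query response.
        byte[] serialized = "large serialized query response ...".getBytes(StandardCharsets.UTF_8);

        LZ4Factory factory = LZ4Factory.fastestInstance();
        LZ4Compressor compressor = factory.fastCompressor();
        byte[] compressed = compressor.compress(serialized);

        LZ4FastDecompressor decompressor = factory.fastDecompressor();
        byte[] restored = decompressor.decompress(compressed, serialized.length);

        System.out.printf("original=%d bytes, compressed=%d bytes, roundtrip ok=%b%n",
                serialized.length, compressed.length,
                java.util.Arrays.equals(serialized, restored));
    }
}
```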

@javanna (Member, Author) commented Apr 2, 2025

Implemented by #121885.

@javanna javanna closed this as completed Apr 2, 2025