Let search phases override max concurrent requests #26484
Conversation
aee166b to c0a0f10
This is fine in 5.6.0, the requests are forked to the search thread pool there already.
If the query coordinating node is also a data node that holds all the shards for a search request, we can end up recursing through the can match phase (because we send a local request and, on response, the listener moves to the next shard and does this again, without ever having returned from the previous shards). This recursion can lead to stack overflow for even a reasonable number of indices (daily indices over sixty days with five shards per day are enough to trigger the stack overflow). Moreover, all this execution would be happening on a network thread (the thread that initially received the query). With this commit, we fork can match requests to the search thread pool to prevent this.
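As a minimal sketch of the failure mode (hypothetical class and method names, not the actual Elasticsearch code): when a response is delivered synchronously on the calling thread, the listener-based loop nests one stack frame per shard, while forking the continuation to an executor lets each frame unwind first.

```java
import java.util.List;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

// Hypothetical sketch, not the real InitialSearchPhase; it only illustrates why a
// listener chain can overflow the stack when responses arrive synchronously, and
// how forking to an executor (e.g. the search thread pool) breaks the chain.
class CanMatchLoopSketch {
    private final List<String> shards;     // stand-in for the shard iterators
    private final Executor searchExecutor; // stand-in for the search thread pool

    CanMatchLoopSketch(List<String> shards, Executor searchExecutor) {
        this.shards = shards;
        this.searchExecutor = searchExecutor;
    }

    void run(int shardIndex) {
        if (shardIndex >= shards.size()) {
            return; // all shards processed
        }
        // sendCanMatch is a stand-in; on a node that holds the shard it may invoke
        // the listener before returning, i.e. synchronously on the calling thread.
        sendCanMatch(shards.get(shardIndex), canMatch -> {
            // Calling run(shardIndex + 1) directly here would nest a new frame inside
            // the current one for every shard -- thousands of shards, thousands of frames.
            // Forking lets the current frame unwind before the next shard is processed:
            searchExecutor.execute(() -> run(shardIndex + 1));
        });
    }

    private void sendCanMatch(String shard, Consumer<Boolean> listener) {
        listener.accept(true); // synchronous "local" response for the sketch
    }
}
```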
c0a0f10 to 4fcaf80
LGTM
 * up for an extended period of time). This test is for that situation.
 */
public void testAvoidStackOverflow() throws InterruptedException {
    final String node = internalCluster().startDataOnlyNode(Settings.builder().put("node.attr.color", "blue").build());
why do we need to start a node? can't we use an existing node and use its _name for allocation filtering?
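For reference, filtering by node name uses the built-in `_name` allocation filter rather than a custom attribute; a minimal sketch (the node name shown is hypothetical):

```java
// Hypothetical sketch: pin the index to an already-running data node by name
// (index.routing.allocation.include._name is a built-in allocation filter setting)
// instead of starting a dedicated node with a custom "color" attribute.
final String existingNodeName = "node_t0"; // assumption: obtained from the test cluster
final Settings indexSettings = Settings.builder()
        .put("index.routing.allocation.include._name", existingNodeName)
        .put("index.number_of_shards", 640)
        .build();
```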
 * query, the can match phase would recurse and end in stack overflow (and this thread would be a networking thread, tying such a thread
 * up for an extended period of time). This test is for that situation.
 */
public void testAvoidStackOverflow() throws InterruptedException {
nit - this isn't really a can match test but more a "can we run with lots of shards on nodes" test. I think it's good to have but maybe we can fold it into one of the generic search IT suites? maybe SimpleSearchIT
(although nothing is simple in life ;)
Settings.builder().put("index.routing.allocation.include.color", "blue").put("index.number_of_shards", 640);
client().admin().indices().create(new CreateIndexRequest("index").settings(settings)).actionGet();
// it can take a long time for all the shards to allocate and initialize
ensureGreen(TimeValue.timeValueSeconds(Long.MAX_VALUE));
this is worrisome... how long does this test run normally? should we fall back to mocking?
This test has been removed; I am now using mocking.
I think we should fix this differently. This is only an issue since we are running into the `maxConcurrentShardRequests` limit, which is unnecessary for this `can_match` action. I wonder if we should rather opt out of it entirely; otherwise we are subject to rejections, which the `can_match` change was partially meant to avoid.
@s1monw For this reason, when I initially submitted the pull request I forked these requests to the generic thread pool. After seeing that in the 5.6 branch these requests are forked to the search thread pool, I followed your lead there. I'll look into opting out of max concurrent shard requests and I'll reach out later this week to discuss all of our options.
lemme provide some history here since it's not obvious: in 5.6 we do still fetch resources in the rewrite phase for some queries. I fixed this in #25791, which allows us to execute the rewrite phase on the network thread. Executing on the search thread pool will again cause potential rejections, which we tried to prevent by moving to the network thread pool.
Looking forward to your suggestions.
…rflow

* origin/master: (59 commits)
  Fix Lucene version of 5.6.1.
  Remove azure deprecated settings (elastic#26099)
  Handle the 5.6.0 release
  Allow plugins to validate cluster-state on join (elastic#26595)
  Remove index mapper dynamic settings (elastic#25734)
  update AWS SDK for ECS Task IAM support in discovery-ec2 (elastic#26479)
  Azure repository: Accelerate the listing of files (used in delete snapshot) (elastic#25710)
  Build: Remove norelease from forbidden patterns (elastic#26592)
  Fix reference to painless inside expression engine (elastic#26528)
  Build: Move javadoc linking to root build.gradle (elastic#26529)
  Test: Remove leftover static bwc test case (elastic#26584)
  Docs: Remove remaining references to file and native scripts (elastic#26580)
  Snapshot fallback should consider build.snapshot
  elastic#26496: Set the correct bwc version after backport to 6.x
  Fix the MapperFieldType.rangeQuery API. (elastic#26552)
  Deduplicate `_field_names`. (elastic#26550)
  [Docs] Update method setSource(byte[] source) (elastic#26561)
  [Docs] Fix typo in javadocs (elastic#26556)
  Allow multiple digits in Vagrant 2.x minor versions
  Support Vagrant 2.x
  ...
LGTM, left 2 minors, thanks for doing this
@@ -52,7 +52,12 @@
     private final AtomicInteger shardExecutionIndex = new AtomicInteger(0);
     private final int maxConcurrentShardRequests;

-    InitialSearchPhase(String name, SearchRequest request, GroupShardsIterator<SearchShardIterator> shardsIts, Logger logger) {
+    InitialSearchPhase(
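The review view cuts the new signature off here. As a rough, hedged sketch of the shape of the change (parameter names and order are assumptions, not the committed diff), the constructor takes the limit explicitly so a phase can override it:

```java
// Sketch only -- parameter names and order are assumptions, not the committed diff.
InitialSearchPhase(String name,
                   SearchRequest request,
                   GroupShardsIterator<SearchShardIterator> shardsIts,
                   Logger logger,
                   int maxConcurrentShardRequests) {
    // ...
    // A phase that passes shardsIts.size() effectively opts out of throttling;
    // in any case we never need more in-flight requests than there are shards.
    this.maxConcurrentShardRequests = Math.min(maxConcurrentShardRequests, shardsIts.size());
}
```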
can we maybe not wrap on every parameter please?
@@ -57,7 +53,7 @@
             SearchTask task, Function<GroupShardsIterator<SearchShardIterator>, SearchPhase> phaseFactory) {
         super("can_match", logger, searchTransportService, nodeIdToConnection, aliasFilter, concreteIndexBoosts, executor, request,
                 listener,
-                shardsIts, timeProvider, clusterStateVersion, task, new BitSetSearchPhaseResults(shardsIts.size()));
+                shardsIts, timeProvider, clusterStateVersion, task, new BitSetSearchPhaseResults(shardsIts.size()), shardsIts.size());
can we leave a comment explaining why we do this here?
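For example, something along these lines would answer that inline (the wording is a suggestion, not the committed comment):

```java
// can_match is a cheap pre-filter round, so don't throttle it: passing
// shardsIts.size() as the max concurrent shard requests lets all shards be
// queried up front instead of one at a time from the response listener
// (which is what could recurse into a stack overflow).
shardsIts, timeProvider, clusterStateVersion, task,
        new BitSetSearchPhaseResults(shardsIts.size()), shardsIts.size());
```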
If the query coordinating node is also a data node that holds all the shards for a search request, we can end up recursing through the can match phase (because we send a local request and, on response, the listener moves to the next shard and does this again, without ever having returned from the previous shards). This recursion can lead to stack overflow for even a reasonable number of indices (daily indices over sixty days with five shards per day are enough to trigger the stack overflow). Moreover, all this execution would be happening on a network thread (the thread that initially received the query). With this commit, we allow search phases to override max concurrent requests. This allows the can match phase to avoid recursing through the shards towards a stack overflow. Relates #26484
Much cleaner. Thanks.
If the query coordinating node is also a data node that holds all the shards for a search request, we can end up recursing through the can match phase (because we send a local request and, on response, the listener moves to the next shard and does this again, without ever having returned from the previous shards). This recursion can lead to stack overflow for even a reasonable number of indices (daily indices over sixty days with five shards per day are enough to trigger the stack overflow). Moreover, all this execution would be happening on a network thread (the thread that initially received the query). With this commit, we allow search phases to override max concurrent requests. This allows the can match phase to avoid recursing through the shards towards a stack overflow.
Closes #26198