Let search phases override max concurrent requests #26484

Merged
jasontedor merged 7 commits into elastic:master from the can-match-stack-overflow branch on Sep 13, 2017

Conversation

jasontedor
Member

@jasontedor jasontedor commented Sep 3, 2017

If the query coordinating node is also a data node that holds all the shards for a search request, we can end up recursing through the can match phase (because we send a local request and, on response, the listener moves to the next shard and does this again, without ever having returned from the previous shards). This recursion can lead to a stack overflow for even a reasonable number of indices (daily indices over sixty days with five shards per day is enough to trigger the stack overflow). Moreover, all of this execution would be happening on a network thread (the thread that initially received the query). With this commit, we allow search phases to override max concurrent requests. This allows the can match phase to avoid recursing through the shards towards a stack overflow.

Closes #26198
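
To make the recursion in the description concrete, here is a minimal sketch in plain Java. It is not the Elasticsearch implementation: the class `CanMatchRecursionSketch` and its methods are hypothetical, a local shard request is modeled as an ordinary synchronous call, and the concurrency limit of five used in the note below is an arbitrary example value. The sketch only measures how deeply shard executions nest when each response handler starts the next shard before returning.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of an initial search phase; not Elasticsearch code.
final class CanMatchRecursionSketch {
    private final int totalShards;
    private final int maxConcurrentShardRequests;
    private final AtomicInteger shardExecutionIndex = new AtomicInteger(0);
    private int depth;     // current nesting of executeOnShard calls
    private int maxDepth;  // deepest nesting observed during the run

    private CanMatchRecursionSketch(int totalShards, int maxConcurrentShardRequests) {
        this.totalShards = totalShards;
        // Assumption: the limit is capped at the number of shards.
        this.maxConcurrentShardRequests = Math.min(maxConcurrentShardRequests, totalShards);
    }

    /** Runs the simulated phase and returns the deepest nesting of shard executions. */
    static int maxCallbackDepth(int totalShards, int maxConcurrentShardRequests) {
        CanMatchRecursionSketch phase = new CanMatchRecursionSketch(totalShards, maxConcurrentShardRequests);
        phase.run();
        return phase.maxDepth;
    }

    private void run() {
        // Shards below the limit are started by this loop; the rest are started from responses.
        shardExecutionIndex.set(maxConcurrentShardRequests);
        for (int shard = 0; shard < maxConcurrentShardRequests; shard++) {
            executeOnShard(shard);
        }
    }

    private void executeOnShard(int shard) {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        // Local request: the response is delivered on the same thread, before this call returns.
        onShardResponse();
        depth--;
    }

    private void onShardResponse() {
        int next = shardExecutionIndex.getAndIncrement();
        if (next < totalShards) {
            // The listener moves on to the next shard without unwinding the stack.
            executeOnShard(next);
        }
    }
}
```

Under these assumptions, `CanMatchRecursionSketch.maxCallbackDepth(300, 5)` (sixty daily indices with five shards each) returns 296, because every shard beyond the first five is started from inside the previous shard's response handler. `CanMatchRecursionSketch.maxCallbackDepth(300, 300)` returns 1, because the initial loop starts every shard and no response handler has anything left to start.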

@jasontedor jasontedor added the :Search/Search, >bug, review, v5.6.0, v6.0.0, v6.1.0, and v7.0.0 labels Sep 3, 2017
@jasontedor jasontedor requested a review from s1monw September 3, 2017 20:18
@jasontedor jasontedor removed the v5.6.0 label Sep 3, 2017
@jasontedor jasontedor force-pushed the can-match-stack-overflow branch 2 times, most recently from aee166b to c0a0f10 Compare September 3, 2017 20:40
@jasontedor jasontedor changed the title from "Fork can match requests to the generic thread pool" to "Fork can match requests to the search thread pool" Sep 3, 2017
@jasontedor
Member Author

This is fine in 5.6.0; the requests are already forked to the search thread pool there.

If the query coordinating node is also a data node that holds all the
shards for a search request, we can end up recursing through the can
match phase (because we send a local request and on response in the
listener move to the next shard and do this again, without ever having
returned from previous shards). This recursion can lead to stack
overflow for even a reasonable number of indices (daily indices over
sixty days with five shards per day is enough to trigger the stack
overflow). Moreover, all this execution would be happening on a network
thread (the thread that initially received the query). With this commit,
we fork can match requests to the search thread pool to prevent this.
@jasontedor jasontedor force-pushed the can-match-stack-overflow branch from c0a0f10 to 4fcaf80 Compare September 3, 2017 20:47
Contributor

@bleskes bleskes left a comment

LGTM

* up for an extended period of time). This test is for that situation.
*/
public void testAvoidStackOverflow() throws InterruptedException {
final String node = internalCluster().startDataOnlyNode(Settings.builder().put("node.attr.color", "blue").build());
Contributor

Why do we need to start a node? Can't we use an existing node and use its _name for allocation filtering?

* query, the can match phase would recurse and end in stack overflow (and this thread would be a networking thread, tying such a thread
* up for an extended period of time). This test is for that situation.
*/
public void testAvoidStackOverflow() throws InterruptedException {
Contributor

Nit: this isn't really a can match test but more of a "can we run with lots of shards on nodes" test. I think it's good to have, but maybe we can fold it into one of the generic search IT suites? Maybe SimpleSearchIT (although nothing is simple in life ;)

Settings.builder().put("index.routing.allocation.include.color", "blue").put("index.number_of_shards", 640);
client().admin().indices().create(new CreateIndexRequest("index").settings(settings)).actionGet();
// it can take a long time for all the shards to allocate and initialize
ensureGreen(TimeValue.timeValueSeconds(Long.MAX_VALUE));
Contributor

This is worrisome... How long does this test normally run? Should we fall back to mocking?

Member Author

This test is removed; I am now using mocking.

Contributor

@s1monw s1monw left a comment

I think we should fix this differently. This is only an issue because we run into maxConcurrentShardRequests, which is unnecessary for the can_match action. I wonder if we should instead opt out of it entirely; otherwise we are subject to rejections, and avoiding those was partially the purpose of the can_match change.

@jasontedor
Member Author

@s1monw For this reason, when I initially submitted the pull request I forked these requests to the generic thread pool. After seeing that in the 5.6 branch these requests are forked to the search thread pool, I followed your lead there. I'll look into opting out of max concurrent shard requests, and I'll reach out later this week to discuss all of our options.

@s1monw
Contributor

s1monw commented Sep 5, 2017

> @s1monw For this reason when I initially submitted the pull request I forked these requests to the generic thread pool. After seeing that in the 5.6 branch these requests are forked to the search thread pool I followed your lead there.

Let me provide some history here, since it's not obvious: in 5.6 we still fetch resources in the rewrite phase for some queries. I fixed this in #25791, which allows the rewrite phase to execute on the network thread. Executing on the search thread pool would again cause potential rejections, which we tried to prevent by moving to the network thread pool.

> I'll look into opting out of max concurrent shard requests and I'll reach out later this week to discuss all of our options.

Looking forward to your suggestions
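
The rejections discussed in this thread are what a bounded executor does when its threads are busy and its queue is full. The sketch below illustrates that behavior with the JDK's `ThreadPoolExecutor` only. It does not use Elasticsearch's own thread pool classes, the class and method names are hypothetical, and the pool and queue sizes are arbitrary. It assumes the search thread pool behaves like a pool with a fixed number of threads and a bounded queue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of rejections from a bounded pool; not Elasticsearch code.
final class BoundedPoolRejectionSketch {

    /** Submits blocking tasks to a bounded pool and returns how many submissions were rejected. */
    static int countRejections(int threads, int queueCapacity, int submissions) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(queueCapacity));
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < submissions; i++) {
                try {
                    // Each task blocks until released, so threads and queue slots stay occupied.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // The default policy (AbortPolicy) throws once threads and queue are both full.
                    rejected++;
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
            try {
                pool.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }
}
```

With one thread and a queue of two, `BoundedPoolRejectionSketch.countRejections(1, 2, 10)` accepts three tasks (one running, two queued) and rejects the other seven. A phase that forks one task per shard onto such a pool can therefore fail under load, which is the concern raised here about running can_match on the search thread pool.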

…rflow

* origin/master: (59 commits)
  Fix Lucene version of 5.6.1.
  Remove azure deprecated settings (elastic#26099)
  Handle the 5.6.0 release
  Allow plugins to validate cluster-state on join (elastic#26595)
  Remove index mapper dynamic settings (elastic#25734)
  update AWS SDK for ECS Task IAM support in discovery-ec2 (elastic#26479)
  Azure repository: Accelerate the listing of files (used in delete snapshot) (elastic#25710)
  Build: Remove norelease from forbidden patterns (elastic#26592)
  Fix reference to painless inside expression engine (elastic#26528)
  Build: Move javadoc linking to root build.gradle (elastic#26529)
  Test: Remove leftover static bwc test case (elastic#26584)
  Docs: Remove remaining references to file and native scripts (elastic#26580)
  Snapshot fallback should consider build.snapshot
  elastic#26496: Set the correct bwc version after backport to 6.x
  Fix the MapperFieldType.rangeQuery API. (elastic#26552)
  Deduplicate `_field_names`. (elastic#26550)
  [Docs] Update method setSource(byte[] source) (elastic#26561)
  [Docs] Fix typo in javadocs (elastic#26556)
  Allow multiple digits in Vagrant 2.x minor versions
  Support Vagrant 2.x
  ...
@jasontedor
Member Author

@bleskes @s1monw I have updated this pull request, would you please take a look?

@jasontedor jasontedor changed the title from "Fork can match requests to the search thread pool" to "Let search phases override max concurrent requests" Sep 13, 2017
Contributor

@s1monw s1monw left a comment

LGTM. Left 2 minor comments. Thanks for doing this.

@@ -52,7 +52,12 @@
     private final AtomicInteger shardExecutionIndex = new AtomicInteger(0);
     private final int maxConcurrentShardRequests;

-    InitialSearchPhase(String name, SearchRequest request, GroupShardsIterator<SearchShardIterator> shardsIts, Logger logger) {
+    InitialSearchPhase(
Contributor

Can we maybe not wrap on every parameter, please?

@@ -57,7 +53,7 @@
             SearchTask task, Function<GroupShardsIterator<SearchShardIterator>, SearchPhase> phaseFactory) {
         super("can_match", logger, searchTransportService, nodeIdToConnection, aliasFilter, concreteIndexBoosts, executor, request,
                 listener,
-                shardsIts, timeProvider, clusterStateVersion, task, new BitSetSearchPhaseResults(shardsIts.size()));
+                shardsIts, timeProvider, clusterStateVersion, task, new BitSetSearchPhaseResults(shardsIts.size()), shardsIts.size());
Contributor

Can we leave a comment explaining why we do this here?
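
For readers following the diff, the shape of the change, together with the kind of comment being requested, might look roughly like the sketch below. Only the idea of passing the shard count as the concurrency limit comes from the diff. The class names, constructor signatures, and the capping with `Math.min` are assumptions made for illustration and do not reproduce the merged code.

```java
// Hypothetical names throughout; this is not the Elasticsearch class hierarchy.
abstract class InitialPhaseSketch {
    private final String name;
    private final int maxConcurrentShardRequests;

    InitialPhaseSketch(String name, int shardCount, int maxConcurrentShardRequests) {
        this.name = name;
        // Assumption: the effective limit never exceeds the number of shards.
        this.maxConcurrentShardRequests = Math.min(maxConcurrentShardRequests, shardCount);
    }

    String name() {
        return name;
    }

    int maxConcurrentShardRequests() {
        return maxConcurrentShardRequests;
    }
}

final class CanMatchPhaseSketch extends InitialPhaseSketch {
    CanMatchPhaseSketch(int shardCount) {
        // We pass the shard count as the limit so that every shard is started by the
        // initial loop. Otherwise, with local shards, each response would start the
        // next shard from inside its listener and the calls would nest per shard.
        super("can_match", shardCount, shardCount);
    }
}

final class ThrottledPhaseSketch extends InitialPhaseSketch {
    ThrottledPhaseSketch(int shardCount, int requestedLimit) {
        // Other phases keep whatever limit the request asked for.
        super("query", shardCount, requestedLimit);
    }
}
```

In this sketch, `new CanMatchPhaseSketch(300).maxConcurrentShardRequests()` is 300, while `new ThrottledPhaseSketch(300, 5).maxConcurrentShardRequests()` stays at 5.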

@jasontedor jasontedor merged commit b3e7e85 into elastic:master Sep 13, 2017
jasontedor added a commit that referenced this pull request Sep 13, 2017
Relates #26484
jasontedor added a commit that referenced this pull request Sep 13, 2017
Relates #26484
@jasontedor jasontedor deleted the can-match-stack-overflow branch September 13, 2017 10:28
Contributor

@bleskes bleskes left a comment

Much cleaner. Thanks.

jasontedor added a commit that referenced this pull request Oct 31, 2017
Relates #26484
@dakrone dakrone added the v5.6.4 label Oct 31, 2017
@lcawl lcawl removed the v6.1.0 label Dec 12, 2017
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Labels
>bug, :Search/Search, v5.6.4, v6.0.0-rc1, v7.0.0-beta1

7 participants