Let search phases override max concurrent requests #26484
Conversation
aee166b to c0a0f10
This is fine in 5.6.0, the requests are forked to the search thread pool there already.
If the query coordinating node is also a data node that holds all the shards for a search request, we can end up recursing through the can match phase (because we send a local request and, on response, the listener moves to the next shard and does this again, without ever having returned from the previous shards). This recursion can lead to stack overflow for even a reasonable number of indices (daily indices over sixty days with five shards per day are enough to trigger the stack overflow). Moreover, all this execution would be happening on a network thread (the thread that initially received the query). With this commit, we fork can match requests to the search thread pool to prevent this.
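As a minimal sketch of the failure mode (hypothetical class and method names, not the actual Elasticsearch code): when a response is delivered synchronously on the calling thread, the listener-based loop nests one stack frame per shard, while forking the continuation to an executor lets each frame unwind first.

```java
import java.util.List;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

// Hypothetical sketch, not the real InitialSearchPhase; it only illustrates why a
// listener chain can overflow the stack when responses arrive synchronously, and
// how forking to an executor (e.g. the search thread pool) breaks the chain.
class CanMatchLoopSketch {
    private final List<String> shards;     // stand-in for the shard iterators
    private final Executor searchExecutor; // stand-in for the search thread pool

    CanMatchLoopSketch(List<String> shards, Executor searchExecutor) {
        this.shards = shards;
        this.searchExecutor = searchExecutor;
    }

    void run(int shardIndex) {
        if (shardIndex >= shards.size()) {
            return; // all shards processed
        }
        // sendCanMatch is a stand-in; on a node that holds the shard it may invoke
        // the listener before returning, i.e. synchronously on the calling thread.
        sendCanMatch(shards.get(shardIndex), canMatch -> {
            // Calling run(shardIndex + 1) directly here would nest a new frame inside
            // the current one for every shard -- thousands of shards, thousands of frames.
            // Forking lets the current frame unwind before the next shard is processed:
            searchExecutor.execute(() -> run(shardIndex + 1));
        });
    }

    private void sendCanMatch(String shard, Consumer<Boolean> listener) {
        listener.accept(true); // synchronous "local" response for the sketch
    }
}
```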
c0a0f10 to 4fcaf80
LGTM
 * up for an extended period of time). This test is for that situation.
 */
public void testAvoidStackOverflow() throws InterruptedException {
    final String node = internalCluster().startDataOnlyNode(Settings.builder().put("node.attr.color", "blue").build());
why do we need to start a node? can't we use an existing node and use its _name for allocation filtering?
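For reference, filtering by node name uses the built-in `_name` allocation filter rather than a custom attribute; a minimal sketch (the node name shown is hypothetical):

```java
// Hypothetical sketch: pin the index to an already-running data node by name
// (index.routing.allocation.include._name is a built-in allocation filter setting)
// instead of starting a dedicated node with a custom "color" attribute.
final String existingNodeName = "node_t0"; // assumption: obtained from the test cluster
final Settings indexSettings = Settings.builder()
        .put("index.routing.allocation.include._name", existingNodeName)
        .put("index.number_of_shards", 640)
        .build();
```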
 * query, the can match phase would recurse and end in stack overflow (and this thread would be a networking thread, tying such a thread
 * up for an extended period of time). This test is for that situation.
 */
public void testAvoidStackOverflow() throws InterruptedException {
nit - this isn't really a can match test but more a "can we run with lots of shards on nodes" test. I think it's good to have but maybe we can fold it into one of the generic search IT suites? maybe SimpleSearchIT
(although nothing is simple in life ;)
Settings.builder().put("index.routing.allocation.include.color", "blue").put("index.number_of_shards", 640);
client().admin().indices().create(new CreateIndexRequest("index").settings(settings)).actionGet();
// it can take a long time for all the shards to allocate and initialize
ensureGreen(TimeValue.timeValueSeconds(Long.MAX_VALUE));
this is worrisome... how long does this test run normally? should we fall back to mocking?
This test has been removed; I am now using mocking.
I think we should fix this differently. This is only an issue since we are running into the `maxConcurrentShardRequests` limit, which is unnecessary for this `can_match` action. I wonder if we should rather opt out of it entirely; otherwise we are subject to rejections, which the `can_match` change was partially meant to avoid.
@s1monw For this reason, when I initially submitted the pull request I forked these requests to the generic thread pool. After seeing that in the 5.6 branch these requests are forked to the search thread pool, I followed your lead there. I'll look into opting out of max concurrent shard requests and I'll reach out later this week to discuss all of our options.
lemme provide some history here since it's not obvious: in 5.6 we do still fetch resources in the rewrite phase for some queries. I fixed this in #25791, which allows us to execute the rewrite phase on the network thread. Executing on the search thread pool will again cause potential rejections, which we tried to prevent by moving to the network thread pool.
Looking forward to your suggestions.
…rflow

* origin/master: (59 commits)
  Fix Lucene version of 5.6.1.
  Remove azure deprecated settings (elastic#26099)
  Handle the 5.6.0 release
  Allow plugins to validate cluster-state on join (elastic#26595)
  Remove index mapper dynamic settings (elastic#25734)
  update AWS SDK for ECS Task IAM support in discovery-ec2 (elastic#26479)
  Azure repository: Accelerate the listing of files (used in delete snapshot) (elastic#25710)
  Build: Remove norelease from forbidden patterns (elastic#26592)
  Fix reference to painless inside expression engine (elastic#26528)
  Build: Move javadoc linking to root build.gradle (elastic#26529)
  Test: Remove leftover static bwc test case (elastic#26584)
  Docs: Remove remaining references to file and native scripts (elastic#26580)
  Snapshot fallback should consider build.snapshot
  elastic#26496: Set the correct bwc version after backport to 6.x
  Fix the MapperFieldType.rangeQuery API. (elastic#26552)
  Deduplicate `_field_names`. (elastic#26550)
  [Docs] Update method setSource(byte[] source) (elastic#26561)
  [Docs] Fix typo in javadocs (elastic#26556)
  Allow multiple digits in Vagrant 2.x minor versions
  Support Vagrant 2.x
  ...
LGTM, left 2 minors, thanks for doing this
@@ -52,7 +52,12 @@
     private final AtomicInteger shardExecutionIndex = new AtomicInteger(0);
     private final int maxConcurrentShardRequests;

-    InitialSearchPhase(String name, SearchRequest request, GroupShardsIterator<SearchShardIterator> shardsIts, Logger logger) {
+    InitialSearchPhase(
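The review view cuts the new signature off here. As a rough, hedged sketch of the shape of the change (parameter names and order are assumptions, not the committed diff), the constructor takes the limit explicitly so a phase can override it:

```java
// Sketch only -- parameter names and order are assumptions, not the committed diff.
InitialSearchPhase(String name,
                   SearchRequest request,
                   GroupShardsIterator<SearchShardIterator> shardsIts,
                   Logger logger,
                   int maxConcurrentShardRequests) {
    // ...
    // A phase that passes shardsIts.size() effectively opts out of throttling;
    // in any case we never need more in-flight requests than there are shards.
    this.maxConcurrentShardRequests = Math.min(maxConcurrentShardRequests, shardsIts.size());
}
```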
can we maybe not wrap on every parameter please?
@@ -57,7 +53,7 @@
             SearchTask task, Function<GroupShardsIterator<SearchShardIterator>, SearchPhase> phaseFactory) {
         super("can_match", logger, searchTransportService, nodeIdToConnection, aliasFilter, concreteIndexBoosts, executor, request,
                 listener,
-                shardsIts, timeProvider, clusterStateVersion, task, new BitSetSearchPhaseResults(shardsIts.size()));
+                shardsIts, timeProvider, clusterStateVersion, task, new BitSetSearchPhaseResults(shardsIts.size()), shardsIts.size());
can we leave a comment explaining why we do this here?
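For example, something along these lines would answer that inline (the wording is a suggestion, not the committed comment):

```java
// can_match is a cheap pre-filter round, so don't throttle it: passing
// shardsIts.size() as the max concurrent shard requests lets all shards be
// queried up front instead of one at a time from the response listener
// (which is what could recurse into a stack overflow).
shardsIts, timeProvider, clusterStateVersion, task,
        new BitSetSearchPhaseResults(shardsIts.size()), shardsIts.size());
```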
If the query coordinating node is also a data node that holds all the shards for a search request, we can end up recursing through the can match phase (because we send a local request and, on response, the listener moves to the next shard and does this again, without ever having returned from the previous shards). This recursion can lead to stack overflow for even a reasonable number of indices (daily indices over sixty days with five shards per day are enough to trigger the stack overflow). Moreover, all this execution would be happening on a network thread (the thread that initially received the query). With this commit, we allow search phases to override max concurrent requests. This allows the can match phase to avoid recursing through the shards towards a stack overflow. Relates #26484
Much cleaner. Thanks.
If the query coordinating node is also a data node that holds all the shards for a search request, we can end up recursing through the can match phase (because we send a local request and, on response, the listener moves to the next shard and does this again, without ever having returned from the previous shards). This recursion can lead to stack overflow for even a reasonable number of indices (daily indices over sixty days with five shards per day are enough to trigger the stack overflow). Moreover, all this execution would be happening on a network thread (the thread that initially received the query). With this commit, we allow search phases to override max concurrent requests. This allows the can match phase to avoid recursing through the shards towards a stack overflow.
Closes #26198