Add the ability to partition a scroll in multiple slices. #18237
Conversation
Relates to #13494 (comment)

@s1monw: hey @jimferenczi this is awesome. Yet, I have concerns about the field data. Loading all the stuff into fielddata only has a real benefit if it's reused, but I am not sure that is the common case. I wonder if we should use something like a dedicated query that walks the ...
@s1monw I've pushed another change that uses a BitSet per sliced scroll instead of building a specialized fieldcache for the _uid field: f391a59

@jpountz I needed a way to tell the query cache that certain queries should never be cached. For instance the SliceTermQuery builds a bitset for the top-level reader associated with a sliced scroll. This bitset is valid during the lifetime of the scroll but should be freed when the scroll is cleared.
String field = null;
int id = -1;
int max = -1;
while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
we should use org.elasticsearch.common.xcontent.ObjectParser for this instead of parsing all the stuff ourselves
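As an illustration of that suggestion (not code from this PR), a declarative parser for the slice object might look roughly like this; the SliceBuilder setter names, its no-arg constructor, and the Void context type are assumptions and may differ by Elasticsearch version:

```
// Hypothetical sketch of ObjectParser-based parsing for the "slice" section.
// Setter names on SliceBuilder are assumed; the parser context type may need to
// be something other than Void depending on the Elasticsearch version.
import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.xcontent.ObjectParser;

final class SliceParsingSketch {
    static final ObjectParser<SliceBuilder, Void> PARSER =
        new ObjectParser<>("slice", SliceBuilder::new);

    static {
        PARSER.declareString(SliceBuilder::setField, new ParseField("field"));
        PARSER.declareInt(SliceBuilder::setId, new ParseField("id"));
        PARSER.declareInt(SliceBuilder::setMax, new ParseField("max"));
    }

    // Usage: SliceBuilder slice = PARSER.parse(xContentParser, null);
}
```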
@jimferenczi I left a bunch of comments
@jimferenczi For caching I think we have two options: we can either make the top-level index reader part of the query's equals/hashCode definition, so that the cache would only be used if the query is reused on the same top-level reader. This means that the query would only be cached in practice if the index is pretty static. In case we really never want to cache (e.g. if these queries are cheaper than a cached DocIdSet, like the MatchAllDocsQuery iterator for instance), then your QueryCachingPolicy wrapper looks good to me. I would just move it to its own class instead of an anonymous class in IndexShard?
@@ -1380,6 +1408,7 @@ public boolean equals(Object obj) {
                && Objects.equals(size, other.size)
                && Objects.equals(sorts, other.sorts)
                && Objects.equals(searchAfterBuilder, other.searchAfterBuilder)
                && Objects.equals(sliceBuilder, other.sliceBuilder)
it should be used in the hashcode too
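For instance, a minimal sketch of that fix, with the field list abridged to the ones visible in the diff above (the actual SearchSourceBuilder has more fields):

```
// Sketch only: sliceBuilder folded into hashCode so it stays in sync with equals.
@Override
public int hashCode() {
    return Objects.hash(size, sorts, searchAfterBuilder, sliceBuilder /* , ...remaining fields */);
}
```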
@jpountz the TermsSliceQuery is just a way to avoid using fielddata on the _uid field.
They are not cheaper: the DocValuesSliceQuery is not faster but avoids the bitset entirely.
The TermsSliceQuery is supposed to be a cached DocIdSet, but I wanted to avoid the QueryCachingPolicy. I agree that it should use the query cache, but only if we can ensure that the entry will reliably stay around during the lifespan of the scroll. Currently this is achieved by keeping the query (and the bitset built during the initial scroll query) in the SearchContext.
Forcing the query cache to hold on to an entry for a certain amount of time would require adding more APIs, which I'd like to avoid. So I think the current approach of caching at the query/SearchContext level is better?
}
cachingPolicy = new QueryCachingPolicy() {
can you move it to its own class, e.g. ElasticsearchQueryCachingPolicy?
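For context, a hedged sketch of what such a dedicated policy could look like: it delegates to Lucene's usage-tracking policy but refuses to cache slice queries, which are pinned in the SearchContext instead. The class name is invented here, and the QueryCachingPolicy signatures follow recent Lucene and may differ in the version this PR targets; this is not the actual ElasticsearchQueryCachingPolicy.

```
// Sketch only: never cache slice queries, otherwise defer to the default
// usage-tracking policy. Signatures assume a recent Lucene QueryCachingPolicy.
import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryCachingPolicy;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;

public final class SliceAwareCachingPolicy implements QueryCachingPolicy {
    private final QueryCachingPolicy delegate = new UsageTrackingQueryCachingPolicy();

    @Override
    public void onUse(Query query) {
        delegate.onUse(query);
    }

    @Override
    public boolean shouldCache(Query query) throws IOException {
        if (query instanceof TermsSliceQuery) {
            // the slice bitset lives in the SearchContext for the scroll's lifetime
            return false;
        }
        return delegate.shouldCache(query);
    }
}
```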
@jimferenczi I had a conversation with @jpountz and I think we should go without the sorting altogether in the first iteration of this feature. I think we should have another conversation about how we can implement this feature really just as a parallel version of scroll that only guarantees that we partition all docs within a single scroll ID, i.e. we somehow need to maintain a context per slice on the same scroll context. I am not sure about the implementation yet but let's have a chat about it first. I also talked to @costin and we are good with that.
heya @jimferenczi - I've just thought of a potential problem with the current API: each scroll request no longer represents the same snapshot-in-time if indexing continues. In practice I don't know if this really is an issue...
@clintongormley this is exactly what we wanted when this issue started. It means that the slices are independent and can be processed completely independently. The snapshot should only be valid during the lifespan of a single slice. The nice thing about slicing on _uid (or any pseudo-random field that is never updated) is that we cannot miss any document. You can have new updates and deletes between two slices, but that doesn't change the fact that each slice is consistent.
@jimferenczi yeah, I think that the fact that the point-in-time may be slightly different per slice doesn't make any real difference. The same already applies to multiple shards.
@s1monw I pushed another commit to address the memory explosion. The slices are now split across shards as you suggested, so for instance if the number of shards is equal to 2 and the user requested 4 slices, then slices 0 and 2 are assigned to the first shard and slices 1 and 3 are assigned to the second shard. This means that the total number of bitsets needed is bounded by the number of slices instead of (numShards * numSlices).
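To illustrate that assignment, here is a standalone sketch (not the actual Elasticsearch code) using the same arithmetic as the diff below:

```
// Illustration only: map global slice ids to shards and to shard-local slice ids.
// With numShards = 2 and numSlices = 4: slices 0 and 2 land on shard 0, slices 1 and 3 on shard 1.
public class SliceAssignmentDemo {
    public static void main(String[] args) {
        int numShards = 2;
        int numSlices = 4;
        for (int id = 0; id < numSlices; id++) {
            int targetShard = id % numShards; // shard that executes this slice
            int shardSlice = id / numShards;  // slice id local to that shard
            System.out.println("slice " + id + " -> shard " + targetShard + " (local slice " + shardSlice + ")");
        }
    }
}
```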
// in such case we can reduce the number of requested shards by slice

// first we check if the slice is responsible of this shard
int targetShard = id % numShards;
should we use Math.floorMod here just to be sure?
If id or numShards are negative we're screwed anyway, so using floorMod would just hide the problem? We always check that id is positive so it should not be a problem.
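For reference, the behavioural difference between the two on negative inputs (plain Java semantics, not code from this PR):

```
public class FloorModDemo {
    public static void main(String[] args) {
        System.out.println(-3 % 2);               // -1: % keeps the sign of the dividend
        System.out.println(Math.floorMod(-3, 2)); //  1: floorMod follows the sign of the divisor
    }
}
```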
I left minor comments but I love it! Thanks for doing it.
@@ -66,7 +74,7 @@ public int hashCode() {
     }

     @Override
-    public String toString(String field) {
+    public String toString(String f) {
lol :)
Wonderful, LGTM!
}
// get the new slice id for this shard
int shardSlice = id / numShards;
if (useTermQuery) {
Minor nitpick: return useTermQuery ? new TermsSliceQuery(field, shardSlice, numSlicesInShard) : new DocValuesSliceQuery(field, shardSlice, numSlicesInShard);
LGTM, no need for another review!
API:

```
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '{
    "slice": {
        "field": "_uid", <1>
        "id": 0, <2>
        "max": 10 <3>
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}'
```

<1> (optional) The field name used to do the slicing (_uid by default)
<2> The id of the slice
<3> The maximum number of slices

By default the splitting is done on the shards first and then locally on each shard using the _uid field with the following formula:
`slice(doc) = floorMod(hashCode(doc._uid), max)`

For instance if the number of shards is equal to 2 and the user requested 4 slices then the slices 0 and 2 are assigned to the first shard and the slices 1 and 3 are assigned to the second shard.

Each scroll is independent and can be processed in parallel like any scroll request.

Closes #13494
public boolean get(int doc) {
    values.setDocument(doc);
    for (int i = 0; i < values.count(); i++) {
        return contains(Long.hashCode(values.valueAt(i)));
s/Long.hashCode/BitMixer.mix64/? Otherwise we might still have the issue with doubles, given that Long.hashCode is a bit naive.
Thanks, I was naive too, I pushed another commit.
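As a rough illustration of the suggested substitution (a sketch under assumptions, not the actual commit): the class and its contains(long) helper are invented here for self-containment, and BitMixer comes from the hppc dependency.

```
// Sketch only: mix the full 64-bit doc value before the slice check instead of using
// Long.hashCode, which just xors the two 32-bit halves and can distribute doubles
// (encoded as sortable longs) poorly across slices.
import com.carrotsearch.hppc.BitMixer;

final class SliceHashSketch {
    private final int id;  // slice id
    private final int max; // total number of slices

    SliceHashSketch(int id, int max) {
        this.id = id;
        this.max = max;
    }

    // Assumed helper: true if the mixed hash falls into this slice.
    boolean contains(long hash) {
        return Math.floorMod(hash, (long) max) == id;
    }

    boolean matches(long docValue) {
        return contains(BitMixer.mix64(docValue));
    }
}
```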