Stateless real-time mget #96763
Conversation
Force-pushed from d6ccc73 to 77e377f
Force-pushed from 77e377f to 8788899
@@ -8,11 +8,6 @@ routing:
        index:
          number_of_shards: 5
          number_of_routing_shards: 5
          number_of_replicas: 0
In Stateless, these tests need a search shard.
OK, but why do we remove the cluster health check below and in the other .yml file?
It's basically waiting for a green index, which is not needed. See #94385 for more detail.
I think we do need to wait for a green index here, since otherwise the mget could fail in stateless in case the search shard is not yet available. AFAICS, the default is to wait for one active shard.
The problem is that if I add wait for green, the test would never pass in stateful, since the default number of replicas is 1 and we have a one-node cluster. To make the test work for both stateful and stateless we need to do this. I've made the same change for a very similar (5-shard) test case for get, see 5010402. So far I haven't seen any failures. If it turns out to be an issue, I think we'd need to clone the test or play with some related settings.
Can we use auto-expand replicas 0-1 instead then? I think that would work in both setups.
I think this does introduce fragility into testing and we should try to avoid that if we can.
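A hedged sketch of the reviewer's suggestion expressed through the Java Settings API (the real test stays a YAML REST test; the setting keys are real index settings, the surrounding code is illustrative only):

    import org.elasticsearch.common.settings.Settings;

    // Instead of pinning number_of_replicas to 0, let replicas auto-expand between
    // 0 and 1: the index can go green on a one-node stateful cluster (0 replicas)
    // while still getting a search shard assigned in stateless (1 replica).
    Settings indexSettings = Settings.builder()
        .put("index.number_of_shards", 5)
        .put("index.number_of_routing_shards", 5)
        .put("index.auto_expand_replicas", "0-1")
        .build();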
Force-pushed from 8b9f1b9 to fd644d6
Pinging @elastic/es-distributed (Team:Distributed)
In general looks good to me. Just a couple of questions.
            listener.delegateFailureAndWrap((l, replicationResponse) -> super.asyncShardOperation(request, shardId, l))
        );
    } else if (request.realtime()) {
        TransportShardMultiGetFomTranslogAction.Request getFromTranslogRequest = new TransportShardMultiGetFomTranslogAction.Request(
nit: mgetFromTranslogRequest may be better
        }
    }

    // Returns the index of entries in response.locations that have a missing result with no failure on the promotable shard.
nit: the indices/indexes
@@ -163,6 +303,17 @@ private void asyncShardMultiGet(MultiGetShardRequest request, ShardId shardId, A
        }
    }

    private DiscoveryNode getCurrentNodeOfPrimary(ShardId shardId) {
nit: You can refactor this with the same one in TransportGetAction.
Done in 36d8892 plus the other comments.
    indexShard.waitForSegmentGeneration(
        r.segmentGeneration(),
        listener.delegateFailureAndWrap(
            (ll, aLong) -> threadPool.executor(getExecutor(request, shardId))
Is it necessary to execute() asynchronously here? It seems like once execution reaches this point (after the generation has been waited upon), we could also call the handleLocalGets() function directly.
I think it is needed as otherwise we'd run handleLocalGets on a REFRESH thread.
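For context, a hedged reconstruction of the dispatch being discussed (waitForSegmentGeneration, getExecutor and handleLocalGets appear in the quoted diff; the trailing call and its arguments are assumptions):

    // The listener passed to waitForSegmentGeneration may be completed on a REFRESH
    // thread, so the remaining local gets are forked back onto the mget executor
    // rather than being run inline on that thread.
    indexShard.waitForSegmentGeneration(
        r.segmentGeneration(),
        listener.delegateFailureAndWrap(
            (ll, aLong) -> threadPool.executor(getExecutor(request, shardId))
                // the exact handleLocalGets signature is an assumption in this sketch
                .execute(() -> handleLocalGets(request, r.multiGetShardResponse(), shardId, ll))
        )
    );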
Thanks, LGTM
This looks good, left a few comments.
@@ -9,11 +9,6 @@
      settings:
        index:
          refresh_interval: -1
          number_of_replicas: 0

  - do:
Likewise, I think we need the wait for green here for it to work in stateless.
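For illustration, a hedged Java-client equivalent of the wait_for green step being requested (the index name test_1 is hypothetical):

    // Wait for the index to go green so the search shard is assigned before the
    // test starts issuing realtime mgets against it.
    client().admin()
        .cluster()
        .prepareHealth("test_1")
        .setWaitForGreenStatus()
        .get();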
@@ -180,7 +182,7 @@ private void asyncGet(GetRequest request, ShardId shardId, ActionListener<GetRes
    private void handleGetOnUnpromotableShard(GetRequest request, IndexShard indexShard, ActionListener<GetResponse> listener)
        throws IOException {
        ShardId shardId = indexShard.shardId();
-       DiscoveryNode node = getCurrentNodeOfPrimary(shardId);
+       var node = getCurrentNodeOfPrimary(clusterService.state().routingTable(), clusterService.state().nodes(), shardId);
Let us grab the state() only once to avoid the routing table and nodes being out of sync.
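A minimal sketch of the call-site change being suggested (variable names are illustrative):

    // Capture the cluster state once so the routing table and the discovery nodes
    // are guaranteed to come from the same cluster state version.
    ClusterState state = clusterService.state();
    var node = getCurrentNodeOfPrimary(state.routingTable(), state.nodes(), shardId);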
Handled this and the rest in 2f4cb8b.
-   private DiscoveryNode getCurrentNodeOfPrimary(ShardId shardId) {
-       var clusterState = clusterService.state();
-       var shardRoutingTable = clusterState.routingTable().shardRoutingTable(shardId);
+   static DiscoveryNode getCurrentNodeOfPrimary(RoutingTable routingTable, DiscoveryNodes nodes, ShardId shardId) {
I would find it simpler to reason about this method if it received either the ClusterService or the ClusterState, since that avoids the possibility of the routingTable and the nodes being from different ClusterState versions.
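A hedged sketch of the signature this comment proposes, which makes it impossible for callers to mix a routing table and nodes from different states (the body is an assumption based on the quoted code):

    static DiscoveryNode getCurrentNodeOfPrimary(ClusterState state, ShardId shardId) {
        var shardRoutingTable = state.routingTable().shardRoutingTable(shardId);
        var primary = shardRoutingTable.primaryShard();
        // Resolve the node currently holding the primary from the same state.
        return state.nodes().get(primary.currentNodeId());
    }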
        ActionListener<MultiGetShardResponse> listener
    ) throws IOException {
        ShardId shardId = indexShard.shardId();
        var node = getCurrentNodeOfPrimary(clusterService.state().routingTable(), clusterService.state().nodes(), shardId);
Let us grab the state() only once.
    }

    // Returns the indices of entries in response.locations that have a missing result with no failure on the promotable shard.
    private static List<Integer> locationsWithMissingResults(TransportShardMultiGetFomTranslogAction.Response response) {
I would prefer looping over the response twice rather than collecting this list of Integer objects. I.e., we can first check whether we have all the results, terminating that pass early if we do not, and then loop again to collect the missing results.
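A hedged sketch of the two-pass shape being suggested (field access on MultiGetShardResponse mirrors the quoted code; treat the exact names as assumptions):

    // First pass: just detect whether anything is missing, terminating early on the
    // first hit instead of allocating a list of Integer locations.
    private static boolean hasMissingResults(MultiGetShardResponse response) {
        for (int i = 0; i < response.locations.size(); i++) {
            if (response.responses.get(i) == null && response.failures.get(i) == null) {
                return true;
            }
        }
        return false;
    }

A second loop would then handle only the positions that still need a local get, once we know there is at least one.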
try {
    GetResult getResult = indexShard.getService()
        .get(
            item.id(),
            item.storedFields(),
            request.realtime(),
            item.version(),
            item.versionType(),
            item.fetchSourceContext(),
            request.isForceSyntheticSource()
        );
    response.add(request.locations.get(l), new GetResponse(getResult));
} catch (RuntimeException e) {
    if (TransportActions.isShardNotAvailableException(e)) {
        throw e;
    } else {
        logger.debug(() -> format("%s failed to execute multi_get for [%s]", shardId, item.id()), e);
        response.add(request.locations.get(l), new MultiGetResponse.Failure(request.index(), item.id(), e));
    }
} catch (IOException e) {
    logger.debug(() -> format("%s failed to execute multi_get for [%s]", shardId, item.id()), e);
    response.add(request.locations.get(l), new MultiGetResponse.Failure(request.index(), item.id(), e));
}
I think we can refactor this into a method shared with the similar code from shardOperation? Unless I missed a detail, they look identical.
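A hedged sketch of the shared helper this comment suggests, lifted from the quoted block (the method name and exact parameters are assumptions; shardOperation would call the same helper for each item):

    private void getAndAddToResponse(ShardId shardId, IndexShard indexShard, MultiGetShardRequest request, int l, MultiGetShardResponse response) {
        MultiGetRequest.Item item = request.items.get(l);
        try {
            GetResult getResult = indexShard.getService()
                .get(
                    item.id(),
                    item.storedFields(),
                    request.realtime(),
                    item.version(),
                    item.versionType(),
                    item.fetchSourceContext(),
                    request.isForceSyntheticSource()
                );
            response.add(request.locations.get(l), new GetResponse(getResult));
        } catch (RuntimeException e) {
            if (TransportActions.isShardNotAvailableException(e)) {
                throw e;
            }
            logger.debug(() -> format("%s failed to execute multi_get for [%s]", shardId, item.id()), e);
            response.add(request.locations.get(l), new MultiGetResponse.Failure(request.index(), item.id(), e));
        } catch (IOException e) {
            logger.debug(() -> format("%s failed to execute multi_get for [%s]", shardId, item.id()), e);
            response.add(request.locations.get(l), new MultiGetResponse.Failure(request.index(), item.id(), e));
        }
    }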
@elasticmachine update branch
LGTM.
@elasticmachine please run elasticsearch-ci/part-3
As described in the issue, the change in elastic#96763 has made the MixedClusterClientYamlTestSuiteIT for mget fail very often. For now, let's take the same approach that we have for get. Closes elastic#97236
The mget counterpart of #93976.
Relates ES-5677