log messages from allocation commands #25955

andyb-elastic · 2017-07-28T23:32:33Z

The first approach we tried was #25325. This only logs the messages from allocation commands once the cluster state update has been acknowledged. @ywelsch (who I got this in a little too late for) recommended to hold off on exposing it in the rest layer for now.

Write the messages as a new element to the reroute response. Tests not passing

Build messages from explanations and log them on cluster state ack. IT passes

andyb-elastic · 2017-07-28T23:35:36Z

core/src/main/java/org/elasticsearch/action/admin/cluster/reroute/ClusterRerouteResponse.java

+            .map(Optional::get)
+            .collect(Collectors.toList());
+    }
+


It may make more sense to put this method in RoutingExplanations itself since that's what it operates on. My thinking here was that the logic of "which messages do we actually surface" is closer to the response handling.

See above. I'm +0 to move it to RoutingExplanations. If you prefer it here, I'm good with leaving it here too.

andyb-elastic · 2017-08-01T15:48:28Z

@bleskes this is the one I mentioned, thanks

jasontedor · 2017-08-02T01:02:22Z

This approach is much better than the previous approaches, it looks promising to me. I'm not going to give a full review, I'll leave that to @bleskes. However, I want to make one comment:

@ywelsch (who I got this in a little too late for) recommended to hold off on exposing it in the rest layer for now.

I think we can hold off on that forever, that strikes me as unnecessary. The point of getting this in the logs is for diagnostic purposes, REST responses are ephemeral. A user using this endpoint has (hopefully) thoroughly read the docs and understands what they are doing, adding this to the response provides nothing over that.

dakrone

LGTM, I left two minor comments

dakrone · 2017-08-02T15:50:47Z

core/src/main/java/org/elasticsearch/action/admin/cluster/reroute/ClusterRerouteResponse.java

@@ -56,6 +60,15 @@ public RoutingExplanations getExplanations() {
        return this.explanations;
    }

+    public List<String> getMessages() {


Can you add javadocs for this method?

Definitely, missed that

dakrone · 2017-08-02T15:52:18Z

.../main/java/org/elasticsearch/action/admin/cluster/reroute/TransportClusterRerouteAction.java

+                if (request.dryRun() == false) {
+                    response.getMessages().forEach(logger::info);
+                }
+                listener.onResponse(response);


Should this go into a finally block in case an exception is thrown in getMessages?

Sounds good to me. If that were to happen without finally, would an error response with the exception be returned somewhere "upstream", or would the request handling just time out?

I presume that the error would be logged and the connection would hang until the client timed out (just a guess)

ActionListener#wrap protects against this:

static <Response> ActionListener<Response> wrap(CheckedConsumer<Response, ? extends Exception> onResponse, Consumer<Exception> onFailure) { return new ActionListener<Response>() { @Override public void onResponse(Response response) { try { onResponse.accept(response); } catch (Exception e) { onFailure(e); } } @Override public void onFailure(Exception e) { onFailure.accept(e); } }; }

bleskes

I left some minor comments here and there, but I have a bigger issue I want to bring up for discussion. I personally find it unnatural that the messages API is only relevant when the decision is YES. This is also subtle is it not reflected in our code paths but is a convention. Next to that the message must be stateless because of things are set up. Since every command returns RerouteExplanation, I find it more natural to have an optional extra message there, which actually correlates to the decision. This might also make it more intuitive to expose it on the REST API. We can still choose to log "yes" messages under info. Did you consider this option? Just an idea - what do you think?

bleskes · 2017-08-02T13:33:03Z

core/src/main/java/org/elasticsearch/action/admin/cluster/reroute/ClusterRerouteResponse.java

@@ -56,6 +60,15 @@ public RoutingExplanations getExplanations() {
        return this.explanations;
    }

+    public List<String> getMessages() {


maybe add some java docs? the relationship between Explanations and messages is confusing imo.

bleskes · 2017-08-02T13:34:02Z

core/src/main/java/org/elasticsearch/action/admin/cluster/reroute/ClusterRerouteResponse.java

+            .map(Optional::get)
+            .collect(Collectors.toList());
+    }
+


See above. I'm +0 to move it to RoutingExplanations. If you prefer it here, I'm good with leaving it here too.

bleskes · 2017-08-02T13:37:34Z

.../main/java/org/elasticsearch/action/admin/cluster/reroute/TransportClusterRerouteAction.java

+        ActionListener<ClusterRerouteResponse> logWrapper = ActionListener.wrap(
+            response -> {
+                if (request.dryRun() == false) {
+                    response.getMessages().forEach(logger::info);


I wonder if we can come up with some prefix to those. We log them after the operation has been done and which the tense a bit weird - see for example https://github.com/elastic/elasticsearch/pull/25955/files#diff-dc1ff30a0cef7322a73dd1a8e33a1e3aR52 : :"Allocating an empty primary for" - it should be "Allocated" as it is already done. I can't come up with a good prefix though, so by default I would suggest changing the messages to use past tense.

bleskes · 2017-08-02T20:57:28Z

.../main/java/org/elasticsearch/action/admin/cluster/reroute/TransportClusterRerouteAction.java

+                if (request.dryRun() == false) {
+                    response.getMessages().forEach(logger::info);
+                }
+                listener.onResponse(response);


ActionListener#wrap protects against this:

static <Response> ActionListener<Response> wrap(CheckedConsumer<Response, ? extends Exception> onResponse, Consumer<Exception> onFailure) { return new ActionListener<Response>() { @Override public void onResponse(Response response) { try { onResponse.accept(response); } catch (Exception e) { onFailure(e); } } @Override public void onFailure(Exception e) { onFailure.accept(e); } }; }

andyb-elastic · 2017-08-02T22:12:35Z

Boaz and I talked and decided we liked making the messages accessible from RerouteExplanations directly, and that the logging should include more background on which commands were applied. It could be something like (1) log each reroute command that had a YES decision, and then maybe separately (2) log messages when the cluster state is applied

andyb-elastic · 2017-08-03T03:34:53Z

I changed it so that (not during a dry run) it always logs whether each command will be applied, and separately logs feedback messages from commands that were applied after the cluster state is updated. Depending on how we feel about this it might make sense to log the explanations from the decisions too (the ones I looked at seemed informative)

The log messages for the test in here (the first allocation command gets a yes decision, the second gets a no) look like this

  1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will be applied: allocate an empty primary for [test_index][0] on node [node_t0]
  1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will not be applied: allocate an empty primary for [noexist][1] on node [node_t1]
  1> [2017-08-03T00:10:17,379][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocated an empty primary for [test_index][0] on node [node_t0]. This action can cause data loss. If the old primary rejoins the cluster, its copy of this shard will be deleted.

Since every command returns RerouteExplanation, I find it more natural to have an optional extra message there, which actually correlates to the decision

This is still kind of going on here, but seems a little better since there's a distinction between the messages and the decision summaries now

bleskes

This look good, my only request is to remove the first two lines from:

  1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will be applied: allocate an empty primary for [test_index][0] on node [node_t0]
  1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will not be applied: allocate an empty primary for [noexist][1] on node [node_t1]
  1> [2017-08-03T00:10:17,379][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocated an empty primary for [test_index][0] on node [node_t0]. This action can cause data loss. If the old primary rejoins the cluster, its copy of this shard will be deleted.

We only need to log things that were applied (the goal was to gain insight in the log for those potentially dangerous commands). I also left a question.

Thx @andy-elastic

bleskes · 2017-08-03T19:51:44Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/RoutingExplanations.java

+    /**
+     * Builds messages for the user that indicate whether commands were applied to the cluster state or not.
+     */
+    public List<String> getCommandSummary() {


getCommandsSummary? or just getSummary?

bleskes · 2017-08-03T19:54:30Z

.../elasticsearch/cluster/routing/allocation/command/AllocateEmptyPrimaryAllocationCommand.java

+    }
+
+    @Override
+    public Optional<String> getMessage() {


Question, why did you do it here and not as part of the RoutingExplaination returned by execute()? I'm ok if you prefer this, but just wondering to your thinking?

My thinking was that we didn't want to change the wire format of RerouteExplanation by adding the message to it (I think Yannick mentioned this), so here seemed like an okay place to generate the message.

Do you mean putting this message into one of the field of the Decision that the RerouteExplanation is constructed with? E.g.

public RerouteExplanation execute(RoutingAllocation allocation, boolean explain) { // ... return new RerouteExplanation(this, allocation.decision(Decision.YES, name() + " (allocation command)", "allocated an empty primary for ... rest of the message")); }

I was was thinking about adding a message field to RerouteExplanation it self. The write thing is not a big deal IMO

andyb-elastic · 2017-08-04T15:53:14Z

@bleskes sorry I must have misunderstood what we talked about the other day. To be clear, we don't want to log general information about command applications (the "Allocation command [name] will/won't be applied" messages) before the cluster state is updated, regardless of whether they'll be applied? The only logging we want is messages about the effects of the command, logged after the cluster state is updated?

If I remove those "Allocation command [name] will/won't be applied" logs, AllocationCommand#description, and RoutingExplanations#getCommandSummary, does that look like what we want here?

bleskes · 2017-08-04T15:58:37Z

The only logging we want is messages about the effects of the command, logged after the cluster state is updated?

Agreed!

If I remove those "Allocation command [name] will/won't be applied" logs, AllocationCommand#description, and RoutingExplanations#getCommandSummary, does that look like what we want here?

I'm not 100% sure what you mean, but I think we can go back to what you had? Filter the routing explanation for those with a Yes decision and log that?

andyb-elastic · 2017-08-04T16:06:56Z

Sounds good, I'll go back to just the original logging (filter the explanations for YES decisions) and move the message to RerouteExplanation. Thanks

Move message to RerouteExplanation Remove logging for summary of command applications

andyb-elastic · 2017-08-04T19:49:41Z

I think the bwc failures on 50a292c weren't caused by this, merging master back in seems to fix them

andyb-elastic · 2017-08-07T18:09:01Z

Okay it was confirmed those bwc failures were unrelated. @bleskes how does this version look?

ywelsch · 2017-08-15T03:27:06Z

Next to that the message must be stateless because of things are set up. Since every command returns RerouteExplanation, I find it more natural to have an optional extra message there, which actually correlates to the decision. This might also make it more intuitive to expose it on the REST API.

I think it was fine the way it was. It's actually nice for this to be stateless, as it is purely used for logging. Adding state (and transferring it over the wire) is unnecessary here. I don't see why we should expose this in the REST layer, either (I tried to formulate it in a more agreeable way than @jasontedor did -> hold off forever). If someone explicitly gives the command "allocate empty primary with accept_data_loss = true", then there is no need to add a header to the response that something bad might have happened?

I think what we had here in the beginning was fine, until it got butchered... :-(

andyb-elastic · 2017-08-16T17:50:10Z

It sounds like the main choice here is between

Messages are returned by the AllocationCommand instance and are collected from there (RoutingExplanations -> RerouteExplanation -> AllocationCommand -> getMessage). This doesn't change the wire format
Messages are set in the RerouteExplanation returned by the AllocationCommand and are collected from there (RoutingExplanations -> RerouteExplanation -> getMessage). This adds a field to the wire format

I think both of these ideas have strong reasoning behind them. @ywelsch I'm fine going back to the earlier version, or waiting for Boaz to get back if you'd rather

I believe we're all in agreement about not exposing it in the REST layer. In the current version I didn't add it to RerouteExplanation's generated xcontent with the intent of not exposing it in the response.

jasontedor · 2017-08-17T13:25:23Z

I agree with @ywelsch that it's better as it was previously.

andyb-elastic · 2017-08-17T17:32:32Z

@jasontedor @ywelsch I'll change it back to the version where the wire format doesn't change

andyb-elastic · 2017-08-17T18:24:51Z

Messages are accessed from the command objects, no wire format changes
Shortened the messages - the logs didn't seem like a useful place for this info, and it would be easy to forget to update those messages if the conditions for data loss ever change
Some checkstyle fixes for other parts of the command classes

ywelsch

Left one suggestion to improve the log message o.w. LGTM

ywelsch · 2017-08-18T02:35:48Z

.../elasticsearch/cluster/routing/allocation/command/AllocateStalePrimaryAllocationCommand.java

@@ -70,6 +71,11 @@ public String name() {
        return NAME;
    }

+    @Override
+    public Optional<String> getMessage() {
+        return Optional.of("Allocated a stale primary for [" + index + "][" + shardId + "] on node [" + node + "]");


log messages always start in lower case. Also I wonder if we should give this (and AllocateEmptyPrimaryAllocationCommand) a stronger wording, e.g. "force-allocated a stale primary ...", to make it clearer that this action has been explicitly forced by the user, not the system.

That's a good idea, will make it more clear that it came from a command

bleskes · 2017-08-21T10:01:26Z

until it got butchered... :-(

I think your attempt at being agreeable ended here :)

My concern was that the new message field is very similar in nature to how the description of the Decision is being used by most allocation commands - more info about a YES in a human readable format.

I defer the final decision to the one made by @ywelsch who's the current de-facto "owner" of this code. I'm fine with getting it in as is. @andy-elastic thanks for all the iterations.

andyb-elastic · 2017-08-21T21:33:22Z

@bleskes no problem, this discussion was really helpful for learning about cluster concepts and our priorities here

I'll merge this after CI finishes. @ywelsch should this be backported to 6.x? I'm guessing yes since it's simple

jasontedor · 2017-08-21T21:51:10Z

You can safely backport to 6.0.

Gives allocation commands from the cluster reroute API the ability to provide messages to be logged once the cluster state change has been committed. The purpose of this change is to create a record in the logs when allocation commands which could potentially be destructive are applied. The allocate_empty_primary and allocate_stale_primary commands are the only ones that currently provide log messages. Closes #22821

andyb-elastic added 5 commits July 28, 2017 08:13

wip #22821

d13d904

Write the messages as a new element to the reroute response. Tests not passing

wip #22821

7fc7e91

Build messages from explanations and log them on cluster state ack. IT passes

wip #22821 only log if dry run, fill in IT which passes

fe12f4f

wip #22821 clean up imports and formatting

30b1d7c

Merge branch 'master' into feature/primary-allocation-logging-2

459da7b

andyb-elastic added :Cluster :Core/Infra/Logging Log management and logging utilities v7.0.0 labels Jul 28, 2017

andyb-elastic commented Jul 28, 2017

View reviewed changes

bleskes self-requested a review August 1, 2017 15:52

dakrone approved these changes Aug 2, 2017

View reviewed changes

bleskes suggested changes Aug 2, 2017

View reviewed changes

andyb-elastic added 2 commits August 2, 2017 17:22

Merge branch 'master' into feature/primary-allocation-logging-2

4ae5e1e

wip #22821 log command decisions and success messages separately

c140fb5

wip #22821 javadoc for new routing explanation methods

8b1a14c

bleskes suggested changes Aug 3, 2017

View reviewed changes

andyb-elastic added 2 commits August 4, 2017 10:36

wip #22821

50a292c

Move message to RerouteExplanation Remove logging for summary of command applications

Merge branch 'master' into feature/primary-allocation-logging-2

c409959

andyb-elastic added 2 commits August 17, 2017 10:35

Merge branch 'master' into feature/primary-allocation-logging-2

7c4256b

For #22821 go back to messages accessed from commands

33d62f8

ywelsch approved these changes Aug 18, 2017

View reviewed changes

andyb-elastic added 2 commits August 18, 2017 09:26

Merge branch 'master' into feature/primary-allocation-logging-2

b9db126

For #22821 more context in log messages

3a40d56

Merge branch 'master' into feature/primary-allocation-logging-2

9f69454

andyb-elastic merged commit bdefcbd into elastic:master Aug 22, 2017

clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. and removed :Cluster labels Feb 13, 2018

jpountz added >feature and removed :Core/Infra/Logging Log management and logging utilities labels Jan 29, 2019

jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

DaveCTurner added the v6.1.0 label Jun 7, 2019

log messages from allocation commands #25955

log messages from allocation commands #25955

Uh oh!

Conversation

andyb-elastic commented Jul 28, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andyb-elastic commented Aug 1, 2017

Uh oh!

jasontedor commented Aug 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dakrone left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andyb-elastic Aug 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bleskes left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andyb-elastic commented Aug 2, 2017

Uh oh!

andyb-elastic commented Aug 3, 2017

Uh oh!

bleskes left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andyb-elastic commented Aug 4, 2017

Uh oh!

bleskes commented Aug 4, 2017

Uh oh!

andyb-elastic commented Aug 4, 2017

Uh oh!

andyb-elastic commented Aug 4, 2017

Uh oh!

andyb-elastic commented Aug 7, 2017

Uh oh!

ywelsch commented Aug 15, 2017

Uh oh!

andyb-elastic commented Aug 16, 2017

Uh oh!

jasontedor commented Aug 17, 2017

Uh oh!

andyb-elastic commented Aug 17, 2017

Uh oh!

andyb-elastic commented Aug 17, 2017

Uh oh!

ywelsch left a comment

Choose a reason for hiding this comment

jasontedor commented Aug 2, 2017 •

edited

Loading

andyb-elastic Aug 2, 2017 •

edited

Loading