Skip to content

log messages from allocation commands #25955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Aug 22, 2017
Merged

log messages from allocation commands #25955

merged 15 commits into from
Aug 22, 2017

Conversation

andyb-elastic
Copy link
Contributor

For #22821

The first approach we tried was #25325. This only logs the messages from allocation commands once the cluster state update has been acknowledged. @ywelsch (who I got this in a little too late for) recommended to hold off on exposing it in the rest layer for now.

Write the messages as a new element to the reroute response.

Tests not passing
Build messages from explanations and log them on cluster state
ack. IT passes
@andyb-elastic andyb-elastic added :Cluster :Core/Infra/Logging Log management and logging utilities v7.0.0 labels Jul 28, 2017
.map(Optional::get)
.collect(Collectors.toList());
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may make more sense to put this method in RoutingExplanations itself since that's what it operates on. My thinking here was that the logic of "which messages do we actually surface" is closer to the response handling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. I'm +0 to move it to RoutingExplanations. If you prefer it here, I'm good with leaving it here too.

@andyb-elastic
Copy link
Contributor Author

@bleskes this is the one I mentioned, thanks

@bleskes bleskes self-requested a review August 1, 2017 15:52
@jasontedor
Copy link
Member

jasontedor commented Aug 2, 2017

This approach is much better than the previous approaches, it looks promising to me. I'm not going to give a full review, I'll leave that to @bleskes. However, I want to make one comment:

@ywelsch (who I got this in a little too late for) recommended to hold off on exposing it in the rest layer for now.

I think we can hold off on that forever, that strikes me as unnecessary. The point of getting this in the logs is for diagnostic purposes, REST responses are ephemeral. A user using this endpoint has (hopefully) thoroughly read the docs and understands what they are doing, adding this to the response provides nothing over that.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left two minor comments

@@ -56,6 +60,15 @@ public RoutingExplanations getExplanations() {
return this.explanations;
}

public List<String> getMessages() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add javadocs for this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, missed that

if (request.dryRun() == false) {
response.getMessages().forEach(logger::info);
}
listener.onResponse(response);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go into a finally block in case an exception is thrown in getMessages?

Copy link
Contributor Author

@andyb-elastic andyb-elastic Aug 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. If that were to happen without finally, would an error response with the exception be returned somewhere "upstream", or would the request handling just time out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume that the error would be logged and the connection would hang until the client timed out (just a guess)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ActionListener#wrap protects against this:

    static <Response> ActionListener<Response> wrap(CheckedConsumer<Response, ? extends Exception> onResponse,
            Consumer<Exception> onFailure) {
        return new ActionListener<Response>() {
            @Override
            public void onResponse(Response response) {
                try {
                    onResponse.accept(response);
                } catch (Exception e) {
                    onFailure(e);
                }
            }

            @Override
            public void onFailure(Exception e) {
                onFailure.accept(e);
            }
        };
    }

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor comments here and there, but I have a bigger issue I want to bring up for discussion. I personally find it unnatural that the messages API is only relevant when the decision is YES. This is also subtle is it not reflected in our code paths but is a convention. Next to that the message must be stateless because of things are set up. Since every command returns RerouteExplanation, I find it more natural to have an optional extra message there, which actually correlates to the decision. This might also make it more intuitive to expose it on the REST API. We can still choose to log "yes" messages under info. Did you consider this option? Just an idea - what do you think?

@@ -56,6 +60,15 @@ public RoutingExplanations getExplanations() {
return this.explanations;
}

public List<String> getMessages() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add some java docs? the relationship between Explanations and messages is confusing imo.

.map(Optional::get)
.collect(Collectors.toList());
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. I'm +0 to move it to RoutingExplanations. If you prefer it here, I'm good with leaving it here too.

ActionListener<ClusterRerouteResponse> logWrapper = ActionListener.wrap(
response -> {
if (request.dryRun() == false) {
response.getMessages().forEach(logger::info);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can come up with some prefix to those. We log them after the operation has been done and which the tense a bit weird - see for example https://github.com/elastic/elasticsearch/pull/25955/files#diff-dc1ff30a0cef7322a73dd1a8e33a1e3aR52 : :"Allocating an empty primary for" - it should be "Allocated" as it is already done. I can't come up with a good prefix though, so by default I would suggest changing the messages to use past tense.

if (request.dryRun() == false) {
response.getMessages().forEach(logger::info);
}
listener.onResponse(response);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ActionListener#wrap protects against this:

    static <Response> ActionListener<Response> wrap(CheckedConsumer<Response, ? extends Exception> onResponse,
            Consumer<Exception> onFailure) {
        return new ActionListener<Response>() {
            @Override
            public void onResponse(Response response) {
                try {
                    onResponse.accept(response);
                } catch (Exception e) {
                    onFailure(e);
                }
            }

            @Override
            public void onFailure(Exception e) {
                onFailure.accept(e);
            }
        };
    }

@andyb-elastic
Copy link
Contributor Author

Boaz and I talked and decided we liked making the messages accessible from RerouteExplanations directly, and that the logging should include more background on which commands were applied. It could be something like (1) log each reroute command that had a YES decision, and then maybe separately (2) log messages when the cluster state is applied

@andyb-elastic
Copy link
Contributor Author

I changed it so that (not during a dry run) it always logs whether each command will be applied, and separately logs feedback messages from commands that were applied after the cluster state is updated. Depending on how we feel about this it might make sense to log the explanations from the decisions too (the ones I looked at seemed informative)

The log messages for the test in here (the first allocation command gets a yes decision, the second gets a no) look like this

  1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will be applied: allocate an empty primary for [test_index][0] on node [node_t0]
  1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will not be applied: allocate an empty primary for [noexist][1] on node [node_t1]
  1> [2017-08-03T00:10:17,379][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocated an empty primary for [test_index][0] on node [node_t0]. This action can cause data loss. If the old primary rejoins the cluster, its copy of this shard will be deleted.

Since every command returns RerouteExplanation, I find it more natural to have an optional extra message there, which actually correlates to the decision

This is still kind of going on here, but seems a little better since there's a distinction between the messages and the decision summaries now

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look good, my only request is to remove the first two lines from:

  1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will be applied: allocate an empty primary for [test_index][0] on node [node_t0]
  1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will not be applied: allocate an empty primary for [noexist][1] on node [node_t1]
  1> [2017-08-03T00:10:17,379][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocated an empty primary for [test_index][0] on node [node_t0]. This action can cause data loss. If the old primary rejoins the cluster, its copy of this shard will be deleted.

We only need to log things that were applied (the goal was to gain insight in the log for those potentially dangerous commands). I also left a question.

Thx @andy-elastic

/**
* Builds messages for the user that indicate whether commands were applied to the cluster state or not.
*/
public List<String> getCommandSummary() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getCommandsSummary? or just getSummary?

}

@Override
public Optional<String> getMessage() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, why did you do it here and not as part of the RoutingExplaination returned by execute()? I'm ok if you prefer this, but just wondering to your thinking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that we didn't want to change the wire format of RerouteExplanation by adding the message to it (I think Yannick mentioned this), so here seemed like an okay place to generate the message.

Do you mean putting this message into one of the field of the Decision that the RerouteExplanation is constructed with? E.g.

public RerouteExplanation execute(RoutingAllocation allocation, boolean explain) {
    // ...
    return new RerouteExplanation(this, allocation.decision(Decision.YES, name() + " (allocation command)", "allocated an empty primary for ... rest of the message"));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was was thinking about adding a message field to RerouteExplanation it self. The write thing is not a big deal IMO

@andyb-elastic
Copy link
Contributor Author

@bleskes sorry I must have misunderstood what we talked about the other day. To be clear, we don't want to log general information about command applications (the "Allocation command [name] will/won't be applied" messages) before the cluster state is updated, regardless of whether they'll be applied? The only logging we want is messages about the effects of the command, logged after the cluster state is updated?

If I remove those "Allocation command [name] will/won't be applied" logs, AllocationCommand#description, and RoutingExplanations#getCommandSummary, does that look like what we want here?

@bleskes
Copy link
Contributor

bleskes commented Aug 4, 2017

The only logging we want is messages about the effects of the command, logged after the cluster state is updated?

Agreed!

If I remove those "Allocation command [name] will/won't be applied" logs, AllocationCommand#description, and RoutingExplanations#getCommandSummary, does that look like what we want here?

I'm not 100% sure what you mean, but I think we can go back to what you had? Filter the routing explanation for those with a Yes decision and log that?

@andyb-elastic
Copy link
Contributor Author

Sounds good, I'll go back to just the original logging (filter the explanations for YES decisions) and move the message to RerouteExplanation. Thanks

Move message to RerouteExplanation
Remove logging for summary of command applications
@andyb-elastic
Copy link
Contributor Author

I think the bwc failures on 50a292c weren't caused by this, merging master back in seems to fix them

@andyb-elastic
Copy link
Contributor Author

Okay it was confirmed those bwc failures were unrelated. @bleskes how does this version look?

@ywelsch
Copy link
Contributor

ywelsch commented Aug 15, 2017

Next to that the message must be stateless because of things are set up. Since every command returns RerouteExplanation, I find it more natural to have an optional extra message there, which actually correlates to the decision. This might also make it more intuitive to expose it on the REST API.

I think it was fine the way it was. It's actually nice for this to be stateless, as it is purely used for logging. Adding state (and transferring it over the wire) is unnecessary here. I don't see why we should expose this in the REST layer, either (I tried to formulate it in a more agreeable way than @jasontedor did -> hold off forever). If someone explicitly gives the command "allocate empty primary with accept_data_loss = true", then there is no need to add a header to the response that something bad might have happened?

I think what we had here in the beginning was fine, until it got butchered... :-(

@andyb-elastic
Copy link
Contributor Author

It sounds like the main choice here is between

  • Messages are returned by the AllocationCommand instance and are collected from there (RoutingExplanations -> RerouteExplanation -> AllocationCommand -> getMessage). This doesn't change the wire format
  • Messages are set in the RerouteExplanation returned by the AllocationCommand and are collected from there (RoutingExplanations -> RerouteExplanation -> getMessage). This adds a field to the wire format

I think both of these ideas have strong reasoning behind them. @ywelsch I'm fine going back to the earlier version, or waiting for Boaz to get back if you'd rather

I believe we're all in agreement about not exposing it in the REST layer. In the current version I didn't add it to RerouteExplanation's generated xcontent with the intent of not exposing it in the response.

@jasontedor
Copy link
Member

I agree with @ywelsch that it's better as it was previously.

@andyb-elastic
Copy link
Contributor Author

@jasontedor @ywelsch I'll change it back to the version where the wire format doesn't change

@andyb-elastic
Copy link
Contributor Author

  • Messages are accessed from the command objects, no wire format changes
  • Shortened the messages - the logs didn't seem like a useful place for this info, and it would be easy to forget to update those messages if the conditions for data loss ever change
  • Some checkstyle fixes for other parts of the command classes

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one suggestion to improve the log message o.w. LGTM

@@ -70,6 +71,11 @@ public String name() {
return NAME;
}

@Override
public Optional<String> getMessage() {
return Optional.of("Allocated a stale primary for [" + index + "][" + shardId + "] on node [" + node + "]");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log messages always start in lower case. Also I wonder if we should give this (and AllocateEmptyPrimaryAllocationCommand) a stronger wording, e.g. "force-allocated a stale primary ...", to make it clearer that this action has been explicitly forced by the user, not the system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, will make it more clear that it came from a command

@bleskes
Copy link
Contributor

bleskes commented Aug 21, 2017

until it got butchered... :-(

I think your attempt at being agreeable ended here :)

My concern was that the new message field is very similar in nature to how the description of the Decision is being used by most allocation commands - more info about a YES in a human readable format.

I defer the final decision to the one made by @ywelsch who's the current de-facto "owner" of this code. I'm fine with getting it in as is. @andy-elastic thanks for all the iterations.

@andyb-elastic
Copy link
Contributor Author

@bleskes no problem, this discussion was really helpful for learning about cluster concepts and our priorities here

I'll merge this after CI finishes. @ywelsch should this be backported to 6.x? I'm guessing yes since it's simple

@jasontedor
Copy link
Member

You can safely backport to 6.0.

@andyb-elastic andyb-elastic merged commit bdefcbd into elastic:master Aug 22, 2017
andyb-elastic added a commit that referenced this pull request Aug 22, 2017
Gives allocation commands from the cluster reroute API
the ability to provide messages to be logged once the 
cluster state change has been committed. 

The purpose of this change is to create a record in the 
logs when allocation commands which could potentially
be destructive are applied. The allocate_empty_primary
and allocate_stale_primary commands are the only ones
that currently provide log messages.

Closes #22821
@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. and removed :Cluster labels Feb 13, 2018
@jpountz jpountz added >feature and removed :Core/Infra/Logging Log management and logging utilities labels Jan 29, 2019
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. >feature v6.1.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants