-
Notifications
You must be signed in to change notification settings - Fork 25.2k
log messages from allocation commands #25955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
log messages from allocation commands #25955
Conversation
.map(Optional::get) | ||
.collect(Collectors.toList()); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may make more sense to put this method in RoutingExplanations itself since that's what it operates on. My thinking here was that the logic of "which messages do we actually surface" is closer to the response handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above. I'm +0 to move it to RoutingExplanations. If you prefer it here, I'm good with leaving it here too.
@bleskes this is the one I mentioned, thanks |
This approach is much better than the previous approaches, it looks promising to me. I'm not going to give a full review, I'll leave that to @bleskes. However, I want to make one comment:
I think we can hold off on that forever, that strikes me as unnecessary. The point of getting this in the logs is for diagnostic purposes, REST responses are ephemeral. A user using this endpoint has (hopefully) thoroughly read the docs and understands what they are doing, adding this to the response provides nothing over that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I left two minor comments
@@ -56,6 +60,15 @@ public RoutingExplanations getExplanations() { | |||
return this.explanations; | |||
} | |||
|
|||
public List<String> getMessages() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add javadocs for this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, missed that
if (request.dryRun() == false) { | ||
response.getMessages().forEach(logger::info); | ||
} | ||
listener.onResponse(response); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this go into a finally
block in case an exception is thrown in getMessages
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. If that were to happen without finally
, would an error response with the exception be returned somewhere "upstream", or would the request handling just time out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume that the error would be logged and the connection would hang until the client timed out (just a guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ActionListener#wrap protects against this:
static <Response> ActionListener<Response> wrap(CheckedConsumer<Response, ? extends Exception> onResponse,
Consumer<Exception> onFailure) {
return new ActionListener<Response>() {
@Override
public void onResponse(Response response) {
try {
onResponse.accept(response);
} catch (Exception e) {
onFailure(e);
}
}
@Override
public void onFailure(Exception e) {
onFailure.accept(e);
}
};
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor comments here and there, but I have a bigger issue I want to bring up for discussion. I personally find it unnatural that the messages API is only relevant when the decision is YES. This is also subtle is it not reflected in our code paths but is a convention. Next to that the message must be stateless because of things are set up. Since every command returns RerouteExplanation
, I find it more natural to have an optional extra message there, which actually correlates to the decision. This might also make it more intuitive to expose it on the REST API. We can still choose to log "yes" messages under info. Did you consider this option? Just an idea - what do you think?
@@ -56,6 +60,15 @@ public RoutingExplanations getExplanations() { | |||
return this.explanations; | |||
} | |||
|
|||
public List<String> getMessages() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add some java docs? the relationship between Explanations and messages is confusing imo.
.map(Optional::get) | ||
.collect(Collectors.toList()); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above. I'm +0 to move it to RoutingExplanations. If you prefer it here, I'm good with leaving it here too.
ActionListener<ClusterRerouteResponse> logWrapper = ActionListener.wrap( | ||
response -> { | ||
if (request.dryRun() == false) { | ||
response.getMessages().forEach(logger::info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can come up with some prefix to those. We log them after the operation has been done and which the tense a bit weird - see for example https://github.com/elastic/elasticsearch/pull/25955/files#diff-dc1ff30a0cef7322a73dd1a8e33a1e3aR52 : :"Allocating an empty primary for" - it should be "Allocated" as it is already done. I can't come up with a good prefix though, so by default I would suggest changing the messages to use past tense.
if (request.dryRun() == false) { | ||
response.getMessages().forEach(logger::info); | ||
} | ||
listener.onResponse(response); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ActionListener#wrap protects against this:
static <Response> ActionListener<Response> wrap(CheckedConsumer<Response, ? extends Exception> onResponse,
Consumer<Exception> onFailure) {
return new ActionListener<Response>() {
@Override
public void onResponse(Response response) {
try {
onResponse.accept(response);
} catch (Exception e) {
onFailure(e);
}
}
@Override
public void onFailure(Exception e) {
onFailure.accept(e);
}
};
}
Boaz and I talked and decided we liked making the messages accessible from RerouteExplanations directly, and that the logging should include more background on which commands were applied. It could be something like (1) log each reroute command that had a YES decision, and then maybe separately (2) log messages when the cluster state is applied |
I changed it so that (not during a dry run) it always logs whether each command will be applied, and separately logs feedback messages from commands that were applied after the cluster state is updated. Depending on how we feel about this it might make sense to log the explanations from the decisions too (the ones I looked at seemed informative) The log messages for the test in here (the first allocation command gets a yes decision, the second gets a no) look like this
This is still kind of going on here, but seems a little better since there's a distinction between the messages and the decision summaries now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This look good, my only request is to remove the first two lines from:
1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will be applied: allocate an empty primary for [test_index][0] on node [node_t0]
1> [2017-08-03T00:10:17,291][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocation command [allocate_empty_primary] will not be applied: allocate an empty primary for [noexist][1] on node [node_t1]
1> [2017-08-03T00:10:17,379][INFO ][o.e.a.a.c.r.TransportClusterRerouteAction] [node_t0] Allocated an empty primary for [test_index][0] on node [node_t0]. This action can cause data loss. If the old primary rejoins the cluster, its copy of this shard will be deleted.
We only need to log things that were applied (the goal was to gain insight in the log for those potentially dangerous commands). I also left a question.
Thx @andy-elastic
/** | ||
* Builds messages for the user that indicate whether commands were applied to the cluster state or not. | ||
*/ | ||
public List<String> getCommandSummary() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getCommandsSummary? or just getSummary?
} | ||
|
||
@Override | ||
public Optional<String> getMessage() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question, why did you do it here and not as part of the RoutingExplaination
returned by execute()? I'm ok if you prefer this, but just wondering to your thinking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was that we didn't want to change the wire format of RerouteExplanation by adding the message to it (I think Yannick mentioned this), so here seemed like an okay place to generate the message.
Do you mean putting this message into one of the field of the Decision that the RerouteExplanation is constructed with? E.g.
public RerouteExplanation execute(RoutingAllocation allocation, boolean explain) {
// ...
return new RerouteExplanation(this, allocation.decision(Decision.YES, name() + " (allocation command)", "allocated an empty primary for ... rest of the message"));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was was thinking about adding a message field to RerouteExplanation it self. The write thing is not a big deal IMO
@bleskes sorry I must have misunderstood what we talked about the other day. To be clear, we don't want to log general information about command applications (the "Allocation command [name] will/won't be applied" messages) before the cluster state is updated, regardless of whether they'll be applied? The only logging we want is messages about the effects of the command, logged after the cluster state is updated? If I remove those "Allocation command [name] will/won't be applied" logs, |
Agreed!
I'm not 100% sure what you mean, but I think we can go back to what you had? Filter the routing explanation for those with a Yes decision and log that? |
Sounds good, I'll go back to just the original logging (filter the explanations for YES decisions) and move the message to RerouteExplanation. Thanks |
I think the bwc failures on 50a292c weren't caused by this, merging master back in seems to fix them |
Okay it was confirmed those bwc failures were unrelated. @bleskes how does this version look? |
I think it was fine the way it was. It's actually nice for this to be stateless, as it is purely used for logging. Adding state (and transferring it over the wire) is unnecessary here. I don't see why we should expose this in the REST layer, either (I tried to formulate it in a more agreeable way than @jasontedor did -> hold off forever). If someone explicitly gives the command "allocate empty primary with accept_data_loss = true", then there is no need to add a header to the response that something bad might have happened? I think what we had here in the beginning was fine, until it got butchered... :-( |
It sounds like the main choice here is between
I think both of these ideas have strong reasoning behind them. @ywelsch I'm fine going back to the earlier version, or waiting for Boaz to get back if you'd rather I believe we're all in agreement about not exposing it in the REST layer. In the current version I didn't add it to RerouteExplanation's generated xcontent with the intent of not exposing it in the response. |
I agree with @ywelsch that it's better as it was previously. |
@jasontedor @ywelsch I'll change it back to the version where the wire format doesn't change |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one suggestion to improve the log message o.w. LGTM
@@ -70,6 +71,11 @@ public String name() { | |||
return NAME; | |||
} | |||
|
|||
@Override | |||
public Optional<String> getMessage() { | |||
return Optional.of("Allocated a stale primary for [" + index + "][" + shardId + "] on node [" + node + "]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log messages always start in lower case. Also I wonder if we should give this (and AllocateEmptyPrimaryAllocationCommand) a stronger wording, e.g. "force-allocated a stale primary ...", to make it clearer that this action has been explicitly forced by the user, not the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea, will make it more clear that it came from a command
I think your attempt at being agreeable ended here :) My concern was that the new message field is very similar in nature to how the description of the Decision is being used by most allocation commands - more info about a I defer the final decision to the one made by @ywelsch who's the current de-facto "owner" of this code. I'm fine with getting it in as is. @andy-elastic thanks for all the iterations. |
You can safely backport to 6.0. |
Gives allocation commands from the cluster reroute API the ability to provide messages to be logged once the cluster state change has been committed. The purpose of this change is to create a record in the logs when allocation commands which could potentially be destructive are applied. The allocate_empty_primary and allocate_stale_primary commands are the only ones that currently provide log messages. Closes #22821
For #22821
The first approach we tried was #25325. This only logs the messages from allocation commands once the cluster state update has been acknowledged. @ywelsch (who I got this in a little too late for) recommended to hold off on exposing it in the rest layer for now.