-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Warn on slow metadata persistence #47005
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warn on slow metadata persistence #47005
Conversation
Today if metadata persistence is excessively slow on a master-ineligible node then the `ClusterApplierService` emits a warning indicating that the `GatewayMetaState` applier was slow, but gives no further details. If it is excessively slow on a master-eligible node then we do not see any warning at all, although we might see other consequences such as a lagging node or a master failure. With this commit we emit a warning if metadata persistence takes longer than a configurable threshold, which defaults to `10s`. We also emit statistics that record how much index metadata was persisted and how much was skipped since this can help distinguish cases where IO was slow from cases where there are simply too many indices involved.
Pinging @elastic/es-distributed |
Failure looks unrelated, I opened #47012. @elasticmachine please run elasticsearch-ci/1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great change. Left one optional comment. Looks good o.w.
@@ -320,6 +358,22 @@ void rollback() { | |||
rollbackCleanupActions.forEach(Runnable::run); | |||
finished = true; | |||
} | |||
|
|||
void incrementIndicesWritten() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we can just make the fields package-visible and operate directly on them, removing getter and setters here, given that their are only used within IncrementalClusterStateWriter
. Reduces the amount of clutter and avoids the extra test provisions for the interaction mocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asserting on the interactions with the increment*
methods is deliberate; it would add a good deal more noise to check that they're being called in a different way.
Today if metadata persistence is excessively slow on a master-ineligible node then the `ClusterApplierService` emits a warning indicating that the `GatewayMetaState` applier was slow, but gives no further details. If it is excessively slow on a master-eligible node then we do not see any warning at all, although we might see other consequences such as a lagging node or a master failure. With this commit we emit a warning if metadata persistence takes longer than a configurable threshold, which defaults to `10s`. We also emit statistics that record how much index metadata was persisted and how much was skipped since this can help distinguish cases where IO was slow from cases where there are simply too many indices involved.
Today if metadata persistence is excessively slow on a master-ineligible node then the `ClusterApplierService` emits a warning indicating that the `GatewayMetaState` applier was slow, but gives no further details. If it is excessively slow on a master-eligible node then we do not see any warning at all, although we might see other consequences such as a lagging node or a master failure. With this commit we emit a warning if metadata persistence takes longer than a configurable threshold, which defaults to `10s`. We also emit statistics that record how much index metadata was persisted and how much was skipped since this can help distinguish cases where IO was slow from cases where there are simply too many indices involved. Backport of #47005.
Today if metadata persistence is excessively slow on a master-ineligible node
then the
ClusterApplierService
emits a warning indicating that theGatewayMetaState
applier was slow, but gives no further details. If it isexcessively slow on a master-eligible node then we do not see any warning at
all, although we might see other consequences such as a lagging node or a
master failure.
With this commit we emit a warning if metadata persistence takes longer than a
configurable threshold, which defaults to
10s
. We also emit statistics thatrecord how much index metadata was persisted and how much was skipped since
this can help distinguish cases where IO was slow from cases where there are
simply too many indices involved.