ILM `LifecycleExecutionState#stepInfo` adds unbounded data to cluster state #124181
Comments
Pinging @elastic/es-data-management (Team:Data Management)
Perhaps we could use the ILM history store for the full error information.
A length limit sounds like a good idea here.
Commits referencing this issue (#125054, #125140):
This commit adds a limit to the `step_info` contained in `LifecycleExecutionState` so that large step info messages are not stored in the cluster state. Additionally, when generating an ILM history failure, the "stringified" full exception is truncated to the same limit, so that we do not accidentally index gigantic documents into the history store. The default limit is 1024 characters. Resolves elastic#124181
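As an illustration of the kind of limit the commit message describes, here is a minimal sketch in Java. It is not the code from the linked commits: the class `StepInfoTruncation`, the method `truncate`, and the suffix format are hypothetical names chosen for this example, and only the 1024-character default comes from the commit message.

```java
// Hypothetical sketch; not the actual Elasticsearch implementation.
final class StepInfoTruncation {

    // Default limit taken from the commit message ("1024 characters").
    static final int DEFAULT_MAX_LENGTH = 1024;

    /**
     * Returns the input unchanged when it fits within maxLength characters.
     * Otherwise keeps the first maxLength characters and appends a short
     * marker recording how many characters were dropped.
     */
    static String truncate(String stepInfo, int maxLength) {
        if (stepInfo == null || stepInfo.length() <= maxLength) {
            return stepInfo;
        }
        int omitted = stepInfo.length() - maxLength;
        // Note: cutting at a fixed offset can split a surrogate pair and
        // leaves serialized JSON syntactically incomplete. A real
        // implementation would need to decide how to handle both cases.
        return stepInfo.substring(0, maxLength) + "... (" + omitted + " chars truncated)";
    }

    public static void main(String[] args) {
        String huge = "x".repeat(750 * 1024); // roughly the size reported in this issue
        String limited = truncate(huge, DEFAULT_MAX_LENGTH);
        System.out.println("before: " + huge.length() + " chars, after: " + limited.length() + " chars");
    }
}
```

The marker suffix means the stored value is slightly longer than the limit itself; whether the limit should include the marker is a design choice this sketch leaves open.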
In a recent outage in which a cluster state grew too large to fit into a single transport message, we discovered that most of the space was being taken up by a `LifecycleExecutionState#stepInfo` of the following form:
{"type":"repository_exception","reason":"[found-snapshots] failed to delete snapshots [...6500 snapshot names elided...]","caused_by":{"type":"i_o_exception","reason":"Exception when listing blobs by prefix [null]","caused_by":{"type":"sdk_client_exception","reason":"Unable to execute HTTP request: Read timed out","caused_by":{"type":"socket_timeout_exception","reason":"Read timed out"}}}}
The `[...6500 snapshot names elided...]` part is about 750 KiB. Because this error affected around 6500 indices, the message was duplicated that many times in the cluster state, which added up to a little over 4.6 GiB.
Note that this is different from the problem that #84266 fixes: it's not stack traces taking all the space, it's just the top-level exception message itself.
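The size estimate above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: the class and method names are made up for this example, the 750 KiB and 6500 figures are the approximate values quoted in this issue, and the comparison with a 1024-character limit assumes one byte per character.

```java
// Back-of-the-envelope check of the figures quoted in this issue.
final class StepInfoSizeEstimate {

    // Total bytes when the same message is stored once per affected index.
    static long totalBytes(long bytesPerMessage, long indexCount) {
        return bytesPerMessage * indexCount;
    }

    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long perMessage = 750L * 1024; // about 750 KiB per stepInfo
        long indices = 6500;           // about 6500 affected indices

        long unbounded = totalBytes(perMessage, indices);
        System.out.printf("unbounded: %.2f GiB%n", toGiB(unbounded));

        // With a 1024-character cap, assuming 1 byte per character.
        long capped = totalBytes(1024, indices);
        System.out.printf("capped:    %.2f MiB%n", toMiB(capped));
    }
}
```

Under these assumptions the unbounded case comes to roughly 4.65 GiB, consistent with the "a little over 4.6 GiB" reported, while the capped case stays in the single-digit MiB range.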
It's certainly useful to have information about ILM errors stored somewhere for future investigation, but could it be somewhere else? If it has to be in the cluster state, could we put a length limit on it just to protect against such a pathological case?
Relates #124183