ILM LifecycleExecutionState#stepInfo adds unbounded data to cluster state #124181

Closed
DaveCTurner opened this issue Mar 6, 2025 · 2 comments · Fixed by #125054
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team

Comments

DaveCTurner commented Mar 6, 2025

In a recent outage in which a cluster state grew too large to fit into a single transport message, we discovered that most of the space was being taken up by a LifecycleExecutionState#stepInfo of the following form:

{"type":"repository_exception","reason":"[found-snapshots] failed to delete snapshots [...6500 snapshot names elided...]","caused_by":{"type":"i_o_exception","reason":"Exception when listing blobs by prefix [null]","caused_by":{"type":"sdk_client_exception","reason":"Unable to execute HTTP request: Read timed out","caused_by":{"type":"socket_timeout_exception","reason":"Read timed out"}}}}

The `[...6500 snapshot names elided...]` portion is about 750KiB. Because this error affected around 6500 indices, the message was duplicated that many times in the cluster state, which added up to a little over 4.6GiB.
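
As a rough check of the figures above, here is a small sketch that multiplies them out. It assumes the message is exactly 750KiB and is stored once for each of exactly 6500 indices; both are the approximate values quoted in this comment, not measurements, and the class and method names are made up for illustration.

```java
import java.util.Locale;

class StepInfoFootprint {
    // Approximate size of one stepInfo message, as quoted above (assumed exact here).
    static final long BYTES_PER_MESSAGE = 750L * 1024;
    // Approximate number of affected indices, each carrying its own copy of the message.
    static final long AFFECTED_INDICES = 6_500L;

    // Total bytes contributed to the cluster state by the duplicated message.
    static long totalBytes() {
        return BYTES_PER_MESSAGE * AFFECTED_INDICES;
    }

    // Converts a byte count to GiB (2^30 bytes).
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long total = totalBytes();
        System.out.println(String.format(Locale.ROOT, "%d bytes = %.2f GiB", total, toGiB(total)));
    }
}
```

Under those assumptions the total comes to roughly 4.65GiB, which matches the "a little over 4.6GiB" observed.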

Note that this is different from the problem that #84266 fixes: here it's not stack traces taking all the space, just the top-level exception message itself.

It's certainly useful to have information about ILM errors stored somewhere for future investigation, but could it be somewhere else? If it has to be in the cluster state, could we put a length limit on it just to protect against such a pathological case?

Relates #124183

@DaveCTurner DaveCTurner added :Data Management/ILM+SLM Index and Snapshot lifecycle management >bug labels Mar 6, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Mar 6, 2025
@elasticsearchmachine

Pinging @elastic/es-data-management (Team:Data Management)

dakrone commented Mar 6, 2025

> It's certainly useful to have information about ILM errors stored somewhere for future investigation, but could it be somewhere else?

Perhaps we could use the ILM history store for the full error information.

> If it has to be in the cluster state, could we put a length limit on it just to protect against such a pathological case?

A length limit sounds like a good idea here.

dakrone added a commit to dakrone/elasticsearch that referenced this issue Mar 17, 2025
This commit adds a limit to the `step_info` contained in `LifecycleExecutionState` so that large step
info messages will not be stored in the cluster state. Additionally, when generating an ILM history
failure, the full exception that is "stringified" is truncated to the same limit, ensuring that we
do not accidentally index gigantic documents in the history store.

The default limit is 1024 characters.

Resolves elastic#124181
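
A minimal sketch of the kind of length limit this commit message describes. This is not the code from the Elasticsearch change: the class `StepInfoLimiter`, the method `truncate`, and the `... (N chars truncated)` marker are hypothetical names chosen for illustration, and only the 1024-character default comes from the text above.

```java
final class StepInfoLimiter {
    // Default taken from the commit message above; everything else here is illustrative.
    static final int DEFAULT_MAX_LENGTH = 1024;

    private StepInfoLimiter() {}

    /**
     * Returns the input unchanged if it fits within maxLength chars. Otherwise keeps
     * at most the first maxLength chars and appends a short marker saying how much
     * was dropped, so the result is slightly longer than maxLength.
     */
    static String truncate(String stepInfo, int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be non-negative");
        }
        if (stepInfo == null || stepInfo.length() <= maxLength) {
            return stepInfo;
        }
        int cut = maxLength;
        // Avoid cutting between the two halves of a surrogate pair.
        if (cut > 0 && Character.isHighSurrogate(stepInfo.charAt(cut - 1))) {
            cut--;
        }
        int dropped = stepInfo.length() - cut;
        return stepInfo.substring(0, cut) + "... (" + dropped + " chars truncated)";
    }

    public static void main(String[] args) {
        // Simulate an exception message that embeds thousands of snapshot names.
        String huge = "failed to delete snapshots [" + "snapshot-name,".repeat(50_000) + "]";
        String limited = truncate(huge, DEFAULT_MAX_LENGTH);
        System.out.println(huge.length() + " chars -> " + limited.length() + " chars");
    }
}
```

Note that this sketch cuts a plain string. If the step info is stored as JSON, as in the example earlier in this issue, cutting the serialized form blindly would leave invalid JSON, so a real implementation would presumably need to shorten individual fields instead.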
elasticsearchmachine pushed a commit that referenced this issue Mar 18, 2025
#125054)

dakrone added a commit to dakrone/elasticsearch that referenced this issue Mar 18, 2025
elastic#125054)

elasticsearchmachine pushed a commit that referenced this issue Mar 18, 2025
#125054) (#125140)

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this issue Mar 21, 2025
elastic#125054)

omricohenn pushed a commit to omricohenn/elasticsearch that referenced this issue Mar 28, 2025
elastic#125054)
