ILM `LifecycleExecutionState#stepInfo` adds unbounded data to cluster state #124181

DaveCTurner · 2025-03-06T06:34:45Z

In a recent outage in which a cluster state grew too large to fit into a single transport message, we discovered that most of the space was being taken up by a LifecycleExecutionState#stepInfo of the following form:

{"type":"repository_exception","reason":"[found-snapshots] failed to delete snapshots [...6500 snapshot names elided...]","caused_by":{"type":"i_o_exception","reason":"Exception when listing blobs by prefix [null]","caused_by":{"type":"sdk_client_exception","reason":"Unable to execute HTTP request: Read timed out","caused_by":{"type":"socket_timeout_exception","reason":"Read timed out"}}}}

The `[...6500 snapshot names elided...]` portion is about 750 KiB. Because this error affected around 6500 indices, the message was duplicated that many times in the cluster state, which added up to a little over 4.6 GiB.
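As a rough check of that figure, here is a minimal sketch of the arithmetic. It assumes each copy of the message is exactly 750 KiB and that there are exactly 6500 copies; the class and method names are hypothetical and not part of Elasticsearch.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only: estimates how much space
// duplicated step info messages occupy in the cluster state.
final class StepInfoFootprint {

    // Total size in GiB of `copies` messages of `bytesPerMessage` bytes each.
    static double totalGiB(long bytesPerMessage, long copies) {
        return (double) (bytesPerMessage * copies) / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long perMessage = 750L * 1024L; // assumed: about 750 KiB per message
        long copies = 6500L;            // assumed: one copy per affected index
        // Prints roughly 4.65 GiB, consistent with "a little over 4.6 GiB".
        System.out.printf(Locale.ROOT, "%.2f GiB%n", totalGiB(perMessage, copies));
    }
}
```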

Note that this is different from the problem that #84266 fixes: it's not stack traces taking all the space, just the top-level exception message itself.

It's certainly useful to have information about ILM errors stored somewhere for future investigation, but could it be somewhere else? If it has to be in the cluster state, could we put a length limit on it just to protect against such a pathological case?

Relates #124183

elasticsearchmachine · 2025-03-06T06:35:09Z

Pinging @elastic/es-data-management (Team:Data Management)

dakrone · 2025-03-06T14:25:27Z

> It's certainly useful to have information about ILM errors stored somewhere for future investigation, but could it be somewhere else?

Perhaps we could use the ILM history store for the full error information.

> If it has to be in the cluster state, could we put a length limit on it just to protect against such a pathological case?

A length limit sounds like a good idea here.

This commit adds a limit to the `step_info` contained in `LifecycleExecutionState` so that large step info messages will not be stored in the cluster state. Additionally, when generating an ILM history failure, the full "stringified" exception is truncated to the same limit, ensuring that we do not accidentally index gigantic documents in the history store. The default limit is 1024 characters. Resolves elastic#124181
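The kind of length limit described above could look roughly like the following. This is a minimal sketch, not the code from the actual commit: the class name, method name, and the format of the truncation suffix are all hypothetical, and it treats the message as a plain string rather than as structured JSON. Only the 1024-character default comes from the commit message.

```java
// Hypothetical sketch of capping a step info message before it is stored.
// Not the actual Elasticsearch implementation.
final class StepInfoTruncation {

    // Default limit taken from the commit message above (1024 characters).
    static final int DEFAULT_MAX_LENGTH = 1024;

    // Returns the message unchanged if it fits within maxLength characters;
    // otherwise keeps the first maxLength characters and appends a note
    // saying how many characters were dropped.
    static String truncate(String message, int maxLength) {
        if (message == null || message.length() <= maxLength) {
            return message;
        }
        int omitted = message.length() - maxLength;
        // Note: substring counts UTF-16 code units, so a surrogate pair that
        // straddles the limit would be split. A real implementation may need
        // to handle that case.
        return message.substring(0, maxLength) + "... (" + omitted + " chars truncated)";
    }

    public static void main(String[] args) {
        // Simulate a pathological message of about 750 KiB.
        String huge = "x".repeat(750 * 1024);
        String capped = truncate(huge, DEFAULT_MAX_LENGTH);
        System.out.println("original length: " + huge.length());
        System.out.println("stored length:   " + capped.length());
    }
}
```

Keeping a note of how many characters were dropped makes it visible that the stored message is incomplete. With a cap like this, the per-index cost stays at roughly the limit regardless of how long the underlying exception message is.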

DaveCTurner added the `:Data Management/ILM+SLM` and `>bug` labels Mar 6, 2025

elasticsearchmachine added the `Team:Data Management` label Mar 6, 2025

DaveCTurner mentioned this issue Mar 6, 2025

failed to delete snapshots exception message is unboundedly long #124183

Closed

dakrone mentioned this issue Mar 17, 2025

Truncate step_info and error reason in ILM execution state and history #125054

Merged

elasticsearchmachine closed this as completed in #125054 Mar 18, 2025
