Memory leak upon partial TransportShardBulkAction failure #27300

Closed
xgwu opened this issue Nov 7, 2017 · 2 comments · Fixed by #32616
Labels: >bug · :Core/Infra/Logging Log management and logging utilities · team-discuss

xgwu commented Nov 7, 2017

Describe the feature:

Elasticsearch version (bin/elasticsearch --version):
5.3.2 - 5.6.3

Plugins installed: [None]

JVM version (java -version):
1.8.0_77-b03

OS version (uname -a if on a Unix-like system):
Linux SVR14982HW1288 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
A production cluster running version 5.3.2 experienced very high heap usage after bulk updating some documents. The search load was very light and the cluster was almost idle. Old-generation GC could not reclaim the memory even after the bulk requests had stopped. After some investigation, Log4j appears to be the culprit: it holds a strong reference to the BulkShardRequest object whenever an exception is thrown during bulk request execution.

The problem looks quite similar to issue #23798, but I can reproduce it on the latest stable version, 5.6.3, so the root cause could be different.

Steps to reproduce:
Update a batch of documents with the bulk API, but deliberately make some of the requests fail, for example by including a few items with non-existent doc IDs or a wrong field type. In the ES logs, DEBUG messages appear reporting document-missing or mapper-parsing exceptions. Heap usage increases significantly, depending on the size of a single bulk request. Dumping the heap and analyzing it with MAT shows that the BulkShardRequest object is referenced by Log4j's ParameterizedMessage.

The only way I can reclaim the memory is to issue another small bulk request that triggers an exception; in that case, the ParameterizedMessage object references the new request, which has a small memory footprint.

Below are the sample heap dump stats for our production cluster:
[Screenshots: MAT heap dump statistics, 2017-11-07]
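
To illustrate the retention pattern described above, here is a minimal, self-contained sketch (not Elasticsearch code; the class name and array size are made up) of how a large argument passed to a parameterized DEBUG call can stay reachable from Log4j's reusable, thread-local log event until the same thread logs again:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class ThreadLocalRetentionDemo {
    private static final Logger logger = LogManager.getLogger(ThreadLocalRetentionDemo.class);

    public static void main(String[] args) {
        // Stand-in for a large BulkShardRequest (~100 MB).
        byte[] hugeRequest = new byte[100 * 1024 * 1024];

        // With thread-local reuse enabled and DEBUG logging active, the
        // per-thread MutableLogEvent keeps referencing the logged parameter
        // after this call returns.
        logger.debug("failed to execute bulk item, request: {}", hugeRequest);

        // Dropping our own reference is not enough: the array stays reachable
        // from the thread-local event until this thread logs another
        // parameterized message, mirroring what was observed above.
        hugeRequest = null;
        System.gc();
    }
}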

DaveCTurner added the :Core/Infra/Logging Log management and logging utilities label Nov 7, 2017

xgwu commented Nov 8, 2017

It looks like this relates to Log4j's ReusableLogEventFactory, which stores a MutableLogEvent in a ThreadLocal:

// Excerpt from org.apache.logging.log4j.core.impl.ReusableLogEventFactory
public class ReusableLogEventFactory implements LogEventFactory {
    private static final ThreadNameCachingStrategy THREAD_NAME_CACHING_STRATEGY = ThreadNameCachingStrategy.create();
    private static final Clock CLOCK = ClockFactory.getClock();

    // One reusable MutableLogEvent per thread; it keeps referencing the
    // parameters of the last message logged on that thread.
    private static ThreadLocal<MutableLogEvent> mutableLogEventThreadLocal = new ThreadLocal<>();
    private final ContextDataInjector injector = ContextDataInjectorFactory.createInjector();
    // ...

I tried adding -Dlog4j2.enable.threadlocals=false to ES's jvm.options; since then, bulk exceptions no longer retain huge amounts of memory. So I think this is a way to work around the problem.
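
For reference, the corresponding jvm.options entry might look like this (a sketch; the property is Log4j's documented log4j2.enable.threadlocals switch):

# Disable Log4j 2 thread-local event reuse so the per-thread
# MutableLogEvent cannot pin the last logged bulk request.
-Dlog4j2.enable.threadlocals=false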


xgwu commented Nov 8, 2017

This Log4j reference, Garbage-free Steady State Logging, explains how such memory leaks can occur:

Garbage-free logging in Log4j 2.6 is partially implemented by reusing objects in ThreadLocal fields, and partially by reusing buffers when converting text to bytes.

ThreadLocal fields holding non-JDK classes can cause memory leaks in web applications when the application server's thread pool continues to reference these fields after the web application is undeployed. To avoid causing memory leaks, Log4j will not use these ThreadLocals when it detects that it is used in a web application (when the javax.servlet.Servlet class is in the classpath, or when system property log4j2.is.webapp is set to "true").
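
For what it's worth, whether Log4j will actually use these ThreadLocals can be checked at runtime through the public constants exposed by log4j-api (a small sketch, assuming log4j-api 2.6+ on the classpath):

import org.apache.logging.log4j.util.Constants;

public class CheckLog4jThreadLocals {
    public static void main(String[] args) {
        // true when the javax.servlet.Servlet class is on the classpath or
        // -Dlog4j2.is.webapp=true is set
        System.out.println("IS_WEB_APP          = " + Constants.IS_WEB_APP);
        // false when -Dlog4j2.enable.threadlocals=false is set or when
        // running as a web application
        System.out.println("ENABLE_THREADLOCALS = " + Constants.ENABLE_THREADLOCALS);
    }
}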

colings86 added the >bug label Apr 24, 2018
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 3, 2018
* Upgrade to `2.11.1` to fix memory leaks in slow logger when logging large requests
   * This was caused by a bug in Log4J https://issues.apache.org/jira/browse/LOG4J2-2269 and is fixed in `2.11.1` via https://git-wip-us.apache.org/repos/asf?p=logging-log4j2.git;h=9496c0c
* Fixes elastic#32537
* Fixes elastic#27300
original-brownbear added a commit that referenced this issue Aug 6, 2018
* LOGGING: Upgrade to Log4J 2.11.1
* Upgrade to `2.11.1` to fix memory leaks in slow logger when logging large requests
   * This was caused by a bug in Log4J https://issues.apache.org/jira/browse/LOG4J2-2269 and is fixed in `2.11.1` via https://git-wip-us.apache.org/repos/asf?p=logging-log4j2.git;h=9496c0c
* Fixes #32537
* Fixes #27300