Search Shard Failure Logging can result in OOM #27596

Closed
mattweber opened this issue Nov 29, 2017 · 11 comments
Assignees: s1monw
Labels: :Core/Infra/Logging (Log management and logging utilities), feedback_needed

Comments

@mattweber
Contributor

mattweber commented Nov 29, 2017

I have run into an issue where we are getting OOMs on our coordinating nodes, and it appears to be from logging shard failures for queries with a large source. Some info:

  • ES 5.6.3
  • Java 1.8.0_141
  • Linux 4.9.51-10.52.amzn1.x86_64
  • Cluster is 90 data nodes, 6 client nodes, 3 masters
  • Index is 135 shards with 1 replica
  • Client heap is 80G using G1GC (I know, "not recommended").
  • Queries are MASSIVE and include span and nested queries.

So a query comes into a client node, gets parsed, and is sent to each data node for execution. When execution fails for some reason, the exception is serialized back to the client, and it appears to include the source query. The client then deserializes that exception, large source query and all, and logs it. I believe this happens for every failed shard, which means the original large query is now multiplied by the number of shards, and that can easily blow up the heap. To make it worse, it looks like the source query gets pretty printed as well; I am not sure whether that happens on the data node or the client.
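For rough numbers (a back-of-envelope sketch only; the ~43M pretty-printed source size is the figure given in a comment further down):

public class ShardFailureHeapEstimate {
    public static void main(String[] args) {
        long sourceChars = 43L * 1024 * 1024; // ~43M of pretty-printed source (see comment below)
        int failedShards = 135;               // one failure message per shard in this index
        // Java 8 strings store 2 bytes per char, and StringBuilder growth while each
        // message is assembled temporarily holds extra copies on top of this.
        long heapBytes = 2L * sourceChars * failedShards;
        System.out.printf("~%.1f GB just for the failure-message text%n",
                heapBytes / (1024.0 * 1024 * 1024));
    }
}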

Here are some log messages I see on a client:

[2017-11-29T15:33:37,374][DEBUG][o.e.a.s.TransportSearchAction] [i-072a3a38771378621] [brs-20171017-qa-ver6][76], node[bgQnrHPiSfKF5yM0goVXUg], [P], s[STARTED], a[id=jwmgKCKMRUagW2un_HHZjQ]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[brs], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=256, batchedReduceSize=512, preFilterShardSize=2, source=MASSIVE_PRETTY_PRINTED_JSON_HERE
[2017-11-29T15:34:06,815][DEBUG][o.e.a.s.TransportSearchAction] [i-072a3a38771378621] [brs-20171017-qa-ver6][102], node[eEdh_9keQZu5h88G0T43vQ], [P], s[STARTED], a[id=s9f39PsfS5q89yYrZOs2pA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[brs], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=256, batchedReduceSize=512, preFilterShardSize=2, source=MASSIVE_PRETTY_PRINTED_JSON_HERE
[2017-11-29T15:34:09,572][DEBUG][o.e.a.s.TransportSearchAction] [i-072a3a38771378621] [brs-20171017-qa-ver6][52], node[-jszFNIJQn-71Uslk2eTmA], [R], s[STARTED], a[id=DF6BYKoaRTGBqTd8BBX_ow]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[brs], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=256, batchedReduceSize=512, preFilterShardSize=2, source=MASSIVE_PRETTY_PRINTED_JSON_HERE

MASSIVE_PRETTY_PRINTED_JSON_HERE is the same source query across each log message. In between these messages I start to see GC collection messages and exceptions like the following sprinkled in:

[2017-11-29T15:34:37,788][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)
[2017-11-29T15:34:35,649][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)

And finally, the following OOM:

[2017-11-29T15:34:42,145][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [i-072a3a38771378621] fatal error in thread [Thread-7], exiting
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_141]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_141]
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_141]
        at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_141]
        at org.elasticsearch.ExceptionsHelper.detailedMessage(ExceptionsHelper.java:104) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.ShardSearchFailure.<init>(ShardSearchFailure.java:75) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:191) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:74) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.InitialSearchPhase.access$100(InitialSearchPhase.java:46) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.InitialSearchPhase$1.onFailure(InitialSearchPhase.java:169) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:51) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1067) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport.lambda$handleException$17(TcpTransport.java:1462) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport$$Lambda$2025/1394696696.run(Unknown Source) ~[?:?]
        at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1460) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1452) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1395) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
[2017-11-29T15:34:42,151][WARN ][o.e.t.n.Netty4Transport  ] [i-072a3a38771378621] exception caught on transport layer [[id: 0x4b38cdd1, L:/10.152.14.51:46126 - R:ip-10-152-12-173.us-west-2.compute.internal/10.152.12.173:9300]], closing connection
org.elasticsearch.ElasticsearchException: java.lang.OutOfMemoryError: Java heap space
        at org.elasticsearch.transport.netty4.Netty4Transport.exceptionCaught(Netty4Transport.java:323) [transport-netty4-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:84) [transport-netty4-5.6.3.jar:5.6.3]
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_141]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_141]
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_141]
        at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_141]
        at org.apache.logging.log4j.core.pattern.PlainTextRenderer.render(PlainTextRenderer.java:32) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.impl.ThrowableProxy.formatThrowableProxy(ThrowableProxy.java:224) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.impl.ThrowableProxy.formatCause(ThrowableProxy.java:211) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.impl.ThrowableProxy.getExtendedStackTraceAsString(ThrowableProxy.java:476) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.pattern.ExtendedThrowablePatternConverter.format(ExtendedThrowablePatternConverter.java:72) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.layout.PatternLayout$PatternSerializer.toSerializable(PatternLayout.java:333) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.layout.PatternLayout.toText(PatternLayout.java:232) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:217) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:57) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:177) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:308) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:439) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:434) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.spi.ExtendedLoggerWrapper.logMessage(ExtendedLoggerWrapper.java:217) ~[log4j-api-2.9.1.jar:2.9.1]

It does not appear there is any way to prevent this, considering it is happening while reading these large serialized exceptions.

@s1monw s1monw self-assigned this Nov 30, 2017
@s1monw
Contributor

s1monw commented Nov 30, 2017

@mattweber what size are we talking about here per search request source? Can you give me an example of how big they are?

@mattweber
Contributor Author

@s1monw The query that caused the above OOM was 3.5M of non-pretty JSON. I pretty printed it and it was 43M. I have seen larger queries as well.
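A quick way to check that kind of expansion (a sketch using Jackson; any JSON library would do):

import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PrettyExpansion {
    public static void main(String[] args) throws Exception {
        // Re-serialize a compact query body with a pretty printer and compare sizes.
        byte[] compact = Files.readAllBytes(Paths.get(args[0]));
        ObjectMapper mapper = new ObjectMapper();
        Object tree = mapper.readValue(compact, Object.class);
        byte[] pretty = mapper.writerWithDefaultPrettyPrinter()
                .writeValueAsString(tree)
                .getBytes(StandardCharsets.UTF_8);
        System.out.printf("compact=%d bytes, pretty=%d bytes (%.1fx)%n",
                compact.length, pretty.length, (double) pretty.length / compact.length);
    }
}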

Ultimately the issue is on us for submitting an invalid query, but I thought it was crazy that we ended up crashing the client node. I think this could happen even in a "normal" case if enough shards fail and the coordinating node has a smaller heap.

Is the "source" pretty printed and converted to a string on the shard to include in the exception or does that happen on the client after deserialization?

@spinscale spinscale added the :Core/Infra/Logging Log management and logging utilities label Dec 1, 2017
@mattweber
Contributor Author

Looks like it might just be the client-side shard failure logging:

https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/search/InitialSearchPhase.java#L85

That logs the SearchRequest, whose toString then converts the SearchSourceBuilder with the humanReadable flag enabled. Ultimately I think that humanReadable flag is the issue, especially given the kind of expansion I observed above.

https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/search/builder/SearchSourceBuilder.java#L1499
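Roughly, that toString path looks something like this (my paraphrase of the 5.6 code, not the actual source):

import java.io.IOException;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Approximation only: the entire source is re-rendered as JSON on every toString() call.
static String render(SearchSourceBuilder source) throws IOException {
    XContentBuilder builder = XContentFactory.jsonBuilder();
    builder.prettyPrint();          // where the whitespace expansion would come from
    builder.humanReadable(true);    // the humanReadable flag referenced above
    source.toXContent(builder, ToXContent.EMPTY_PARAMS);
    return builder.string();
}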

@mattweber
Contributor Author

But that doesn't make sense, because that should only happen if debug logging is enabled (it is for that package by default); I had disabled it and still ran into the OOM, with exceptions coming out of the netty layer.

@s1monw
Contributor

s1monw commented Dec 5, 2017

That logs the SearchRequest, whose toString then converts the SearchSourceBuilder with the humanReadable flag enabled. Ultimately I think that humanReadable flag is the issue, especially given the kind of expansion I observed above.

humanReadable doesn't mean it's pretty printed! In fact, it's not. I think the requests are just too large. I wonder if we should cap them if they are bigger than X MB? They are also stored inside the task etc., so they can be massive.
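Something along these lines, just to sketch the idea (a hypothetical helper, not an existing API or setting):

// Cap the rendered source before it goes into exception/log messages.
static String cap(String renderedSource, int maxChars) {
    if (renderedSource.length() <= maxChars) {
        return renderedSource;
    }
    return renderedSource.substring(0, maxChars)
            + "... (truncated, " + renderedSource.length() + " chars total)";
}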

@s1monw
Contributor

s1monw commented Dec 5, 2017

@jasontedor @rjernst WDYT? I wonder if we should limit the number of bytes in the toString method of the SearchSourceBuilder.

@mattweber
Contributor Author

@s1monw There is definitely pretty printing happening, and it is easily reproducible. Just create an index and do a search against a non-existent nested field:

{"query": {"nested":{"path":"blah", "query":{"match_all":{}}}}}}

You will get a response like:

{
    "error": {
        "root_cause": [
            {
                "type": "query_shard_exception",
                "reason": "failed to create query: {\n  \"nested\" : {\n    \"query\" : {\n      \"match_all\" : {\n        \"boost\" : 1.0\n      }\n    },\n    \"path\" : \"blah\",\n    \"ignore_unmapped\" : false,\n    \"score_mode\" : \"avg\",\n    \"boost\" : 1.0\n  }\n}",
                "index_uuid": "Ih1UNJGrT82n7z0Ds1o72g",
                "index": "logtest"
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "logtest",
                "node": "ZwHwzg5AQpS2tg6-mBFv6A",
                "reason": {
                    "type": "query_shard_exception",
                    "reason": "failed to create query: {\n  \"nested\" : {\n    \"query\" : {\n      \"match_all\" : {\n        \"boost\" : 1.0\n      }\n    },\n    \"path\" : \"blah\",\n    \"ignore_unmapped\" : false,\n    \"score_mode\" : \"avg\",\n    \"boost\" : 1.0\n  }\n}",
                    "index_uuid": "Ih1UNJGrT82n7z0Ds1o72g",
                    "index": "logtest",
                    "caused_by": {
                        "type": "illegal_state_exception",
                        "reason": "[nested] failed to find nested object under path [blah]"
                    }
                }
            }
        ]
    },
    "status": 400
}

A ton of additional whitespace and newline characters are added that are not part of the original request. The log message looks like:

[2017-12-05T07:59:07,562][DEBUG][o.e.a.s.TransportSearchAction] [ZwHwzg5] All shards failed for phase: [query]
org.elasticsearch.index.query.QueryShardException: failed to create query: {
  "nested" : {
    "query" : {
      "match_all" : {
        "boost" : 1.0
      }
    },
    "path" : "blah",
    "ignore_unmapped" : false,
    "score_mode" : "avg",
    "boost" : 1.0
  }
}
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:329) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:312) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.search.SearchService.parseSource(SearchService.java:617) ~[elasticsearch-5.6.3.jar:5.6.3]
...

This was on ES 5.6.3; I will test 6.0 as well.

@mattweber
Contributor Author

Tested against our real cluster with the same query as in the previous comment, and I again see the pretty printing and duplication for each shard...

[2017-12-05T16:36:04,641][DEBUG][o.e.a.s.TransportSearchAction] [i-072a3a38771378621] [brs-20171125-qa-ver2][127], node[-jaI4JyNRbOlcKl9HvZVTQ], [R], s[STARTED], a[id=W3l6T-poTEqAO289q2JluA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[brs], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=256, batchedReduceSize=512, preFilterShardSize=128, source={
  "query" : {
    "nested" : {
      "query" : {
        "match_all" : {
          "boost" : 1.0
        }
      },
      "path" : "blah",
      "ignore_unmapped" : false,
      "score_mode" : "avg",
      "boost" : 1.0
    }
  }
}}]

@mattweber
Contributor Author

Same behavior on 6.0 as well:

[2017-12-05T08:40:47,358][DEBUG][o.e.a.s.TransportSearchAction] [cdojAdn] All shards failed for phase: [query]
org.elasticsearch.index.query.QueryShardException: failed to create query: {
  "nested" : {
    "query" : {
      "match_all" : {
        "boost" : 1.0
      }
    },
    "path" : "blah",
    "ignore_unmapped" : false,
    "score_mode" : "avg",
    "boost" : 1.0
  }
}
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:327) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:310) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.search.SearchService.parseSource(SearchService.java:669) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:530) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:506) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:302) ~[elasticsearch-6.0.0.jar:6.0.0]

@s1monw
Contributor

s1monw commented Dec 18, 2017

@mattweber can you please try 6.1 again? I think this change has removed the cause for this entirely. If so, I will backport the SearchSourceBuilder change to 5.6 in case we do another release. This is also why I didn't find this earlier; I was always looking at 6.x and master, but the code was different there.

@jasontedor
Member

No additional feedback, closing.
