Search Shard Failure Logging can result in OOM #27596

Closed
mattweber opened this issue Nov 29, 2017 · 11 comments
Assignees: s1monw
Labels: :Core/Infra/Logging (Log management and logging utilities), feedback_needed

Comments

@mattweber
Contributor

mattweber commented Nov 29, 2017

I have run into an issue where we are getting OOMs on our coordinating nodes, and it appears to be from logging shard failures for queries with a large source. Some info:

  • ES 5.6.3
  • Java 1.8.0_141
  • Linux 4.9.51-10.52.amzn1.x86_64
  • Cluster is 90 data nodes, 6 client nodes, 3 masters
  • Index is 135 shards with 1 replica
  • Client heap is 80G using G1GC (I know, "not recommended").
  • Queries are MASSIVE and include span and nested queries.

So a query comes into a client node, gets parsed, and is sent to each data node for execution. When execution fails for some reason, the exception is serialized back to the client, and it appears to include the source query. The client then deserializes that exception, large source query and all, and logs it. I believe this happens for every failed shard, which means the original large query is now multiplied by the number of shards, and that can easily blow up the heap. To make it worse, it looks like the source query gets pretty printed as well; I am not sure whether that happens on the data node or the client.
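For rough numbers (a back-of-envelope sketch only; the ~43M pretty-printed source size is the figure given in a comment further down):

public class ShardFailureHeapEstimate {
    public static void main(String[] args) {
        long sourceChars = 43L * 1024 * 1024; // ~43M of pretty-printed source (see comment below)
        int failedShards = 135;               // one failure message per shard in this index
        // Java 8 strings store 2 bytes per char, and StringBuilder growth while each
        // message is assembled temporarily holds extra copies on top of this.
        long heapBytes = 2L * sourceChars * failedShards;
        System.out.printf("~%.1f GB just for the failure-message text%n",
                heapBytes / (1024.0 * 1024 * 1024));
    }
}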

Here are some log messages I see on a client:

[2017-11-29T15:33:37,374][DEBUG][o.e.a.s.TransportSearchAction] [i-072a3a38771378621] [brs-20171017-qa-ver6][76], node[bgQnrHPiSfKF5yM0goVXUg], [P], s[STARTED], a[id=jwmgKCKMRUagW2un_HHZjQ]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[brs], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=256, batchedReduceSize=512, preFilterShardSize=2, source=MASSIVE_PRETTY_PRINTED_JSON_HERE
[2017-11-29T15:34:06,815][DEBUG][o.e.a.s.TransportSearchAction] [i-072a3a38771378621] [brs-20171017-qa-ver6][102], node[eEdh_9keQZu5h88G0T43vQ], [P], s[STARTED], a[id=s9f39PsfS5q89yYrZOs2pA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[brs], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=256, batchedReduceSize=512, preFilterShardSize=2, source=MASSIVE_PRETTY_PRINTED_JSON_HERE
[2017-11-29T15:34:09,572][DEBUG][o.e.a.s.TransportSearchAction] [i-072a3a38771378621] [brs-20171017-qa-ver6][52], node[-jszFNIJQn-71Uslk2eTmA], [R], s[STARTED], a[id=DF6BYKoaRTGBqTd8BBX_ow]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[brs], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=256, batchedReduceSize=512, preFilterShardSize=2, source=MASSIVE_PRETTY_PRINTED_JSON_HERE

MASSIVE_PRETTY_PRINTED_JSON_HERE is the same source query across each log message. In between these messages I start to see GC collection messages and exceptions like the following sprinkled in:

[2017-11-29T15:34:37,788][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)
[2017-11-29T15:34:35,649][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)

And finally, the following OOM:

[2017-11-29T15:34:42,145][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [i-072a3a38771378621] fatal error in thread [Thread-7], exiting
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_141]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_141]
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_141]
        at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_141]
        at org.elasticsearch.ExceptionsHelper.detailedMessage(ExceptionsHelper.java:104) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.ShardSearchFailure.<init>(ShardSearchFailure.java:75) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:191) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:74) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.InitialSearchPhase.access$100(InitialSearchPhase.java:46) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.search.InitialSearchPhase$1.onFailure(InitialSearchPhase.java:169) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:51) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1067) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport.lambda$handleException$17(TcpTransport.java:1462) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport$$Lambda$2025/1394696696.run(Unknown Source) ~[?:?]
        at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1460) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1452) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1395) ~[elasticsearch-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
[2017-11-29T15:34:42,151][WARN ][o.e.t.n.Netty4Transport  ] [i-072a3a38771378621] exception caught on transport layer [[id: 0x4b38cdd1, L:/10.152.14.51:46126 - R:ip-10-152-12-173.us-west-2.compute.internal/10.152.12.173:9300]], closing connection
org.elasticsearch.ElasticsearchException: java.lang.OutOfMemoryError: Java heap space
        at org.elasticsearch.transport.netty4.Netty4Transport.exceptionCaught(Netty4Transport.java:323) [transport-netty4-5.6.3.jar:5.6.3]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:84) [transport-netty4-5.6.3.jar:5.6.3]
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_141]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_141]
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_141]
        at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_141]
        at org.apache.logging.log4j.core.pattern.PlainTextRenderer.render(PlainTextRenderer.java:32) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.impl.ThrowableProxy.formatThrowableProxy(ThrowableProxy.java:224) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.impl.ThrowableProxy.formatCause(ThrowableProxy.java:211) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.impl.ThrowableProxy.getExtendedStackTraceAsString(ThrowableProxy.java:476) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.pattern.ExtendedThrowablePatternConverter.format(ExtendedThrowablePatternConverter.java:72) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.layout.PatternLayout$PatternSerializer.toSerializable(PatternLayout.java:333) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.layout.PatternLayout.toText(PatternLayout.java:232) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:217) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:57) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:177) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:308) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:439) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:434) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146) ~[log4j-core-2.9.1.jar:2.9.1]
        at org.apache.logging.log4j.spi.ExtendedLoggerWrapper.logMessage(ExtendedLoggerWrapper.java:217) ~[log4j-api-2.9.1.jar:2.9.1]

It does not appear there is any way to prevent this, considering it is happening while reading these large serialized exceptions.

@s1monw s1monw self-assigned this Nov 30, 2017
@s1monw
Contributor

s1monw commented Nov 30, 2017

@mattweber what size are we talking about here per search request source? Can you give me an example of how big they are?

@mattweber
Contributor Author

@s1monw The query that caused the above OOM was 3.5M of non-pretty JSON. I pretty printed it and it was 43M. I have seen larger queries as well.
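A quick way to check that kind of expansion (a sketch using Jackson; any JSON library would do):

import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PrettyExpansion {
    public static void main(String[] args) throws Exception {
        // Re-serialize a compact query body with a pretty printer and compare sizes.
        byte[] compact = Files.readAllBytes(Paths.get(args[0]));
        ObjectMapper mapper = new ObjectMapper();
        Object tree = mapper.readValue(compact, Object.class);
        byte[] pretty = mapper.writerWithDefaultPrettyPrinter()
                .writeValueAsString(tree)
                .getBytes(StandardCharsets.UTF_8);
        System.out.printf("compact=%d bytes, pretty=%d bytes (%.1fx)%n",
                compact.length, pretty.length, (double) pretty.length / compact.length);
    }
}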

Ultimately the issue is on us for submitting an invalid query, but I thought it was crazy that we ended up crashing the client node. I think this could happen even in a "normal" case if enough shards fail and the coordinating node has a smaller heap.

Is the "source" pretty printed and converted to a string on the shard to include in the exception or does that happen on the client after deserialization?

@spinscale spinscale added the :Core/Infra/Logging Log management and logging utilities label Dec 1, 2017
@mattweber
Contributor Author

Looks like it might just be the client-side shard failure logging:

https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/search/InitialSearchPhase.java#L85

That logs the SearchRequest, whose toString then converts the SearchSourceBuilder with the humanReadable flag enabled. Ultimately I think that humanReadable flag is the issue, especially given the kind of expansion I observed above.

https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/search/builder/SearchSourceBuilder.java#L1499
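Roughly, that toString path looks something like this (my paraphrase of the 5.6 code, not the actual source):

import java.io.IOException;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Approximation only: the entire source is re-rendered as JSON on every toString() call.
static String render(SearchSourceBuilder source) throws IOException {
    XContentBuilder builder = XContentFactory.jsonBuilder();
    builder.prettyPrint();          // where the whitespace expansion would come from
    builder.humanReadable(true);    // the humanReadable flag referenced above
    source.toXContent(builder, ToXContent.EMPTY_PARAMS);
    return builder.string();
}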

@mattweber
Contributor Author

But that doesn't make sense, because that should only happen if debug logging is enabled (it is for that package by default); I had disabled it and still ran into the OOM, with exceptions coming out of the netty layer.

@s1monw
Contributor

s1monw commented Dec 5, 2017

That logs the SearchRequest, whose toString then converts the SearchSourceBuilder with the humanReadable flag enabled. Ultimately I think that humanReadable flag is the issue, especially given the kind of expansion I observed above.

humanReadable doesn't mean it's pretty printed! In fact, it's not. I think the requests are just too large. I wonder if we should cap them if they are bigger than X MB? They are also stored inside the task etc., so they can be massive.
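Something along these lines, just to sketch the idea (a hypothetical helper, not an existing API or setting):

// Cap the rendered source before it goes into exception/log messages.
static String cap(String renderedSource, int maxChars) {
    if (renderedSource.length() <= maxChars) {
        return renderedSource;
    }
    return renderedSource.substring(0, maxChars)
            + "... (truncated, " + renderedSource.length() + " chars total)";
}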

@s1monw
Contributor

s1monw commented Dec 5, 2017

@jasontedor @rjernst WDYT? I wonder if we should limit the number of bytes in the toString method of the SearchSourceBuilder.

@mattweber
Contributor Author

@s1monw There is definitely pretty printing happening, and it is easily reproducible. Just create an index and do a search against a non-existent nested field:

{"query": {"nested":{"path":"blah", "query":{"match_all":{}}}}}}

You will get a response like:

{
    "error": {
        "root_cause": [
            {
                "type": "query_shard_exception",
                "reason": "failed to create query: {\n  \"nested\" : {\n    \"query\" : {\n      \"match_all\" : {\n        \"boost\" : 1.0\n      }\n    },\n    \"path\" : \"blah\",\n    \"ignore_unmapped\" : false,\n    \"score_mode\" : \"avg\",\n    \"boost\" : 1.0\n  }\n}",
                "index_uuid": "Ih1UNJGrT82n7z0Ds1o72g",
                "index": "logtest"
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "logtest",
                "node": "ZwHwzg5AQpS2tg6-mBFv6A",
                "reason": {
                    "type": "query_shard_exception",
                    "reason": "failed to create query: {\n  \"nested\" : {\n    \"query\" : {\n      \"match_all\" : {\n        \"boost\" : 1.0\n      }\n    },\n    \"path\" : \"blah\",\n    \"ignore_unmapped\" : false,\n    \"score_mode\" : \"avg\",\n    \"boost\" : 1.0\n  }\n}",
                    "index_uuid": "Ih1UNJGrT82n7z0Ds1o72g",
                    "index": "logtest",
                    "caused_by": {
                        "type": "illegal_state_exception",
                        "reason": "[nested] failed to find nested object under path [blah]"
                    }
                }
            }
        ]
    },
    "status": 400
}

A ton of additional whitespace and newline characters are added that are not part of the original request. The log message looks like:

[2017-12-05T07:59:07,562][DEBUG][o.e.a.s.TransportSearchAction] [ZwHwzg5] All shards failed for phase: [query]
org.elasticsearch.index.query.QueryShardException: failed to create query: {
  "nested" : {
    "query" : {
      "match_all" : {
        "boost" : 1.0
      }
    },
    "path" : "blah",
    "ignore_unmapped" : false,
    "score_mode" : "avg",
    "boost" : 1.0
  }
}
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:329) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:312) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.search.SearchService.parseSource(SearchService.java:617) ~[elasticsearch-5.6.3.jar:5.6.3]
...

This was on ES 5.6.3; I will test 6.0 as well.

@mattweber
Contributor Author

Tested against our real cluster with the same query as in the previous comment, and I again see the pretty printing and duplication for each shard...

[2017-12-05T16:36:04,641][DEBUG][o.e.a.s.TransportSearchAction] [i-072a3a38771378621] [brs-20171125-qa-ver2][127], node[-jaI4JyNRbOlcKl9HvZVTQ], [R], s[STARTED], a[id=W3l6T-poTEqAO289q2JluA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[brs], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=256, batchedReduceSize=512, preFilterShardSize=128, source={
  "query" : {
    "nested" : {
      "query" : {
        "match_all" : {
          "boost" : 1.0
        }
      },
      "path" : "blah",
      "ignore_unmapped" : false,
      "score_mode" : "avg",
      "boost" : 1.0
    }
  }
}}]

@mattweber
Contributor Author

Same behavior on 6.0 as well:

[2017-12-05T08:40:47,358][DEBUG][o.e.a.s.TransportSearchAction] [cdojAdn] All shards failed for phase: [query]
org.elasticsearch.index.query.QueryShardException: failed to create query: {
  "nested" : {
    "query" : {
      "match_all" : {
        "boost" : 1.0
      }
    },
    "path" : "blah",
    "ignore_unmapped" : false,
    "score_mode" : "avg",
    "boost" : 1.0
  }
}
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:327) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:310) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.search.SearchService.parseSource(SearchService.java:669) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:530) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:506) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:302) ~[elasticsearch-6.0.0.jar:6.0.0]

@s1monw
Contributor

s1monw commented Dec 18, 2017

@mattweber can you please try 6.1 again? I think this change has removed the cause for this entirely. If so, I will backport the SearchSourceBuilder change to 5.6 in case we do another release. This is also why I didn't find this earlier; I was always looking at 6.x and master, but the code was different there.

@jasontedor
Member

No additional feedback, closing.
