Search Shard Failure Logging can result in OOM #27596
@mattweber what size are we talking about here per search request source? Can you give me an example of how big they are?
@simonw The query that caused the OOM above was 3.5 MB of non-pretty-printed JSON; pretty-printed, it was 43 MB. I have seen larger queries as well. Ultimately the issue is on us for submitting an invalid query, but I thought it was crazy that we ended up crashing the client node. I think this could potentially happen even in a "normal" case if you have enough shards that fail and a smaller heap on the coordinating node. Is the "source" pretty-printed and converted to a string on the shard to include in the exception, or does that happen on the client after deserialization?
Looks like it might be just the client-side shard failure logging, which logs the
But that doesn't make sense, because that should only happen if debug logging is enabled (it is for that package by default), yet I had disabled it and still ran into the OOM with exceptions coming out of the Netty layer.
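For context, here is a minimal, generic sketch of the kind of debug-level gate being described, written against plain Log4j 2 rather than the actual Elasticsearch logging code; a guard like this is why disabling debug logging for the package was expected to avoid paying for rendering the huge source at all:

```java
// Generic Log4j 2 sketch (not the actual Elasticsearch code) of gating an
// expensive log message behind the debug level: the huge source string is
// only built when debug logging is enabled for this logger.
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class DebugGuardSketch {

    private static final Logger logger = LogManager.getLogger(DebugGuardSketch.class);

    static void logShardFailure(Object source, Exception failure) {
        if (logger.isDebugEnabled()) {
            // Rendering the source can itself allocate tens of megabytes,
            // so it only happens when the message will actually be emitted.
            logger.debug("shard failed for source [" + source + "]", failure);
        }
    }
}
```

The fact that the OOM still happened with debug disabled suggests the allocations are happening earlier, e.g. while deserializing the exceptions off the wire, as described later in this issue.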
@jasontedor @rjernst WDYT? I wonder if we should limit the number of bytes in the toString method of the SearchSourceBuilder.
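For illustration, a minimal sketch of the kind of cap being floated here; the class, method name, and 10,000-character limit are all made up, and this is not the actual SearchSourceBuilder change:

```java
// Hypothetical sketch of capping how much of a huge query source can end up
// in a toString()/log message. Names and the limit are invented for
// illustration; this is not the real SearchSourceBuilder code.
public final class TruncatedSourceSketch {

    private static final int MAX_SOURCE_CHARS = 10_000; // assumed cap

    static String truncateForLogging(String source) {
        if (source == null || source.length() <= MAX_SOURCE_CHARS) {
            return source;
        }
        int omitted = source.length() - MAX_SOURCE_CHARS;
        return source.substring(0, MAX_SOURCE_CHARS) + "... [" + omitted + " more chars]";
    }

    public static void main(String[] args) {
        // Stand-in for a multi-megabyte query source.
        StringBuilder huge = new StringBuilder(5_000_000);
        for (int i = 0; i < 5_000_000; i++) {
            huge.append('x');
        }
        System.out.println(truncateForLogging(huge.toString()).length()); // well under the original size
    }
}
```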
@simonw There is definitely pretty printing happening and it is easily reproducible. Just create an index and do a search against a non-existent nested field (a rough sketch of such a request follows this comment):
You will get a response like:
There are a ton of additional whitespace and newline characters added that are not part of the original request. The log message looks like:
This was on ES 5.6.3; I will test 6.0 as well.
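For reference, a rough sketch of the kind of request involved, written against the Java API; the index and field names are invented, and the exact query, response, and log output from the comment above are not reproduced here. The point is simply that rendering the source to a string (as the shard-failure logging path does) appears, in the 5.6/6.0 versions discussed, to produce indented multi-line JSON much larger than the compact source originally submitted:

```java
// Rough sketch of a nested query against a non-existent nested path; the
// names here are made up. In the 5.6/6.0 versions discussed in this issue,
// rendering the source via toString() appears to produce indented,
// multi-line JSON much larger than the compact source that was submitted.
import org.apache.lucene.search.join.ScoreMode;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class PrettyPrintedSourceSketch {
    public static void main(String[] args) {
        SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.nestedQuery(
                "no_such_nested_path",                                        // non-existent nested field
                QueryBuilders.termQuery("no_such_nested_path.some_field", "value"),
                ScoreMode.None));

        String rendered = source.toString();
        System.out.println(rendered);                          // multi-line, indented JSON
        System.out.println("rendered length: " + rendered.length());
    }
}
```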
Tested against our real cluster with the same query as in the previous comment, and again I see the pretty printing and duplication for each shard...
Same behavior on 6.0 as well:
@mattweber can you please try 6.1 again? I think this change has removed the cause of this entirely. If so, I will go and backport the change for SearchSourceBuilder to
No additional feedback, closing.
I have run into an issue where we are getting OOMs on our coordinating nodes, and it appears to be from logging shard failures for queries with a large source. Some info:
So a query comes into a client node, gets parsed, and is sent to each data node for execution. It fails execution for some reason, and the exception gets serialized back to the client; it appears that it includes the source query. The client then deserializes that exception, including the large source query, and logs it. I believe this happens for every failed shard, which means the original large query is now multiplied by the number of shards, which can easily blow up the heap. To make it worse, it looks like the source query gets pretty-printed as well. Not sure if that happens on the data node or on the client.
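To make the multiplication concrete, a back-of-the-envelope sketch: the ~43 MB pretty-printed source size comes from the comments above, while the shard count and coordinating-node heap size below are assumptions chosen purely for illustration.

```java
// Back-of-the-envelope illustration of the per-shard blow-up described above.
// The ~43 MB pretty-printed source figure is from this issue; the shard count
// and heap size are assumptions for illustration only.
public class ShardFailureBlowupSketch {
    public static void main(String[] args) {
        long prettySourceBytes = 43L * 1024 * 1024;               // ~43 MB pretty-printed source
        int failedShards = 50;                                    // assumed number of failed shards
        long coordinatorHeapBytes = 2L * 1024 * 1024 * 1024;      // assumed 2 GB coordinating-node heap

        long failureTextBytes = prettySourceBytes * failedShards;
        System.out.printf("~%,d MB of shard-failure text vs %,d MB of heap%n",
                failureTextBytes / (1024 * 1024), coordinatorHeapBytes / (1024 * 1024));
        // Roughly: ~2,150 MB of shard-failure text vs 2,048 MB of heap; Java's
        // per-character String overhead makes the real footprint even larger,
        // so one bad query can plausibly exhaust the coordinating node's heap.
    }
}
```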
Here are some log messages I see on a client:
MASSIVE_PRETTY_PRINTED_JSON_HERE
is the same source query across each log message. In between these messages I start to see GC collection messages and exceptions like the following sprinkled in:
And finally, the following OOM:
It does not appear there is any way to prevent this, considering it's happening while reading these large serialized exceptions.