Make behaviour consistent when big oneliners JSON is provided #44

Open
andsel opened this issue Aug 30, 2024 · 7 comments · May be fixed by #45
Comments

@andsel
Contributor

andsel commented Aug 30, 2024

PR #43 introduced the param decode_size_limit_bytes to limit the length of the JSON line that can be parsed, to avoid potential OOM errors with very big one-liner files.
The setting has a default value of 20MB, and this introduces a breaking change in behaviour. A user who normally consumes lines bigger than 20MB without hitting an OOM error will, after upgrading to version 3.2.0, experience a looping error like:

[2024-08-30T16:35:28,969][ERROR][logstash.javapipeline    ][main][1ecda24c09fdc5ba076096bc6e7499b710cb91e796741106f9e28599ed6a58a0] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::Stdin codec=><LogStash::Codecs::JSONLines decode_size_limit_bytes=>32768, id=>"debdf17e-41b7-48ab-a678-1c9324a1bc9d", enable_metric=>true, charset=>"UTF-8", delimiter=>"\n">, id=>"1ecda24c09fdc5ba076096bc6e7499b710cb91e796741106f9e28599ed6a58a0", enable_metric=>true>
  Error: input buffer full
  Exception: Java::JavaLang::IllegalStateException
  Stack: org.logstash.common.BufferedTokenizerExt.extract(BufferedTokenizerExt.java:83)
org.logstash.common.BufferedTokenizerExt$INVOKER$i$1$0$extract.call(BufferedTokenizerExt$INVOKER$i$1$0$extract.gen)

This error loops and leaves the pipeline stuck without making any progress.
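
For illustration, here is a minimal sketch that reproduces the failure against the codec alone from a Logstash Ruby console; the 32 KB limit and the 40 KB payload are arbitrary values chosen only so that the line exceeds the limit:

codec = LogStash::Plugin.lookup("codec", "json_lines").new(
  "delimiter" => "\n",
  "decode_size_limit_bytes" => 32768
)

big_line = { "message" => "a" * 40_000 }.to_json + "\n"  # bigger than the 32 KB limit
codec.decode(big_line) { |event| puts event }
# per the stack trace above: Java::JavaLang::IllegalStateException, "input buffer full"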

Proposal

This issue proposes to restore the original behaviour by default. Additionally, when the codec has decode_size_limit_bytes configured and a line exceeds the limit, it should still create an event containing the partial string data, and tag the event so that the pipeline can route and handle the error condition.

This can be implemented after BufferedTokenizerExt is fixed to throw an exception also when the offending token is not the first one in the fragment (elastic/logstash#17017).
Ideally the tokenizer should return an iterator that verifies the size limit on every next method call.
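
A minimal Ruby sketch of that idea (illustrative only, not the actual Java BufferedTokenizerExt API; class and method names are made up):

class SizeCheckingTokenizer
  def initialize(delimiter: "\n", size_limit: nil)
    @delimiter  = delimiter
    @size_limit = size_limit
    @buffer     = +""
  end

  # Returns an Enumerator, so the size limit is verified lazily,
  # on every iteration step rather than once per extract call.
  def extract(data)
    @buffer << data
    Enumerator.new do |yielder|
      while (idx = @buffer.index(@delimiter))
        token = @buffer.slice!(0, idx + @delimiter.size).chomp(@delimiter)
        if @size_limit && token.bytesize > @size_limit
          raise "token of #{token.bytesize} bytes exceeds the #{@size_limit} bytes limit"
        end
        yielder << token
      end
    end
  end

  def flush
    remainder = @buffer
    @buffer = +""
    remainder
  end
end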

@lpeter91

lpeter91 commented Dec 5, 2024

Also facing this issue when using the logstash-to-logstash integration, which internally uses this codec. I think #43 completely broke logstash-input-http. The pipeline basically stops within a few minutes after startup. I'm not entirely sure how the current default 500MB buffer gets filled so fast; the forwarded data is nowhere near that size (at least it's much smaller at rest).

Maybe the integration plugin is using this codec wrong? (If that's even a possible thing.)

@jsvd
Member

jsvd commented Dec 5, 2024

Looking into this, I'll post here soon with findings.

@jsvd
Member

jsvd commented Dec 5, 2024

I think I know what's wrong. The logstash integration's "logstash input" is using the json_lines codec instead of the plain json codec.
On the "client side", the logstash output takes the batch of data and transforms it into a JSON array with the individual JSON objects (events) inside, so the receiving end (the logstash input) should just parse the object as a whole; it shouldn't be using the buffered json_lines.

This is coupled with a separate bug in the BufferedTokenizer: the tokenizer doesn't properly reset its state when data is only extracted from it through flushing:

jruby-9.4.9.0 :001 > data = [{"message" => "a"*20 }]
 => [{"message"=>"aaaaaaaaaaaaaaaaaaaa"}] 
jruby-9.4.9.0 :002 > data_json = data.to_json
 => "[{\"message\":\"aaaaaaaaaaaaaaaaaaaa\"}]" 
jruby-9.4.9.0 :003 > codec = LogStash::Plugin.lookup("codec", "json_lines").new("delimiter" => "\n", "decode_size_limit_bytes" => 100)
 => <LogStash::Codecs::JSONLines [...] >
jruby-9.4.9.0 :004 > codec.decode(data_json) {|event| puts "decode got 1 event" }; codec.flush {|event| puts "flush got 1 event" }
flush got 1 event
 => [#<LogStash::Event:0x450d0da9>] 
jruby-9.4.9.0 :005 > codec.decode(data_json) {|event| puts "decode got 1 event" }; codec.flush {|event| puts "flush got 1 event" }
flush got 1 event
 => [#<LogStash::Event:0x6da63893>] 
jruby-9.4.9.0 :006 > codec.decode(data_json) {|event| puts "decode got 1 event" }; codec.flush {|event| puts "flush got 1 event" }
org.logstash.common.BufferedTokenizerExt.extract(BufferedTokenizerExt.java:85): input buffer full (Java::JavaLang::IllegalStateException)

Because the logstash output sends JSON payloads without a delimiter (they're just JSON arrays), data only comes out of the codec through flushing, which triggers this bug.
The bug is caused by the internal state variable inputSize not being reset when the tokenizer accepts data that doesn't immediately produce any entity (given it's not "\n"-delimited data).
So inputSize keeps increasing until it raises an exception stating that the buffer is full when in fact it isn't; it's empty (the data is being drained by .flush calls from the logstash input plugin).
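
For clarity, a conceptual Ruby model of that state bug (the real code is the Java class org.logstash.common.BufferedTokenizerExt; this is only an approximation of its bookkeeping):

class MiniTokenizer
  def initialize(size_limit)
    @size_limit = size_limit
    @buffer     = +""
    @input_size = 0
  end

  def extract(data)
    @input_size += data.bytesize
    raise "input buffer full" if @input_size > @size_limit
    @buffer << data
    return [] unless @buffer.include?("\n")  # no delimiter: nothing is emitted
    *tokens, @buffer = @buffer.split("\n", -1)
    @input_size = @buffer.bytesize           # the counter is reset only on this path
    tokens
  end

  def flush
    remainder = @buffer
    @buffer = +""
    # @input_size = 0   <- the missing reset: undelimited payloads drained only via
    #                      flush keep inflating @input_size across decode/flush cycles
    #                      until extract raises "input buffer full" on an empty buffer
    remainder
  end
end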

@jsvd
Member

jsvd commented Dec 5, 2024

Fixes are therefore needed in:

  1. the logstash integration: switch to the json codec instead of json_lines in the "logstash input" plugin
  2. the BufferedTokenizer: change the .flush method to reset inputSize.

[edit] after chatting with @yaauie it makes more sense to fix the logstash output to properly emit ndjson like it was supposed to in the first place.
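
For clarity, the difference between the two payload shapes (values are illustrative):

events = [{ "message" => "a" }, { "message" => "b" }]

json_array = events.to_json
# => "[{\"message\":\"a\"},{\"message\":\"b\"}]"
#    no "\n" delimiter, so json_lines only ever drains it via flush

ndjson = events.map(&:to_json).join("\n") + "\n"
# => "{\"message\":\"a\"}\n{\"message\":\"b\"}\n"
#    every line is a complete token, so the buffered tokenizer can emit events as data arrives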

@jsvd
Member

jsvd commented Dec 6, 2024

Fixes in progress:

  1. output: emit ndjson payloads logstash-integration-logstash#25
  2. ensure inputSize state value is reset during buftok.flush elastic/logstash#16760

@robbavey
Contributor

@andsel Can this be closed?

@andsel
Contributor Author

andsel commented Mar 14, 2025

No, it can't be closed because the feature still needs to be implemented. As of version 3.2.2 the codec uses a big limit (512MB) and, when the limit isn't configured, logs a deprecation message.
This issue is to:

  • remove the limit
  • log a deprecation warning if the user doesn't define a value for decode_size_limit_bytes
  • if decode_size_limit_bytes is set and a "buffer full" condition occurs, recover from the error, create an event, tag it, and resume from the next token at (see the sketch below)
    @buffer.extract(data).each do |line|
      parse_json(@converter.convert(line), &block)
    end
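
A hedged sketch of that recovery path, wrapping the extract loop quoted above; the tag name and the use of @buffer.flush to grab the partial data are assumptions, not the final implementation:

def decode(data, &block)
  @buffer.extract(data).each do |line|
    parse_json(@converter.convert(line), &block)
  end
rescue Java::JavaLang::IllegalStateException => e
  # "input buffer full": instead of crashing the pipeline, surface the partial
  # payload as a tagged event and let the tokenizer resume from the next token.
  partial = @buffer.flush.to_s                    # assumption: drain whatever was buffered
  event   = LogStash::Event.new("message" => partial)
  event.tag("_jsonparsetoobigfailure")            # hypothetical tag name
  block.call(event)
end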
