Infinite reprocessing of messages from Snappy compressed producer #457
I've been able to recreate this issue using a Ruby producer. Strangely, when using the Ruby producer the repetition is harder to reproduce and the loop appears to happen at a slower rate. It also appears that the producer below may need to be run twice before the infinite loop starts in the consumer.

```ruby
# Place this in the project directory
$:.unshift File.expand_path("../lib", __FILE__)

require "ruby-kafka"
require "snappy"

kafka = Kafka.new(seed_brokers: "172.18.0.3:9092")
producer = kafka.producer(compression_codec: :snappy)

30.times do |i|
  producer.produce(i.to_s, topic: "signals", partition_key: "abcdefgh12")
end

producer.deliver_messages
producer.shutdown
```
I saw #429 and I don't believe this is a duplicate. The messages in the test are much smaller than the default. It's worth pointing out that this is happening without using a consumer group (simple consumer), so we don't have any of the additional commit tracking that happens in that case.
Can you paste some logs at DEBUG level?
@gaffneyc can you try with the consumer group API, i.e.

```ruby
consumer = kafka.consumer(group_id: "my-group")
consumer.subscribe("signals")

consumer.each_message do |message|
  p message
end
```

and paste the logs if it's still not working?
Sure thing, here are the logs and the consumer: https://gist.github.com/gaffneyc/7fea8395235c07d5c0f7619570c3aac9

As part of this test I've also reduced the number of Kafka brokers from 5 to 2 and partitions from 64 to 4 to try to reduce the log output.
That's weird. Can you investigate whether there have been any changes to how compression is done in Kafka?
Yeah, it is weird. In further testing, with Snappy disabled in the producer, everything worked as expected on 0.5.0. As soon as I send a message with Snappy enabled, it enters the infinite processing loop. I'm not sure if you saw, but it looked like 7acbd7b was the commit that introduced the problem (found via git bisect). My first thought was that the updated API wasn't quite correct with compression, or that the version upgrade handshake might not be working as expected. Honestly, I'm not familiar enough with the code or the Kafka protocol to know where to begin. I'll test against Kafka 0.10.2.1 to see if I can reproduce it there; jumping into Kafka's changelog and code is probably out of my depth.
It could be that v2 of the fetch API introduces some change in how compressed messages work. It's just weird that you're not getting an error then, as the message is correctly processed...
Tested it again on 0.10.2.1 and was able to reproduce it (added the logs to the gist). Each test is run against a fresh Kafka cluster.
It sounds like a change in API semantics for the fetch API, although I'm baffled that the problem seems to be at the processing level; I would assume that we'd maybe get back the same message from the broker repeatedly.
In the logs it stands out that the offset is 0 for each of the messages. The value is an incrementing number, so the values should map (roughly) to the offsets in Kafka. If the offsets aren't being parsed correctly, that could explain why the consumer is never moving forward.
If you can run with a local version of ruby-kafka, can you just pretty print all the batches returned from the broker to the terminal?
Ah, they're different messages! I thought the same message was being re-processed.
Ah yeah, the producer is pushing 30 messages to the topic.
Hmm, it could be that the client has to calculate the offset itself when decompressing the messages; compression is sort of weird in Kafka versions < 0.11. Basically, you jam all the messages together in a message set, then compress and place those bytes in the value of a new Kafka message, which is the one that is actually written to the broker...
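To make that wrapping concrete, here's a purely illustrative Ruby sketch of the idea. It is not ruby-kafka's actual wire format: `Marshal` stands in for Kafka's binary message-set encoding, and the `snappy` gem (already used by the producer above) is assumed.

```ruby
require "snappy" # assumes the snappy gem used in the producer above

# Pretend these are the individual messages in the set.
messages = (0...30).map { |i| { offset: i, value: i.to_s } }

# "Serialize" the message set. Real Kafka uses a binary format; Marshal is a stand-in.
message_set = Marshal.dump(messages)

# The wrapper message that is actually written to the broker carries the compressed bytes.
wrapper = { codec: :snappy, value: Snappy.deflate(message_set) }

# On fetch, the consumer has to inflate the wrapper and unpack the inner message set.
inner = Marshal.load(Snappy.inflate(wrapper[:value]))
puts inner.size # => 30
```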
My guess would be that in Kafka 0.10 the brokers no longer decompress those message sets in order to set the correct offsets, instead relying on the clients to calculate them based on their relative offset from the "container" message.
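If that guess is right, the fix-up would look roughly like the sketch below. This is only an illustration of the idea, assuming the wrapper message carries the absolute offset of the last inner message and the inner messages carry offsets relative to the start of the set; the names are hypothetical, not ruby-kafka internals.

```ruby
# Illustrative only: compute absolute offsets for decompressed inner messages,
# assuming the wrapper's offset is the absolute offset of the *last* inner message
# and each inner message's offset is relative (0, 1, 2, ...).
InnerMessage = Struct.new(:offset, :value)

def absolute_offsets(wrapper_offset, inner_messages)
  base_offset = wrapper_offset - inner_messages.last.offset
  inner_messages.map { |m| base_offset + m.offset }
end

inner = (0...30).map { |i| InnerMessage.new(i, i.to_s) }
p absolute_offsets(29, inner).first(3) # => [0, 1, 2]
```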
Can you see if this fixes the problem? #458
Force-pushed a new version that should work.
Yep! That looks to have fixed the issue. I'm getting the correct offsets in the logs and messages are only being processed once.
🎉 I'll try to add a test and will merge it tomorrow.
Awesome! Thank you for getting a fix in there and building ruby-kafka in the first place.
We have a message producer written in Go and a consumer written in Ruby. It appears that enabling snappy compression on the producer causes messages to be infinitely reprocessed in the Ruby consumer.
The example below works fine on 0.4.3, and a git bisect shows that the problem may have been introduced by 7acbd7b. I've tried to reduce the problem to the smallest reproducible version (consumer groups are not necessary), though I have not tried to recreate it with a Ruby producer.

Steps to reproduce
In the code below I have Kafka running locally in Docker at 172.18.0.3:9092; you may need to change the broker list to get it working locally.
log: https://gist.github.com/gaffneyc/f2de66eceb7a4c2f9967c0ba4acda402
producer.go
consumer.rb
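The original producer.go and consumer.rb attachments aren't reproduced here. Based on the description above (a simple consumer on the "signals" topic, no consumer group), the consumer would have been roughly along these lines; this is a hypothetical reconstruction, not the original file:

```ruby
# consumer.rb (hypothetical reconstruction): a simple consumer without a
# consumer group, reading the "signals" topic from the local Docker broker.
require "ruby-kafka"
require "snappy"

kafka = Kafka.new(seed_brokers: "172.18.0.3:9092")

kafka.each_message(topic: "signals", start_from_beginning: true) do |message|
  puts "offset=#{message.offset} value=#{message.value}"
end
```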
Expected outcome
Messages are processed once and the consumer waits for the next available message.
Actual outcome
Messages are processed in an infinite loop. For a topic with a large number of messages it appears that only a subset may be processed.