Consumer gets stuck in loop fetching partial message #429

Closed
will89 opened this issue Oct 12, 2017 · 14 comments

will89 commented Oct 12, 2017

  • Version of Ruby: 2.4.2
  • Version of Kafka: 0.9.0.1
  • Version of ruby-kafka: 0.4.2

Recently, a consumer got silently stuck trying to pull a message that was larger than the max_bytes_per_partition we had configured for it. It would have been nice if our consumer had raised an error when this scenario occurred. This looks like something the Java consumer does in Kafka 0.9, based on https://issues.apache.org/jira/browse/KAFKA-3442.
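
For context, our consumer subscribes with an explicit per-partition fetch limit, roughly like this (broker addresses, topic and group names here are illustrative, not our real config):

```ruby
require "kafka"

kafka = Kafka.new(seed_brokers: ["kafka1:9092"], client_id: "my-app")

consumer = kafka.consumer(group_id: "my-group")

# A message larger than this per-partition fetch limit can never be returned
# in full by the broker.
consumer.subscribe("my-topic", max_bytes_per_partition: 400 * 1024)

consumer.each_message do |message|
  # process message.value ...
end
```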

dasch (Contributor) commented Oct 14, 2017

Do you have logs from the incident?

dasch (Contributor) commented Oct 16, 2017

If the broker simply returns an empty message set, there's not much we can do on the client side. Any logs would be helpful, as would a close reading of any recent changes to the protocol or the Java client that would allow us to detect this issue.

will89 (Author) commented Oct 16, 2017

In the logs, the "committing offsets" messages showed several partitions not moving for 8 days. When we started reading from those partitions again, we skipped forward several thousand offsets. When we dug into it, our producer had generated a single message that was larger than the configured max_bytes_per_partition for the consumer.

When experimenting locally, I produced a message that was 600KB and configured a consumer with a max_bytes_per_partition of 400KB. The decoder, https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/protocol/decoder.rb#L81, would show that a 600KB message was there, but it would fail to decode because only 400KB was present in the io object.

I can try to do some more digging into the documentation of this behavior.
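
For reference, my local reproduction was roughly along these lines (the topic name and sizes are just what I used for the experiment):

```ruby
require "kafka"

kafka = Kafka.new(seed_brokers: ["localhost:9092"], client_id: "repro")

# Produce a single 600KB message.
kafka.deliver_message("x" * 600 * 1024, topic: "partial-message-test")

# Consume with a per-partition limit smaller than that message. The fetch
# response then contains only the first 400KB of the message, decoding hits
# the end of the buffer, and the consumer never advances past this offset.
consumer = kafka.consumer(group_id: "repro-group")
consumer.subscribe("partial-message-test", max_bytes_per_partition: 400 * 1024)

consumer.each_message do |message|
  puts message.offset # never reached for the oversized message
end
```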

dasch (Contributor) commented Oct 16, 2017

That would be great. From your comment, it sounds like the broker does return some of the content. I'm not quite sure what the client is supposed to do in that situation – I would err on the side of an explicit error in the log, telling the user that she needs to increase max_bytes_per_partition in order to process the message.

dasch (Contributor) commented Oct 16, 2017

Yeah, that sounds like the only sensible option, although I'd prefer to keep the consumer running so that the other partitions can be processed...

dasch (Contributor) commented Oct 16, 2017

Maybe this error is sufficiently exceptional to warrant crashing the consumer process with an error message.

will89 (Author) commented Oct 16, 2017

In our situation, I think crashing the consumer process would have been the preferred behavior, as this misconfiguration could eventually have led to no partitions being processed.

dasch (Contributor) commented Oct 16, 2017

When the message fails to decode, is there no exception being raised?

dasch (Contributor) commented Oct 16, 2017

I would expect the consumer to crash, actually... or at least write an exception to the logs.

will89 (Author) commented Oct 16, 2017

I think it's because this partial-message scenario manifests as an EOFError, and that error gets ignored here: https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/protocol/message_set.rb#L40.
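
For reference, the decode loop at that line treats EOFError as the expected "partial trailing message" case, roughly like this (paraphrased, not the exact source):

```ruby
def self.decode(decoder)
  fetched_messages = []

  until decoder.eof?
    begin
      fetched_messages << Message.decode(decoder)
    rescue EOFError
      # A truncated message at the end of the set is expected, because the
      # broker hands back a raw slice of the log. Decoding just stops here,
      # so a partial first message is silently dropped.
      break
    end
  end

  new(messages: fetched_messages)
end
```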

dasch (Contributor) commented Oct 16, 2017

Ugh, that's bad – we also get EOFError during normal fetches, since Kafka just hands a slice directly from disk to the client, not worrying about including a whole number of messages...

I guess we could check whether any other messages have been read, and raise an error if the first message results in EOFError.
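
Something like this, maybe? A rough sketch only; the MessageTooLarge error name is made up, and it assumes the decode loop looks like the one sketched above:

```ruby
def self.decode(decoder)
  fetched_messages = []

  until decoder.eof?
    begin
      fetched_messages << Message.decode(decoder)
    rescue EOFError
      if fetched_messages.empty?
        # Nothing at all could be decoded from the fetched bytes, so the very
        # first message must be larger than max_bytes_per_partition and the
        # consumer would otherwise spin on this offset forever.
        raise MessageTooLarge,
          "failed to decode any messages from the fetched data; " \
          "consider increasing max_bytes_per_partition"
      end

      # A partial message after at least one complete one is the normal case.
      break
    end
  end

  new(messages: fetched_messages)
end
```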

will89 (Author) commented Oct 16, 2017

Can the EOFError be raised in more exceptional circumstances, like the connection dying?

dasch (Contributor) commented Oct 16, 2017

I'm not entirely sure; I'd have to dive into the IO docs.
