Losing messages in RetryingBatchErrorHandler/ErrorHandlingUtils during rebalancing #2340

vooft · 2022-07-09T08:28:48Z

In what version(s) of Spring for Apache Kafka are you seeing this issue?

For example:

2.3.7+ (since introduction of RetryingBatchErrorHandler)

Describe the bug

If a listener is in the middle of a retry and a rebalance kicks in, the consumer is unpaused. Every keep-alive consumer.poll() invocation in the error handler then starts returning records, and the error handler throws them away.

spring-kafka/spring-kafka/src/main/java/org/springframework/kafka/listener/ErrorHandlingUtils.java

Line 69 in e4d9641

consumer.poll(Duration.ZERO);

consumer.pause(consumer.assignment());
try {
    while (nextBackOff != BackOffExecution.STOP) {
        consumer.poll(Duration.ZERO); // after rebalancing this poll will start returning records
        try {
...

It seems that the rebalance listener in KafkaMessageListenerContainer can re-pause the consumer, but it is not aware that the error handler paused it.

To Reproduce

Create a listener that always fails, configured with a RetryingBatchErrorHandler that retries indefinitely, and force a rebalance. Once the rebalance has finished, RetryingBatchErrorHandler drains all messages from the assigned partitions without processing them.

Expected behavior

RetryingBatchErrorHandler should re-pause the consumer after rebalancing.
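
To make the failure mode and the expected behavior concrete, here is a minimal plain-Java sketch. It does not use the real Kafka client: `ToyConsumer` and `RetryPauseSketch` are hypothetical stand-ins that only model the behavior described in this report (a rebalance clears the paused state, and the retry loop ignores whatever its keep-alive poll returns).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy stand-in for a consumer; an assumption for illustration, NOT the KafkaConsumer API. */
final class ToyConsumer {
    private final Deque<String> pending = new ArrayDeque<>();
    private boolean paused;

    ToyConsumer(List<String> records) {
        pending.addAll(records);
    }

    void pause() {
        paused = true;
    }

    /** Models a rebalance that hands partitions back in an unpaused state. */
    void rebalance() {
        paused = false;
    }

    /** Returns all pending records unless the consumer is paused. */
    List<String> poll() {
        List<String> out = new ArrayList<>();
        if (!paused) {
            while (!pending.isEmpty()) {
                out.add(pending.poll());
            }
        }
        return out;
    }
}

final class RetryPauseSketch {
    /**
     * Runs `attempts` keep-alive polls; a rebalance happens just before
     * attempt number `rebalanceAt`. Returns how many records the loop
     * fetched and silently dropped.
     */
    static int retry(ToyConsumer consumer, int attempts, int rebalanceAt, boolean repause) {
        int discarded = 0;
        consumer.pause();
        for (int i = 0; i < attempts; i++) {
            if (i == rebalanceAt) {
                consumer.rebalance();
            }
            if (repause) {
                consumer.pause(); // expected behavior: pause again before every poll
            }
            discarded += consumer.poll().size(); // the poll result is never processed
        }
        return discarded;
    }

    public static void main(String[] args) {
        List<String> records = List.of("r1", "r2", "r3");
        System.out.println("without re-pause: " + retry(new ToyConsumer(records), 5, 2, false));
        System.out.println("with re-pause:    " + retry(new ToyConsumer(records), 5, 2, true));
    }
}
```

In this model the loop without re-pausing drops all three records at the first poll after the rebalance, while the loop that pauses before every poll drops none. How the real container and error handler coordinate pausing is not modeled here.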

Sample

The test uses RetryingBatchErrorHandler, but the logic in ErrorHandlingUtils is essentially the same:
https://github.com/vooft/kafka-retry-issue/blob/master/src/test/java/com/example/kafkaissue/KafkaIssueTest.java

garyrussell · 2022-07-11T14:12:28Z

Thanks for reporting; this is indeed a bug.

Please note that the RecoveringBatchErrorHandler is the preferred mechanism (pre 2.8) and its functionality is the default in the DefaultErrorHandler since 2.8. With that mechanism, the listener throws a BatchListenerFailedException to indicate which record in the batch failed.
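
As a rough illustration of why reporting the failed record's position helps, the following plain-Java sketch splits a batch at the reported index. `ToyBatchFailure` and `BatchSplitSketch` are hypothetical names invented for this example and are not spring-kafka classes; the real handler's offset commits, seeks, and recovery are deliberately not modeled.

```java
import java.util.List;

/** Hypothetical exception carrying the index of the failed record; not a spring-kafka class. */
final class ToyBatchFailure extends RuntimeException {
    final int index;

    ToyBatchFailure(String message, int index) {
        super(message);
        this.index = index;
    }
}

final class BatchSplitSketch {
    /** Processes records in order and reports the position of the first failure. */
    static void listen(List<String> batch) {
        for (int i = 0; i < batch.size(); i++) {
            if (batch.get(i).startsWith("bad")) {
                throw new ToyBatchFailure("cannot process " + batch.get(i), i);
            }
        }
    }

    /** Returns the records still to be handled: the failed one and everything after it. */
    static List<String> remainingAfterFailure(List<String> batch) {
        try {
            listen(batch);
            return List.of();
        } catch (ToyBatchFailure e) {
            // Records before e.index were processed successfully and need no retry.
            return batch.subList(e.index, batch.size());
        }
    }

    public static void main(String[] args) {
        System.out.println(remainingAfterFailure(List.of("a", "b", "bad-c", "d")));
    }
}
```

With the index available, only the failed record and the ones after it need further handling, instead of the whole batch being retried as a unit.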

vooft · 2022-07-11T14:59:54Z

Thank you for your answer @garyrussell
I'm looking at DefaultErrorHandler, and it seems that if you just throw an arbitrary exception from the listener, it will still delegate the handling to RetryingBatchErrorHandler, not RecoveringBatchErrorHandler. Or am I missing something?

In spring-kafka 2.8.x:

Default fallback is FallbackBatchErrorHandler
https://github.com/spring-projects/spring-kafka/blob/2.8.x/spring-kafka/src/main/java/org/springframework/kafka/listener/DefaultErrorHandler.java#L93

And FallbackBatchErrorHandler extends RetryingBatchErrorHandler
https://github.com/spring-projects/spring-kafka/blob/2.8.x/spring-kafka/src/main/java/org/springframework/kafka/listener/FallbackBatchErrorHandler.java#L32

garyrussell · 2022-07-11T15:26:40Z

Yes; I agree; I was just pointing out that the recovering mechanism is preferred to the (older) retrying mechanism that suffers from this bug. I will have a fix soon.

vooft · 2022-07-11T15:53:21Z

Gotcha, thank you!

Resolves spring-projects#2340. The `RetryingBatchErrorHandler` (now called the `FallbackBatchErrorHandler`) pauses and resumes the consumer during retries, so that it can keep polling the consumer and avoid a forced rebalance. However, if a normal rebalance occurs, for example when a new member joins, the error handler does not re-pause the consumer and silently consumes new records. Add a mechanism to always re-pause the consumer when in this retry mode. **cherry-pick to 2.9.x, 2.8.x**

garyrussell · 2022-07-18T18:04:06Z

@vooft 2.8.8 with the fix is now in Maven Central.

vooft added status: waiting-for-triage type: bug labels Jul 9, 2022

garyrussell added backport 2.8.x (obsolete) backport 2.9.x (obsolete) and removed status: waiting-for-triage labels Jul 11, 2022

garyrussell mentioned this issue Jul 11, 2022

GH-2340: Fix Retrying Batch Error Handling #2341

Merged

artembilan closed this as completed in #2341 Jul 11, 2022

vooft mentioned this issue Jul 21, 2022

ClassCastException in ErrorHandlerAdapter in 2.8.8 #2363

Closed
