
Avoid unnecessary partition seeks on successful record recovery #2195

Closed
v-chernyshev opened this issue Mar 28, 2022 · 10 comments · Fixed by #2207

Comments

@v-chernyshev

I've reordered the sections for this request because, in my opinion, it is easier to understand the desired behaviour when the limitations of the current approach are explained first.

Current Behavior

Let's consider a situation where a full batch of max.poll.records (500 by default) is available for consumption in a non-transactional record listener container, but the very first record cannot be processed immediately due to e.g. an issue with an external data provider. Here is what happens in the code, at least in the case of non-blocking retries (paraphrased in the sketch after the list):

  • KafkaMessageListenerContainer.ListenerConsumer.doInvokeRecordListener is eventually invoked with record pointing to the first record and iterator holding the remaining 499.
  • this.invokeOnMessage fails with an exception.
  • An error handler is installed, so this.invokeErrorHandler is called.
  • this.commonErrorHandler.remainingRecords() returns true. Both the failed record and all the remaining records from the iterator are drained into a temporary list, and this.commonErrorHandler.handleRemaining leads to DefaultErrorHandler.handleRemaining, which performs SeekUtils.seekOrRecover.
  • An attempt is made to recover the failed record.
  • No matter whether the recovery operation succeeds or not, the rest of the batch is rewound in seekPartitions.
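
For reference, here is the flow just described, paraphrased as a sketch (this is not the actual Spring Kafka source; local names and the container reference are simplified):

private void invokeErrorHandler(ConsumerRecord<K, V> record,
        Iterator<ConsumerRecord<K, V>> iterator, RuntimeException rte) {
    if (this.commonErrorHandler.remainingRecords()) {
        // Drain the failed record plus everything left in the current batch...
        List<ConsumerRecord<?, ?>> remaining = new ArrayList<>();
        remaining.add(record);
        while (iterator.hasNext()) {
            remaining.add(iterator.next());
        }
        // ...and hand all of them to the error handler. DefaultErrorHandler.handleRemaining
        // performs SeekUtils.seekOrRecover: it attempts to recover remaining.get(0) and then
        // rewinds every partition to its earliest remaining offset, recovered or not.
        this.commonErrorHandler.handleRemaining(rte, remaining, this.consumer, this.container);
    }
    else {
        this.commonErrorHandler.handleRecord(rte, record, this.consumer, this.container);
    }
}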

Unfortunately, this unconditional rewind operation leads to enormous spikes in network I/O as pretty much the same records are requested from Kafka over and over again. To be precise, with a local Kafka broker and a test application that always triggers a recovery I could see spikes exceeding 2000 Mbps in iptraf.

Expected Behavior

I believe that it is not necessary to always rewind the remaining 499 records if the recovery is successful. This is what may happen with the failed record when non-blocking retries are enabled:

  • If the record cannot be processed due to a fatal exception (e.g. ClassCastException) then it is sent straight to the DLT.
  • If the exception is not fatal, then the record is sent to the next retry destination.

Both these cases allow the next iterator record to be processed immediately as if nothing happened. This very same record will be at the front of the batch during the next poll invocation anyway! The only special case I can think of is KafkaBackoffException, which does absolutely require rewinding all the offsets so that the same batch is consumed again when the affected partitions are resumed.

Context

The production rollout of our service that uses the non-blocking retries feature provided by Spring Kafka triggered a number of network I/O alarms, so we decided to find the root cause. This feature request is the result of the investigation :)

Implementing a workaround for this issue is possible but, admittedly, quite tricky, as invokeErrorHandler is a private API method. These are the steps that may be taken (sketched in code after the list):

  • Implement a custom CommonErrorHandler that:
    • Exposes an API that allows remainingRecords to return either true or false.
    • Overrides handleRecord in a way that delegates to handleRemaining of the default error handler with a singleton list as the second argument.
  • Implement a custom RecordInterceptor with an overridden failure method that uses the above API to make remainingRecords return true only if SeekUtils.isBackoffException is true.
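
A rough sketch of those two classes is shown below (the class names, the volatile flag and the delegation choices are illustrative rather than existing Spring Kafka types; it assumes the 2.8 CommonErrorHandler and RecordInterceptor contracts plus the SeekUtils.isBackoffException check mentioned above):

import java.util.Collections;
import java.util.List;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import org.springframework.kafka.listener.CommonErrorHandler;
import org.springframework.kafka.listener.MessageListenerContainer;
import org.springframework.kafka.listener.RecordInterceptor;
import org.springframework.kafka.listener.SeekUtils;

public class CustomErrorHandler implements CommonErrorHandler {

    private final CommonErrorHandler delegate;

    // Set by the interceptor below before the container queries remainingRecords();
    // the customizer creates one handler per container, so a simple flag is enough.
    private volatile boolean seekRemaining = true;

    public CustomErrorHandler(CommonErrorHandler delegate) {
        this.delegate = delegate;
    }

    public void setSeekRemaining(boolean seekRemaining) {
        this.seekRemaining = seekRemaining;
    }

    @Override
    public boolean remainingRecords() {
        return this.seekRemaining;
    }

    @Override
    public void handleRecord(Exception thrownException, ConsumerRecord<?, ?> record,
            Consumer<?, ?> consumer, MessageListenerContainer container) {
        // Recover only the failed record; the container then carries on with the iterator.
        this.delegate.handleRemaining(thrownException, Collections.singletonList(record),
                consumer, container);
    }

    @Override
    public void handleRemaining(Exception thrownException, List<ConsumerRecord<?, ?>> records,
            Consumer<?, ?> consumer, MessageListenerContainer container) {
        // Only reached for KafkaBackoffException: keep the default drain-and-seek behaviour.
        this.delegate.handleRemaining(thrownException, records, consumer, container);
    }
}

// In a separate file:
public class CustomRecordInterceptor<K, V> implements RecordInterceptor<K, V> {

    private final CustomErrorHandler errorHandler;

    public CustomRecordInterceptor(CustomErrorHandler errorHandler) {
        this.errorHandler = errorHandler;
    }

    @Override
    public ConsumerRecord<K, V> intercept(ConsumerRecord<K, V> record) {
        return record;
    }

    @Override
    public void failure(ConsumerRecord<K, V> record, Exception exception, Consumer<K, V> consumer) {
        // Rewind the remaining records only when the partitions are being backed off.
        this.errorHandler.setSeekRemaining(SeekUtils.isBackoffException(exception));
    }
}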

ListenerContainerFactoryConfigurer.setContainerCustomizer may then be used to tie things together, e.g.:

@Bean(name = RetryTopicInternalBeanNames.LISTENER_CONTAINER_FACTORY_CONFIGURER_NAME)
public ListenerContainerFactoryConfigurer listenerContainerFactoryConfigurer(
        KafkaConsumerBackoffManager kafkaConsumerBackoffManager,
        DeadLetterPublishingRecovererFactory deadLetterPublishingRecovererFactory,
        @Qualifier(RetryTopicInternalBeanNames.INTERNAL_BACKOFF_CLOCK_BEAN_NAME) Clock clock) {
    final var configurer = new ListenerContainerFactoryConfigurer(...);

    configurer.setContainerCustomizer(container -> {
        final var customErrorHandler = new CustomErrorHandler(container.getCommonErrorHandler());
        container.setRecordInterceptor(new CustomRecordInterceptor<>(customErrorHandler));
        container.setCommonErrorHandler(customErrorHandler);
    });

    return configurer;
}
@tomazfernandes
Contributor

Hi @v-chernyshev, thanks for bringing this up. I'm not sure why this logic is the way it is - @garyrussell can surely elucidate that. But it seems to me that if we don't do seeks in case the record is successfully recovered, in SeekUtils.doSeeks(), everything should work as expected, at least in this particular scenario.

About KafkaBackOffException, at first I can't see why it should have any special treatment in this case - if the DLPRF rethrows it, recovery won't have been successful and seeks will be performed to the first record's offset, as they should be.

@v-chernyshev
Author

But it seems to me that if we don't do seeks in case the record is successfully recovered, in SeekUtils.doSeeks(), everything should work as expected, at least in this particular scenario.

It may indeed be possible with some API tweaks. The main obstacle, in my opinion, is this block of code. As the iterator is drained before handleRemaining, normal processing can no longer resume.

@v-chernyshev
Author

About KafkaBackOffException, at first I can't see why it should have any special treatment in this case

Yeah, you are absolutely right. What I meant when I said that this exception is special is that targeted single-record recovery won't ever be possible for it.

@garyrussell
Contributor

I'm not sure why this logic is the way it is ...

It's historical (error handlers pre-dated the retryable topic code by several years).

However, I can see that this might be helpful, even for normal (blocking) retries, after a successful recovery; but the problem is really exacerbated by retryable topics because recovery is generally successful (at least for the first/main topic).

That said, one of the reasons for doing the seeks is to avoid a rebalance; if we do the retries without polling the consumer and the aggregate of the retry delays exceeds max.poll.interval.ms, Kafka will think we are dead.

I can see (at least) a couple of solutions:

  1. Decorate the iterator with one that can be reset to the position of the first remaining record and have handleRemaining return a boolean to indicate the iterator should be reset. (However, this could suffer from the rebalance problem).
  2. Use similar logic to the FallbackBatchErrorHandler (pause the consumer and invoke the listener with the remaining records).

consumer.pause(consumer.assignment());
try {
    while (nextBackOff != BackOffExecution.STOP) {
        consumer.poll(Duration.ZERO);
        try {
            ListenerUtils.stoppableSleep(container, nextBackOff);
        }
        catch (InterruptedException e1) {
            Thread.currentThread().interrupt();
            seeker.handleBatch(thrownException, records, consumer, container, () -> { });
            throw new KafkaException("Interrupted during retry", logLevel, e1);
        }
        if (!container.isRunning()) {
            throw new KafkaException("Container stopped during retries");
        }
        try {
            invokeListener.run();
            return;
        }
        catch (Exception e) {
            if (failed == null) {
                failed = recordsToString(records);
            }
            String toLog = failed;
            logger.debug(e, () -> "Retry failed for: " + toLog);
        }
        nextBackOff = execution.nextBackOff();
    }
    try {
        recoverer.accept(records, thrownException);
    }
    catch (Exception e) {
        logger.error(e, () -> "Recoverer threw an exception; re-seeking batch");
        seeker.handleBatch(thrownException, records, consumer, container, () -> { });
    }
}
finally {
    consumer.resume(consumer.assignment());
}

We could probably combine 1 & 2 in the listener container. I'll give it some thought.

@garyrussell garyrussell added this to the 3.0 Backlog milestone Mar 28, 2022
@tomazfernandes
Contributor

Just a couple of thoughts:

That said, one of the reasons for doing the seeks is to avoid a rebalance; if we do the retries without polling the consumer and the aggregate of the retry delays exceeds max.poll.interval.ms, Kafka will think we are dead.

Do we still have any retry mechanisms that rely on backing off and retrying several times in memory? If error handling doesn't include a wait (other than when we wait to seek to current again), the rebalancing problem might not be that much of an issue anymore.

I can see (at least) a couple of solutions:

  1. Decorate the iterator with one that can be reset to the position of the first remaining record and have handleRemaining return a boolean to indicate the iterator should be reset. (However, this could suffer from the rebalance problem).

Why do we need to drain the iterator early? Maybe we could just have the failed record and pass the iterator itself to the error handler, which we'd only iterate if we needed to. Of course, that would have API implications, so maybe it's not worth it.

Thanks

@garyrussell
Contributor

Do we still have any retry mechanisms that rely on backing off and retrying several times in memory?

Yes, SeekUtils.doSeeks() calls stoppableSleep() according to the next back off.

I didn't mean to imply that this would be a problem for non-blocking retries, but it would be a problem for blocking retries:

If we change the logic to never do seeks and the aggregate backoff is, say, 10 seconds per record and we have 500 records, we'd not call poll() for 5000 seconds.

I suppose we could change the logic to do the seeks whenever we are retrying and only continue from where we left off when recovery is successful; that would be ok (but that really wouldn't satisfy @v-chernyshev's complaint that the seeks aren't really necessary).

I think my final comment will solve it because we can pause/poll/resume the consumer during the retries as well.

As I said

I'll give it some thought.

Why do we need to drain the iterator early? Maybe we could just have the failed record and pass the iterator itself to the error handler, which we'd only iterate if we needed to. Of course, that would have API implications, so maybe it's not worth it.

Yes, that's one possibility.

Either way, these are non-trivial changes.

@tomazfernandes
Contributor

I suppose we could change the logic to do the seeks whenever we are retrying and only continue from where we left off when recovery is successful; that would be ok (but that really wouldn't satisfy @v-chernyshev's complaint that the seeks aren't really necessary).

Sure, that's what I had in mind, but I see your point.

I think my final comment will solve it because we can pause/poll/resume the consumer during the retries as well.

I see now that for blocking retries that fail this would probably be much better.

Thanks.

@garyrussell
Contributor

I have had some fresh thoughts on this. Stay tuned for a PoC early next week.

@tomazfernandes
Contributor

Looking forward to it @garyrussell!

garyrussell added a commit to garyrussell/spring-kafka that referenced this issue Apr 4, 2022
Resolves spring-projects#2195

Add an option to avoid seeks after handling exceptions.

Instead, pause the consumer for one `poll()` and use the remaining records as the
result of that poll.

New methods on `CommonErrorHandler` - `handleOne` for record listeners, returning
a boolean to indicate whether the record was recovered and should not be redelivered.

`handleBatchAndReturnRemaining` for batch listeners, returning either the complete
set or a subset, e.g. when the `DEH` receives a `BatchListenerExecutionFailedException`
and commits a partial batch.

Also includes the classifier refactoring discussed here
spring-projects#2185 (comment)

The new logic is disabled by default, we can consider enabling it in 3.0 and remove
the deprecations.
@garyrussell
Contributor

I issued a PR for review; please take a look if you get a chance.

@garyrussell garyrussell modified the milestones: 3.0 Backlog, 3.0.0-M4 Apr 5, 2022
@garyrussell garyrussell self-assigned this Apr 13, 2022
artembilan pushed a commit that referenced this issue Apr 13, 2022
Resolves #2195

Add an option to avoid seeks after handling exceptions.

Instead, pause the consumer for one `poll()` and use the remaining records as the
result of that poll.

New methods on `CommonErrorHandler` - `handleOne` for record listeners, returning
a boolean to indicate whether the record was recovered and should not be redelivered.

`handleBatchAndReturnRemaining` for batch listeners, returning either the complete
set or a subset, e.g. when the `DEH` receives a `BatchListenerExecutionFailedException`
and commits a partial batch.

Also includes the classifier refactoring discussed here
#2185 (comment)

The new logic is disabled by default, we can consider enabling it in 3.0 and remove
the deprecations.

* Fix race - do not call `resume()` on the container; the user might have paused after the error.

* Change Since to 2.9.

* Fix typos.

Co-authored-by: Artem Bilan <[email protected]>

* Remove unnecessary local variable; add docs.

* Polishing - see commit comment for more details

- move the resume logic to after the invokes and don't resume if pending records
- don't check `isPaused()` after empty poll due to errors; always restore the pending records

* Remove unnecessary boolean; fix deprecation warnings and delegating error handlers.

* Emergency stop container if the consumer returns records while paused after an error.

* Fix race in test - prevent consumer thread from changing pausedConsumers while the test thread is calling revoke/assign.

* Remove System.out().

* Add diagnostics to test.

* Fix race in test; wait until next poll after consumer thread pauses the partitions.

* Fix stubbing in emergency stop test.

* Remove unnecessary boolean.

**Cherry-pick to `2.9.x`**
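
Opting in to the new behaviour from application code might look roughly like this (a minimal sketch assuming the 2.9 `DefaultErrorHandler.setSeekAfterError(boolean)` setter; the recoverer, back-off values and bean wiring are illustrative):

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    // Recover to a dead-letter topic after two retries, one second apart (illustrative values).
    DefaultErrorHandler handler = new DefaultErrorHandler(
            new DeadLetterPublishingRecoverer(template), new FixedBackOff(1000L, 2L));
    // Opt in to the new behaviour: pause the consumer for one poll() and replay the
    // remaining records from memory instead of seeking all partitions after an error.
    handler.setSeekAfterError(false);
    return handler;
}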
artembilan pushed a commit that referenced this issue Jun 15, 2022
Related to: #2195

- Revert to seeking when `CommitFailedException` - due to a rebalance
- For batch listeners, move committing to the try block
- For record listeners, `AckMode.RECORD` already commits in the try block
- With `AckMode.BATCH`, detect a no-seek retry and call the error handler `handleRemaining`

**cherry-pick to 2.9.x**

* Fix test class names.
artembilan pushed a commit that referenced this issue Jul 5, 2022
Related to #2195

Do not process the remaining records if the first record is from a
paused partition.
artembilan pushed a commit that referenced this issue Jul 5, 2022
Related to #2195

- check `isPartitionPauseRequested()` instead of `isPartitionPaused()`
- add diagnostic debug logging
- prevent re-pausing the entire consumer each time a poll exits but the first
  partition is still paused.

**cherry-pick to 2.9.x**

* Checkstyle fix.
artembilan pushed a commit that referenced this issue Jul 6, 2022
Related to #2195

Similar to the recent fix when the first partition is paused, we should
not continue to process the remaining records if the container itself
was paused during the poll after an error occurred with `seekAfterError=false`.

**cherry-pick to 2.9.x**

* Remove unneeded copy/paste code from test.

* Fix creation of remaining records.

Previously, they were all inserted into the failed record's `TopicPartition` record list.
garyrussell added a commit that referenced this issue Jul 6, 2022
Use a `LinkedHashMap` for the remaining records so that the order is
retained.
Fix test after removing some logic in the last commit.
garyrussell added a commit that referenced this issue Mar 14, 2023
Don't reference deprecated methods on `CommonErrorHandler`.