
WriteMessages in Async mode is blocking #350


Closed
sagarkrkv opened this issue Sep 9, 2019 · 10 comments

@sagarkrkv
Contributor

sagarkrkv commented Sep 9, 2019

Describe the bug

The WriteMessages call should be non-blocking in async mode. However, when there are connection issues with the Kafka broker, i.e. when connections take too long to establish or are slow, we observed that a WriteMessages call that typically finishes in a microsecond blocks for a couple of minutes.

Kafka Version
Kafka 2.2

To Reproduce
It's hard to reproduce the exact scenario where this happens.

Expected behavior

In async mode, the WriteMessages call should drop messages instead of blocking, to avoid degrading system performance.

Additional context

We can either make the existing WriteMessages non-blocking, add a new config option to enable non-blocking mode, or add a new method to write messages in a non-blocking mode.
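
For reference, a minimal sketch of the setup in question, assuming the kafka-go WriterConfig API; the broker address, topic, and payload are placeholders rather than values from our actual deployment:

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Writer configured for async mode; WriteMessages is expected to enqueue
	// and return almost immediately.
	w := kafka.NewWriter(kafka.WriterConfig{
		Brokers: []string{"localhost:9092"}, // placeholder broker
		Topic:   "events",                   // placeholder topic
		Async:   true,
	})
	defer w.Close()

	// Normally this returns in about a microsecond, but once the internal
	// queue fills up (e.g. during broker connection issues) the call blocks.
	if err := w.WriteMessages(context.Background(), kafka.Message{Value: []byte("payload")}); err != nil {
		log.Printf("write error: %v", err)
	}
}
```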

@sagarkrkv added the bug label Sep 9, 2019
@achille-roussel
Contributor

@sagarkrkv you're correct; this could be better documented, but once the internal queue is full the call blocks.

Dropping messages can be controversial, especially if it happens during a Kafka outage. Backpressure can help control the amount of data loss that may occur, so it can be a preferable approach in some cases.

I'm not against changing the behavior; since it's not documented yet, I hope that no program depends on this implementation detail.

I'd love to get @stevevls's take on the topic as well.

In the meantime, you can pass a context with a timeout to unblock the call.
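
Something along these lines should work as a stopgap; the helper name and the 500ms budget are just illustrative:

```go
import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

// writeWithTimeout bounds how long WriteMessages may block when the
// writer's internal queue is full, e.g. during broker connection issues.
func writeWithTimeout(w *kafka.Writer, msg kafka.Message) {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	if err := w.WriteMessages(ctx, msg); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// The queue stayed full for the whole budget; the caller drops the message.
			log.Printf("kafka write timed out, dropping message")
			return
		}
		log.Printf("kafka write failed: %v", err)
	}
}
```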

@sagarkrkv
Contributor Author

Yes, we are currently adding a context with a timeout.

That said, async and backpressure are a little contradictory. I agree, though, that this might be a disruptive and unexpected change for users who have come to rely on this behavior. So if we want to minimize the disruption, we can consider either

adding a new config option to enable non-blocking mode, or adding a new method to write messages in a non-blocking mode.
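
For illustration, a non-blocking mode could look something like this hypothetical user-land wrapper (the dropWriter type, its buffer size, and the dropped counter are made up for the example; they are not an existing kafka-go API):

```go
import (
	"context"
	"log"
	"sync/atomic"

	"github.com/segmentio/kafka-go"
)

// dropWriter never blocks the caller: messages are handed off through a
// buffered channel and dropped when the buffer is full, while a single
// goroutine forwards them to the underlying kafka.Writer.
type dropWriter struct {
	ch      chan kafka.Message
	dropped uint64
}

func newDropWriter(w *kafka.Writer, buffer int) *dropWriter {
	d := &dropWriter{ch: make(chan kafka.Message, buffer)}
	go func() {
		for msg := range d.ch {
			// Errors are logged rather than returned, keeping the producer path non-blocking.
			if err := w.WriteMessages(context.Background(), msg); err != nil {
				log.Printf("kafka write failed: %v", err)
			}
		}
	}()
	return d
}

// Write enqueues the message if there is room and drops it otherwise.
func (d *dropWriter) Write(msg kafka.Message) {
	select {
	case d.ch <- msg:
	default:
		atomic.AddUint64(&d.dropped, 1)
	}
}
```

The point of the sketch is that the drop happens at the application boundary, so losses stay visible (via the counter) instead of surfacing as multi-minute stalls.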

@stevevls
Contributor

Coming late to this party. 😄 I think that if we add behavior to drop messages, it should definitely be guarded by a configuration flag that does not drop by default. Another option we already have is the QueueCapacity setting, which can be used to increase the amount of time before backpressure kicks in.
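
For example (the capacity value below is arbitrary, just to show the knob):

```go
import "github.com/segmentio/kafka-go"

// newTunedWriter returns an async writer with an enlarged internal queue.
// Raising QueueCapacity delays the point at which WriteMessages starts to
// block, but it does not remove the blocking behavior once the queue is full.
func newTunedWriter(brokers []string, topic string) *kafka.Writer {
	return kafka.NewWriter(kafka.WriterConfig{
		Brokers:       brokers,
		Topic:         topic,
		Async:         true,
		QueueCapacity: 10000, // example value; the default is much smaller
	})
}
```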

@sagarkrkv
Contributor Author

In a scenario where we have persistent connection issues with the Kafka cluster, or when the producer generates a large number of messages due to a traffic spike, QueueCapacity will not help much.

Is this something you're willing to accept a PR for, i.e. adding a config flag to make this non-blocking?

@illotum

illotum commented Jun 15, 2020

We encountered a similar issue in a different scenario. The Hash balancer combined with many hundreds of partitions leads to Writer.msgs becoming a bottleneck, because the scheduler can't keep up with draining it across all the runners.

Frankly, you seem to be going for a very "proper" Go design with nested runners and a hierarchy of channels, but it's counterproductive in this particular case. AFAIU partitionWriters always belong to one and only one Writer, so running them in goroutines only adds scheduling overhead. I'd rather see Writer.run() append messages to batches directly, or call the corresponding partitionWriter function synchronously. Very likely it would be faster (if you dispatch the actual network write in a goroutine).

Here's me raising QueueCapacity to 60k, only to delay the inevitable:
[Screenshot from 2020-06-15]

@illotum

illotum commented Jan 25, 2021

I missed your v0.4 merge; it sounds like this ticket can be closed. Is it resolved?

@gibsn

gibsn commented Mar 17, 2021

Hi! Is this issue still present in the latest release?

@illotum

illotum commented Mar 17, 2021

@gibsn I verified that it no longer bottlenecks on the queue. It happily chugs along in my application.

@achille-roussel
Contributor

Thanks for following up, everyone! Glad to hear 0.4 has been serving you well 👍

@gibsn

gibsn commented Apr 20, 2022

@illotum I missed your reply 😮 thank you
