[Bug?] Trying to join a consumer group indefinitely #854
Comments
Are you doing something funky with threads?
We initialize a consumer, subscribe to a few topics, and then call

I think this started to happen after we upgraded to 1.1.0. There are at least two PRs that make changes around that area, e.g. #818 and #817. That said, it is possible it was a Heroku issue, since it seems to fail due to timeouts and we haven't had any issues for a full week now.

Nevertheless, the client behaviour is not great. The thread that joins the consumer group just loops forever and when

Since it is impossible to avoid network errors, I was wondering whether a maximum number of attempts when trying to join a consumer group would be a good thing. In our use case, if the process died after failing to join the consumer group, it would restart and hopefully join the group successfully afterwards. That would avoid having to restart it manually.
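We effectively "solved" this by terminating a consumer that fails to join a group after a certain amount of time so it auto-restarts. Not pretty but it does the trick.

    Thread.new do
      # 10.minutes relies on ActiveSupport's Numeric extensions
      sleep 10.minutes
      # Nothing to do if the consumer has joined the group by now
      next if consumer.instance_variable_get(:@group).member?
      # Otherwise stop the consumer and exit so the process gets restarted
      consumer.stop
      exit 1
    end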
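For anyone who wants to reuse the workaround above, here is a minimal sketch of the same idea as a standalone helper. It is not part of ruby-kafka: `start_join_watchdog` and its `timeout`, `interval`, `joined`, and `on_timeout` parameters are hypothetical names. Unlike the snippet above, it polls the check so the watchdog thread finishes as soon as the group has been joined.

```ruby
# Hypothetical helper, not part of ruby-kafka.
# Polls `joined` every `interval` seconds; if it never returns true
# within `timeout` seconds, calls `on_timeout` once and stops.
def start_join_watchdog(timeout:, interval: 1, joined:, on_timeout:)
  Thread.new do
    # Monotonic clock so wall-clock adjustments do not affect the deadline
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    loop do
      break if joined.call
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
        on_timeout.call
        break
      end
      sleep interval
    end
  end
end
```

In our setup, `joined` would be a lambda wrapping the same `member?` check as above, and `on_timeout` would wrap the same `consumer.stop` and `exit 1` calls.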
If this is a bug report, please fill out the following:
Please verify that the problem you're seeing hasn't been fixed by the current master of ruby-kafka.
Doesn't seem like it has been fixed.
Steps to reproduce
Restart a consumer process (on Heroku). It only happens sometimes, and so far we have only noticed it when the restart is initiated by Heroku rather than by a normal deploy, but that could just be chance.
Expected outcome
The consumer would start, join the group, and start processing messages.
Actual outcome
The consumer stays in a loop trying to join the consumer group but fails indefinitely until it is restarted manually.
It starts by retrying like this:
Then like this:
Later on it will also start printing the following, while still trying to join the group:
Which is odd because I wouldn't expect it to fetch messages if it fails to join the consumer group. Regardless, no messages are processed.
When we eventually restart the process, the SIGTERM signal terminates the fetcher thread but it still continues to try to join the consumer group until it is SIGKILL'ed.
The new process only manages to join the consumer group a couple of minutes after the previous process is SIGKILL'ed. Here is an example when that happens and from there it behaves as expected.
(I will check how a normal restart behaves and see if there is anything interesting in the logs to compare with these ones.)
Should the client have a configuration to terminate itself if it can't join a group after a number of attempts? Anything else I can do to help understand what is happening here?
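To make the question concrete, here is a rough sketch of what a bounded join could look like from the caller's side. Nothing here is ruby-kafka API: `join_with_max_attempts`, `JoinAttemptsExceeded`, and the `max_attempts`, `base_delay`, and `max_delay` parameters are hypothetical names, and the block stands in for whatever call performs the join. It retries with a capped exponential delay and raises once the limit is reached, so the process can die and be restarted.

```ruby
# Hypothetical error and helper, not part of ruby-kafka.
class JoinAttemptsExceeded < StandardError; end

# Runs the block until it succeeds or `max_attempts` is reached.
# Waits base_delay * 2^(attempt - 1) seconds between attempts, capped at max_delay.
def join_with_max_attempts(max_attempts:, base_delay: 1, max_delay: 30)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError => e
    if attempt >= max_attempts
      raise JoinAttemptsExceeded, "gave up after #{attempt} attempts: #{e.message}"
    end
    sleep [base_delay * (2**(attempt - 1)), max_delay].min
    retry
  end
end
```

If the error is left unrescued, the process exits with a non-zero status, which is what would let Heroku restart it without manual intervention.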
It appears to be a concurrency issue but I can't pinpoint where or why.