Produce operation fails on retry #544
This is just an illustration of where I found the problems.

What happened: when rebooting a Kafka broker, the sync producer fails to retry. (I assume it only fails for clusters that require authentication.) This was the first exception I got when producing a message:

```
Kafka::ConnectionError: Connection refused - connect(2) for {}
/gems/ruby-kafka-0.5.2/lib/kafka/connection.rb:139 in rescue in open
/gems/ruby-kafka-0.5.2/lib/kafka/connection.rb:118 in open
/gems/ruby-kafka-0.5.2/lib/kafka/connection.rb:95 in block in send_request
/gems/activesupport-5.1.4/lib/active_support/notifications.rb:168 in instrument
/gems/ruby-kafka-0.5.2/lib/kafka/instrumenter.rb:19 in instrument
/gems/ruby-kafka-0.5.2/lib/kafka/connection.rb:94 in send_request
/gems/ruby-kafka-0.5.2/lib/kafka/sasl_authenticator.rb:39 in authenticate!
/gems/ruby-kafka-0.5.2/lib/kafka/connection_builder.rb:25 in build_connection
/gems/ruby-kafka-0.5.2/lib/kafka/broker.rb:141 in connection
/gems/ruby-kafka-0.5.2/lib/kafka/broker.rb:22 in to_s
/gems/ruby-kafka-0.5.2/lib/kafka/produce_operation.rb:103 in rescue in block in send_buffered_messages
/gems/ruby-kafka-0.5.2/lib/kafka/produce_operation.rb:82 in block in send_buffered_messages
/gems/ruby-kafka-0.5.2/lib/kafka/produce_operation.rb:81 in each
/gems/ruby-kafka-0.5.2/lib/kafka/produce_operation.rb:81 in send_buffered_messages
/gems/ruby-kafka-0.5.2/lib/kafka/produce_operation.rb:47 in block in execute
/gems/activesupport-5.1.4/lib/active_support/notifications.rb:168 in instrument
/gems/ruby-kafka-0.5.2/lib/kafka/instrumenter.rb:19 in instrument
/gems/ruby-kafka-0.5.2/lib/kafka/produce_operation.rb:41 in execute
/gems/ruby-kafka-0.5.2/lib/kafka/producer.rb:303 in block in deliver_messages_with_retries
/gems/ruby-kafka-0.5.2/lib/kafka/producer.rb:291 in loop
/gems/ruby-kafka-0.5.2/lib/kafka/producer.rb:291 in deliver_messages_with_retries
/gems/ruby-kafka-0.5.2/lib/kafka/producer.rb:241 in block in deliver_messages
/gems/activesupport-5.1.4/lib/active_support/notifications.rb:168 in instrument
/gems/ruby-kafka-0.5.2/lib/kafka/instrumenter.rb:19 in instrument
/gems/ruby-kafka-0.5.2/lib/kafka/producer.rb:234 in deliver_messages
```

Initial investigation suggested the problem was with `broker.to_s`, which was raising and preventing the cluster from being marked as stale. But once I fixed that, there was also an issue with the `cluster.fetch_cluster_info` method, which was failing on the broker that was currently shut down. This PR is only meant to illustrate the problems.
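The stack trace shows the secondary failure originating in `broker.to_s` inside a `rescue` block. A minimal, self-contained sketch of that failure mode (hypothetical stand-in classes, not the actual ruby-kafka API): the original error is rescued, but the handler itself touches the broker connection while building a log label, raises a second `ConnectionError`, and the mark-as-stale recovery step is never reached.

```ruby
class ConnectionError < StandardError; end

# Stand-in for a broker whose #to_s lazily opens a connection and can raise.
class FlakyBroker
  def to_s
    raise ConnectionError, "Connection refused"
  end
end

# Stand-in for the cluster state we want updated during error recovery.
class Cluster
  attr_reader :stale

  def initialize
    @stale = false
  end

  def mark_as_stale!
    @stale = true
  end
end

# Sketch of the produce path: the original failure is rescued, but when
# guard_to_s is false the handler calls broker.to_s, which raises a second
# ConnectionError before the cluster can be marked stale.
def handle_failure(broker, cluster, guard_to_s:)
  raise ConnectionError, "broker is down"
rescue ConnectionError
  label = guard_to_s ? "(broker unavailable)" : broker.to_s
  cluster.mark_as_stale!
  label
end

unguarded = Cluster.new
begin
  handle_failure(FlakyBroker.new, unguarded, guard_to_s: false)
rescue ConnectionError
  # the secondary error escapes; unguarded.stale stays false
end

guarded = Cluster.new
handle_failure(FlakyBroker.new, guarded, guard_to_s: true)
# guarded.stale is now true: recovery ran to completion
```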
Can you try to come up with a functional test that reproduces the problem?
As I mentioned, this might only be happening with clusters that require authentication :/ So it might be quite challenging to build a functional test that reproduces it. (I definitely won't have time today, but I might take a stab at it tomorrow.)
How many retries have you configured, and do you have DEBUG logs?
```diff
@@ -357,7 +357,14 @@ def fetch_cluster_info
        @logger.error "Failed to fetch metadata from #{node}: #{e}"
        errors << [node, e]
      ensure
-       broker.disconnect unless broker.nil?
+       begin
+         broker.disconnect unless broker.nil?
```
This seems to be a bug nevertheless. We could have a broker instance that doesn't have an open connection; have a look here:
https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/broker_pool.rb#L11-L30
and here: https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/broker.rb#L7-L14
When we call `disconnect`, it tries to open a connection, and that's when it fails.
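A simplified sketch of that lazy-connection pattern (a hypothetical `LazyBroker`, condensed from the linked files, not the real implementation): the constructor only records state, the socket is opened memoized-on-first-use, so a naive `disconnect` on a never-connected broker forces a connection attempt.

```ruby
class ConnectionError < StandardError; end

# Simplified stand-in for Broker: construction is cheap; the socket
# (and, in the real gem, SASL authentication) happens lazily in #connection.
class LazyBroker
  def initialize(reachable:)
    @reachable = reachable
    @connection = nil
  end

  def connection
    @connection ||= begin
      raise ConnectionError, "connect(2) refused" unless @reachable
      :open_socket # stands in for a real Connection object
    end
  end

  # Naive disconnect: touching #connection on a never-connected broker
  # forces a connection attempt, which raises when the node is down.
  def disconnect
    connection
    @connection = nil
  end
end

raised = begin
  LazyBroker.new(reachable: false).disconnect
  false
rescue ConnectionError
  true
end
# raised is true: disconnecting an unused broker attempted a connect
```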
So unless I'm really missing something, `@broker_pool.connect(node.hostname, node.port)` doesn't actually connect; it just creates a broker instance. `broker.fetch_metadata` is what creates the actual socket connection.
Can you create a separate PR where Broker only tries to disconnect if there's an open connection?
yeah, that makes sense.
tbh, it's a bit tricky: when calling `connection`, it already fails with `ConnectionError`. That's because in the connection builder we try to authenticate, which opens a socket to a service that is unavailable.
https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/connection_builder.rb#L25
https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/sasl_authenticator.rb#L39-L45
I could do this:

```ruby
def disconnect
  connection.close
rescue ConnectionError
  nil
end
```
but there are other areas in the code that call `connection` and expect it not to fail, like `broker.to_s`.
Try adding this to `Broker`:

```ruby
def connected?
  !@connection.nil?
end
```

Then in `disconnect` do:

```ruby
def disconnect
  connection.close if connected?
end
```
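Put together on a minimal stand-in class (a sketch, not the real `Broker`), the guard makes `disconnect` a no-op when no connection was ever opened, so it cannot raise on an unreachable node:

```ruby
# Minimal stand-in Broker demonstrating the suggested guard.
class GuardedBroker
  def initialize
    @connection = nil # no socket is opened at construction time
  end

  def connected?
    !@connection.nil?
  end

  def disconnect
    # Only close if a connection actually exists; never trigger a
    # lazy connection attempt just to tear one down.
    @connection.close if connected?
    @connection = nil
  end
end

broker = GuardedBroker.new
broker.disconnect # never connected: no-op instead of ConnectionError
```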
2, but it never gets to execute the 2nd retry, as it fails earlier. I can provide the debug logs.
I opened a new PR: #545
What happened:
When rebooting a Kafka broker, the sync producer was failing to retry sending messages. (I assume it only fails for clusters that require authentication.)

FYI: this is just an illustration of where I found the problems while manually testing that scenario against an integration cluster.

The first exception I got when producing a message (the stack trace above) indicated that the problem was with `broker.to_s`, which was failing and prohibiting marking the cluster as stale. Once I "fixed" that, there was also an issue with the `cluster.fetch_cluster_info` method, which was failing on the broker that was being rebooted (in the `ensure` close). This PR is only to indicate the problems, and to get some guidance on how to fix them properly.