Skip to content

return write error only when all retries have failed #452

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

abuchanan-nr
Copy link
Contributor

This is a potential fix for #451. It passes my quick-and-dirty local test.

This changes the code to only return an error (via the message "res" channel) when all write retries have failed.

This also moves the stat counters associated with a successful write (messages, bytes, and batchSize), so that they are not inflated by each call to write (i.e. each retry/attempt).

@VibhorGupta1991
Copy link

This change needs to be pushed really quick. I have been struggling to find the cause but finally came across this [#451 ].
This is affecting only 0.3.6 version as of now. @abuchanan-nr your change [#382 ] was merged on 22nd Feb in this version.

res <- nil
}
for _, m := range batch {
w.stats.messages.observe(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we take this out of the loop? It seems like it could be w.stats.messages.observe(int64(len(batch)))

@stevevls
Copy link
Contributor

stevevls commented Jun 3, 2020

Thanks for the fix @abuchanan-nr ! I traced through and agree with your assessment that there may not be a channel receiver on a second attempt.

I would prefer that we leave the meaning of the stats as-is, though. It's possible that someone out there is depending on them, so I'd prefer we don't change that and create a surprise for someone.

@jnjackins
Copy link
Contributor

Given the nature of the breakage I believe we should revert #382 in the meantime -- see #462. Since there are low expectations for the reliability of async writes, we should give preference to sync writes which are meant to be reliable, and they are currently broken. Also, we should be sure to add more tests to cover these changes -- currently there is no test coverage for the partitionWriter abstraction, which is mocked over in tests (specifically the only test we have for retries).

If we do want to move forward with doing retries in the partitionWriter, I'd suggest we add a new interface which abstracts the write method, which can have a simple (non-retrying) implementation, and then we can add a retryWriter implementation which wraps the simple one. partitionWriters can be configured to use the retryWriter implementation normally, and in tests the simple writer can be faked and we can still exercise the retry functionality.

Note that this would mean there is no more use for a partitionWriter abstraction, which currently only exists so that it can be faked in tests.

@achille-roussel
Copy link
Contributor

Considering we reverted the change that introduced the original regression, I believe we can close this PR. Feel free to reopen if it needs further discussion 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants