
Reduce lock held duration in ConcurrencyLimitingRequestThrottler #1957


Merged: 1 commit merged into apache:4.x from the concurrency-limit-throttler-unblock branch on Sep 13, 2024

Conversation

jasonk000 (Contributor)

When a throttled request proceeds to submission, handling its callback can take some (small) amount of time.

Before this change, the throttler invoked the request's proceed callback while holding the lock, preventing other tasks from proceeding even when there was spare capacity, and even preventing tasks from enqueuing until the callback completed.

By tracking the expected outcome, we can perform the callback outside of the lock. This means that request registration and submission can proceed even while a long callback is being processed.
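
To make the pattern concrete, here is a minimal, self-contained sketch of the "decide under the lock, invoke the callback outside it" idea. The class, fields and signalDone method are invented for this example; onThrottleReady mirrors the driver's callback terminology, but none of this is the driver's actual code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only, not the driver's implementation.
final class SketchThrottler {

  interface Throttled {
    void onThrottleReady(boolean wasDelayed);
  }

  private final ReentrantLock lock = new ReentrantLock();
  private final Deque<Throttled> waitQueue = new ArrayDeque<>();
  private final int maxConcurrentRequests;
  private int concurrentRequests;

  SketchThrottler(int maxConcurrentRequests) {
    this.maxConcurrentRequests = maxConcurrentRequests;
  }

  void register(Throttled request) {
    boolean admitted;
    lock.lock();
    try {
      if (concurrentRequests < maxConcurrentRequests) {
        concurrentRequests++;
        admitted = true;          // only record the decision under the lock
      } else {
        waitQueue.add(request);   // admitted later when capacity frees up
        admitted = false;
      }
    } finally {
      lock.unlock();
    }
    // The potentially slow callback (query plan, request encoding, ...) now
    // runs without blocking other register()/signal callers.
    if (admitted) {
      request.onThrottleReady(false);
    }
  }

  void signalDone() {
    Throttled next;
    lock.lock();
    try {
      next = waitQueue.poll();    // choose the next request under the lock
      if (next == null) {
        concurrentRequests--;     // capacity freed, nothing waiting
      }
    } finally {
      lock.unlock();
    }
    if (next != null) {
      next.onThrottleReady(true); // callback invoked after the lock is released
    }
  }
}
```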

@jasonk000 (Contributor, Author) commented Sep 9, 2024

FWIW, it looks like this would have a merge conflict with #1950, but the same transform can be applied and both would work together.

@jasonk000 (Contributor, Author) commented Sep 10, 2024

Some additional details

Before -- we can see the contention on the right-hand side showing up as futex calls; only one thread at a time is allowed to run LoadBalancing..::newQueryPlan or CqlRequestHandler::sendRequest, even when there is plenty of capacity in the limiter. So, although those components are designed to operate concurrently and the CQL execution happens in parallel, the preparation is forced to be single-threaded.
[profiling screenshot]

After -- query-plan generation flows freely; multiple queries can build their plans and submit them completely in parallel.
[profiling screenshot]

@tolbertam (Contributor) left a comment

Great catch @jasonk000 and fantastic analysis! It definitely does seem like this would cause requests to pile up while handling a request in onThrottleReady.

I'm a tentative +1, assuming this gets updated after #1950 is merged. Willing to give this a quick second look.

@tolbertam (Contributor) left a comment

Love what you did with making the test wait on the countdown latch 👍, that will definitely make the test reliable. Had a few small suggestions.

@tolbertam (Contributor) left a comment

Changes look great, thanks! Just a small suggestion to assert that threads complete.

@tolbertam (Contributor)

Everything looks great, thank you!

@clohfink self-requested a review September 13, 2024 14:45
@clohfink left a comment

reviewed internally +1

@tolbertam (Contributor)

@jasonk000, we just got #1950 in; I rebased locally and there were no conflicts on this branch (it does not compile, though).

I went ahead and created a JIRA for this: CASSANDRA-19922

With #1950 merged, I think we just need to have signalCancel updated in the same way as the changes you made for signalTimeout (a rough sketch of that shape follows after this comment).

Would you mind rebasing your branch, making that change, squashing all of your commits, and including this in the last line of the commit message?

patch by Jason Koch; Reviewed by Andy Tolbert and Chris Lohfink for CASSANDRA-19922

After that I can merge it and we'll get this included in the next release! 🎉
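
As a rough illustration of applying the same shape to signalCancel, continuing the hypothetical SketchThrottler from the sketch earlier in this thread (it reuses that example's lock, waitQueue and concurrentRequests fields; the driver's real method has more cases to handle):

```java
// Added to the SketchThrottler example above; illustrative only.
void signalCancel(Throttled cancelled) {
  Throttled next = null;
  lock.lock();
  try {
    if (!waitQueue.remove(cancelled)) {
      // The request had already been admitted: hand its slot to the next
      // waiter, or release the capacity if nobody is waiting.
      next = waitQueue.poll();
      if (next == null) {
        concurrentRequests--;
      }
    }
  } finally {
    lock.unlock();
  }
  if (next != null) {
    next.onThrottleReady(true); // callback runs after the lock is released
  }
}
```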

@jasonk000 force-pushed the concurrency-limit-throttler-unblock branch from 19aaa14 to aa6e5ea on September 13, 2024 16:27
@akhaku (Contributor) left a comment

+1, also reviewed internally

@jasonk000 (Contributor, Author)

Thank you @tolbertam, I appreciate the review feedback & guidance. Positive experience. Thanks!

@tolbertam (Contributor)

Thank you for the great fix and the tests @jasonk000! 🚀

@tolbertam merged commit 6d3ba47 into apache:4.x Sep 13, 2024
@charispav

Hi all,

While using the ConcurrencyLimitingRequestThrottler in our application (Java driver 4.17.0), when Cassandra gets overloaded and the throttling mechanism kicks in, we end up with a blocked-thread situation:
[thread dump screenshot]
As is evident, the throttler's lock mechanism leads to the Vert.x IO thread (worker thread) being blocked. In general, that thread pool is used for potentially blocking operations. This block causes a deadlock and hence makes the entire application unresponsive.

From an implementation perspective, we are using the reactive API (the executeReactive method) for async query execution, which in turn uses the throttling mechanism internally.

Does the improvement to the lock mechanism in this PR also mean that the blocking issue is fixed?
If not, how should this thread-blocking bug be handled? Should a separate Jira ticket be created?

@adutra (Contributor) commented Oct 21, 2024

@charispav this PR may reduce the symptoms you are seeing, but IMHO it won't get rid of them completely.

In fact this is explained in this chapter of the docs:

[Request throttlers] use locks internally, and depending on how many requests are being executed in parallel, the thread contention on these locks can be high: in short, if your application enforces strict lock-freedom, then these components should not be used.

If you are using the reactive API, I'd suggest that you try instead to throttle the upstream, e.g. using a token-bucket algorithm.
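
To illustrate that suggestion, here is a minimal token-bucket sketch; the class name, parameters and usage are invented for this example, and a real application might well prefer an existing rate-limiting library.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal non-blocking token bucket (illustrative only). A reactive caller
// checks tryAcquire() before invoking executeReactive() and defers or rejects
// the request itself instead of blocking a Vert.x thread.
final class TokenBucket {

  private static final class State {
    final double tokens;
    final long lastRefillNanos;
    State(double tokens, long lastRefillNanos) {
      this.tokens = tokens;
      this.lastRefillNanos = lastRefillNanos;
    }
  }

  private final double capacity;      // maximum burst size, in requests
  private final double refillPerNano; // requests added per nanosecond
  private final AtomicReference<State> state;

  TokenBucket(double requestsPerSecond, double burst) {
    this.capacity = burst;
    this.refillPerNano = requestsPerSecond / 1_000_000_000.0;
    this.state = new AtomicReference<>(new State(burst, System.nanoTime()));
  }

  /** Returns true if a request may proceed now; never blocks the caller. */
  boolean tryAcquire() {
    while (true) {
      long now = System.nanoTime();
      State current = state.get();
      double refilled =
          Math.min(capacity, current.tokens + (now - current.lastRefillNanos) * refillPerNano);
      if (refilled < 1.0) {
        return false; // no token available; defer or reject upstream
      }
      if (state.compareAndSet(current, new State(refilled - 1.0, now))) {
        return true;
      }
      // Lost a CAS race with another thread: re-read the state and retry.
    }
  }
}
```

With something like this in front of the driver, the application decides what to do with excess load (buffer, retry later, or fail fast) before any driver-internal lock is touched.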

@charispav

@adutra thanks for your prompt response.

In fact, even when Cassandra eventually returns to normal operation, the block remains, leaving the application in a hanging state.

How would you explain this? Why do the blocked threads not get released afterwards?

Does the deadlock happen between our own Vert.x threads and DataStax threads?

How could we eliminate (or at least mitigate) the issue if we prefer to keep this throttler rather than developing our own token-bucket-based solution? Is there any configuration option that might help?

@jasonk000 (Contributor, Author) commented Oct 22, 2024

@charispav

I think the current implementation should be good except under very extreme scenarios, in which case you probably want a different design anyway.

It may be possible to develop a more lock-free implementation, and I did prototype one for this, but it would likely need changes to semantics/behaviour to go lock-free in the processing path, and I haven't personally yet seen the need to develop it.

If you suspect the driver itself has a deadlock or performance issue, the best way to proceed would be to share some stack traces, in particular the lines where the code is stalled and the state of the queue/counters.
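
If it helps, a hypothetical helper along these lines can capture such stack traces from inside the application while it appears hung (standard java.lang.management APIs; the class and method names are invented for this example):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: dumps information for all live threads, requesting
// locked-monitor and locked-synchronizer details so lock owners are visible.
final class ThreadDumpUtil {
  static String capture() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    StringBuilder sb = new StringBuilder();
    for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
      sb.append(info);
    }
    return sb.toString();
  }
}
```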

@jasonk000 deleted the concurrency-limit-throttler-unblock branch October 22, 2024 17:02