Skip to content

Fix potential lock on context cancel #391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 4, 2022

Conversation

hannahhoward
Copy link
Collaborator

@hannahhoward hannahhoward commented Jul 8, 2022

Goals

Fix an identified potential lock on imperative public function calls that take a context that can be cancelled.

Why

The graphsync RequestManager and ResponseManager run an internal run loop in a seperate go-routine. Public function calls access communicate with this run loop via a message queue channel. Sometimes these public function calls wait for a response on a channel, in order to return a value to the caller. When this is done, the message send into the run loop can abort if the calling context passed to the public method is cancelled. However, we were failing to account for this context when reading back out the response, which could cause a hang if the context passed into the public function is already cancelled at the time of calling. We found this did crop up in the wild.

Implementation

Add a calling context cancel check on the select statements in every place where we read a return value

For discussion

Contexts and channels suck and boy this code is verbose when can we use generics.

This is PR'd against a 0.13.3 release branch forked from v0.13.2 to avoid 0.14.x changes that are for go-data-transfer v2

@hannahhoward hannahhoward requested a review from rvagg July 8, 2022 03:52
@hannahhoward hannahhoward changed the base branch from main to release/v0.13.3 July 8, 2022 03:52
Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, although I haven't double-checked that you got all the public functions in both client.go's so I'm trusting you're all over that

@hannahhoward hannahhoward merged commit 07412b0 into release/v0.13.3 Oct 4, 2022
gammazero added a commit that referenced this pull request Mar 28, 2023
rvagg pushed a commit that referenced this pull request Mar 29, 2023
hannahhoward added a commit that referenced this pull request Apr 7, 2023
hannahhoward added a commit that referenced this pull request Aug 1, 2023
a previous fix #391, attempted to address a lock that occurred on context cancel. however in doing
so, it introduced a new lock. essentially, if a message was not sent to the
requestmanager/responsemanager go routine, waiting for a response to that message could last
indefinitely. the previous fix therefore stopped waiting when the calling context cancelled. However
once a message reaches the go routine of the requestmanager/responsemanager, it's important that
it's processed to completion, so that the the message loop doesn't lock. The proper fix is to detect
when the message is sent successfully, and if so, wait for it to be processed. If it isn't sent, we
can safely abort the go routine immediately.
hannahhoward added a commit that referenced this pull request Aug 1, 2023
a previous fix #391, attempted to address a lock that occurred on context cancel. however in doing
so, it introduced a new lock. essentially, if a message was not sent to the
requestmanager/responsemanager go routine, waiting for a response to that message could last
indefinitely. the previous fix therefore stopped waiting when the calling context cancelled. However
once a message reaches the go routine of the requestmanager/responsemanager, it's important that
it's processed to completion, so that the the message loop doesn't lock. The proper fix is to detect
when the message is sent successfully, and if so, wait for it to be processed. If it isn't sent, we
can safely abort the go routine immediately.
hannahhoward added a commit that referenced this pull request Aug 2, 2023
* fix(cancellation): handle message cancellation properly

a previous fix #391, attempted to address a lock that occurred on context cancel. however in doing
so, it introduced a new lock. essentially, if a message was not sent to the
requestmanager/responsemanager go routine, waiting for a response to that message could last
indefinitely. the previous fix therefore stopped waiting when the calling context cancelled. However
once a message reaches the go routine of the requestmanager/responsemanager, it's important that
it's processed to completion, so that the the message loop doesn't lock. The proper fix is to detect
when the message is sent successfully, and if so, wait for it to be processed. If it isn't sent, we
can safely abort the go routine immediately.

* fix(race): resolve race condition with test responses
hannahhoward added a commit to filecoin-project/boost-graphsync that referenced this pull request Aug 15, 2023
* fix(cancellation): handle message cancellation properly

a previous fix ipfs#391, attempted to address a lock that occurred on context cancel. however in doing
so, it introduced a new lock. essentially, if a message was not sent to the
requestmanager/responsemanager go routine, waiting for a response to that message could last
indefinitely. the previous fix therefore stopped waiting when the calling context cancelled. However
once a message reaches the go routine of the requestmanager/responsemanager, it's important that
it's processed to completion, so that the the message loop doesn't lock. The proper fix is to detect
when the message is sent successfully, and if so, wait for it to be processed. If it isn't sent, we
can safely abort the go routine immediately.

* fix(race): resolve race condition with test responses
hannahhoward added a commit to filecoin-project/boost-graphsync that referenced this pull request Aug 15, 2023
* fix(cancellation): handle message cancellation properly

a previous fix ipfs#391, attempted to address a lock that occurred on context cancel. however in doing
so, it introduced a new lock. essentially, if a message was not sent to the
requestmanager/responsemanager go routine, waiting for a response to that message could last
indefinitely. the previous fix therefore stopped waiting when the calling context cancelled. However
once a message reaches the go routine of the requestmanager/responsemanager, it's important that
it's processed to completion, so that the the message loop doesn't lock. The proper fix is to detect
when the message is sent successfully, and if so, wait for it to be processed. If it isn't sent, we
can safely abort the go routine immediately.

* fix(race): resolve race condition with test responses
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants