
Race condition in TarantoolClientImpl #142

Closed
nicktorwald opened this issue Mar 27, 2019 · 0 comments
Labels
bug (Something isn't working)

Comments

@nicktorwald

It's possible to reach an inconsistent client state when a reconnection happens.
This leaves the client broken and unable to process user requests anymore.

nicktorwald added the bug label Mar 27, 2019
nicktorwald pushed a commit that referenced this issue Mar 29, 2019
- Avoid a possible race between the reading, writing, and reconnecting
threads when a reconnection process is started.
A lagging thread (reader or writer) could reset the state to RECONNECT
after the reconnecting thread had already started and set the state
back to 0. As a result, no further reconnection attempts would ever
happen. Now the reconnecting thread holds the state for as long as
required.

- Avoid another possible race between the reading and writing threads
when they are started during the reconnection process.
One of the threads could crash while starting, and the other, slightly
lagging thread could still set its flag. The reconnecting thread would
then see a RECONNECT + R/W state instead of pure RECONNECT, which again
broke all further reconnection attempts. Now the reading and writing
threads check whether the RECONNECT state is already set.

- Avoid using the LockSupport class directly to suspend and wake up a
thread. LockSupport is really an internal component for building
higher-level blocking primitives, and using the class directly is not
recommended. It has been replaced with a ReentrantLock Condition, which
is built on LockSupport but uses it correctly (see the sketch below).

Fixes: #142
Affects: #34, #136
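
A minimal sketch of this park/unpark replacement, assuming an invented
ReconnectSignal helper rather than the actual TarantoolClientImpl
internals. Because the waiter re-checks a shared flag under the lock, a
reconnect request sent before the connector starts waiting is never
lost:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class ReconnectSignal {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition needsReconnect = lock.newCondition();
    private boolean reconnectRequested = false;

    // Called by a reader/writer thread that hit an I/O error.
    void requestReconnect() {
        lock.lock();
        try {
            reconnectRequested = true;
            needsReconnect.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Called by the connector thread. Unlike a bare LockSupport.park(),
    // a request made before this thread starts waiting is never lost:
    // the flag is re-checked under the lock on every wakeup.
    void awaitReconnectRequest() throws InterruptedException {
        lock.lock();
        try {
            while (!reconnectRequested) {
                needsReconnect.await();
            }
            reconnectRequested = false;
        } finally {
            lock.unlock();
        }
    }
}
```
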
nicktorwald added a commit that referenced this issue Apr 1, 2019
- Avoid a possible race between the reading, writing, and reconnecting
threads when a reconnection process is started.
A lagging thread (reader or writer) could reset the state to RECONNECT
after the reconnecting thread had already started and set the state
back to 0. As a result, no further reconnection attempts would ever
happen. Now the reconnecting thread holds the state for as long as
required.

- Avoid another possible race between the reading and writing threads
when they are started during the reconnection process.
One of the threads could crash while starting, and the other, slightly
lagging thread could still set its flag. The reconnecting thread would
then see a RECONNECT + R/W state instead of pure RECONNECT, which again
broke all further reconnection attempts. Now the reading and writing
threads check whether the RECONNECT state is already set (see the
sketch below).

- Replace LockSupport with a ReentrantLock Condition for suspending and
waking up threads. Our cluster tests and a standalone demo app show
that LockSupport is not as reliable a memory barrier as one might
expect: the reconnect thread relies on a visibility guarantee between
park/unpark invocations that, in practice, sometimes does not hold.
Also, according to the Javadoc, LockSupport is an internal component
for building higher-level blocking primitives, and using the class
directly is not recommended. It has been replaced with a ReentrantLock
Condition, which is built on LockSupport but uses it correctly.

Fixes: #142
Affects: #34, #136
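
To illustrate the first two fixes, here is a hedged sketch of a
CAS-based flag state. The flag names follow the commit message; the
ClientState class and its methods are invented for the example and are
not the client's real API:

```java
import java.util.concurrent.atomic.AtomicInteger;

class ClientState {
    static final int READING   = 1;      // reader thread is active
    static final int WRITING   = 1 << 1; // writer thread is active
    static final int RECONNECT = 1 << 2; // reconnection in progress

    private final AtomicInteger state = new AtomicInteger(0);

    // A reader/writer sets its flag only if no reconnection is in
    // progress, so RECONNECT is never mixed with the R/W flags.
    boolean tryAcquire(int flag) {
        while (true) {
            int current = state.get();
            if ((current & RECONNECT) != 0) {
                return false; // back off: the connector owns the state
            }
            if (state.compareAndSet(current, current | flag)) {
                return true;
            }
        }
    }

    void release(int flag) {
        state.getAndUpdate(current -> current & ~flag);
    }

    // RECONNECT can only be set from the all-clear state. While the
    // connector holds RECONNECT for the whole reconnection, a lagging
    // reader or writer calling this simply fails instead of re-arming
    // the state behind the connector's back.
    boolean trySignalReconnect() {
        return state.compareAndSet(0, RECONNECT);
    }
}
```
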
nicktorwald added a commit that referenced this issue Apr 13, 2019
Refactor the SocketChannelProvider implementations. There are now two:
SingleSocketChannelProviderImpl and RoundRobinSocketProviderImpl, used
by the simple and cluster clients respectively. To achieve this, a
common BaseSocketChannelProvider was extracted.

Add a service discovery implementation based on a Tarantool stored
procedure, which is called to obtain a fresh list of cluster nodes.

Integrate service discovery with the current cluster client. The
client is now aware of node changes via a configurable background job
that periodically checks whether the node addresses have changed. If
so, the client refreshes the RoundRobinSocketProviderImpl, replacing
the old nodes with the new ones. Extra effort is also required when
the current node is missing from the new list: we need to reconnect to
another node as soon as possible, with minimal delay between client
requests. To achieve this, we currently try to catch the moment when
all in-progress requests have completed and the reconnection process
can finish.

Closes: #34
See also: #142
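
A rough sketch of such a background refresh job, assuming hypothetical
fetchNodeList() and refreshProvider() placeholders for the
stored-procedure call and the RoundRobinSocketProviderImpl update:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ClusterDiscoveryJob {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile List<String> knownNodes = Arrays.asList("localhost:3301");

    void start(long periodSeconds) {
        scheduler.scheduleWithFixedDelay(() -> {
            List<String> fresh = fetchNodeList();
            // Refresh the provider only when the address list changed.
            if (!fresh.equals(knownNodes)) {
                knownNodes = fresh;
                refreshProvider(fresh);
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    // Placeholder: the real client obtains this list by calling a
    // Tarantool stored procedure.
    private List<String> fetchNodeList() {
        return knownNodes;
    }

    // Placeholder: the real client updates the addresses inside
    // RoundRobinSocketProviderImpl and, if the current node vanished
    // from the list, reconnects to another node with minimal delay.
    private void refreshProvider(List<String> nodes) {
    }
}
```
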
nicktorwald added a commit that referenced this issue Apr 15, 2019
The state of a client is a set of the following flags: {READING,
WRITING, RECONNECT, CLOSED}. Let's call the state with no flags set
UNINITIALIZED.

A reader thread sets READING, reads until an error or an interruption
occurs, drops READING, and tries to trigger reconnection (if the state
allows; see below). A writer does much the same, but with the WRITING
flag. The key point is that reconnection is triggered from a
reader/writer thread, and only when certain conditions are met.

The prerequisite for setting RECONNECT and signaling (unparking) the
connector thread is that the client is in the UNINITIALIZED state.

There are several problems here:

- Say a reader stalls for a moment after dropping READING, then a
writer drops WRITING and triggers reconnection. The reader then wakes
up and sets RECONNECT again.

- Calling unpark() N times on a live connector thread does not make
its next N park() calls return immediately, so the problem above is
not just an extra reconnection: it can leave the connector thread
stuck.

- Say a reader stalls just before setting READING. A writer is hit by
an I/O error and triggers reconnection (sets RECONNECT, unparks the
connector). The reader then wakes up and produces a READING+RECONNECT
state, which stops the connector thread from proceeding (it expects
pure RECONNECT). Even when the reader drops READING, it will not wake
up (unpark) the connector thread, because RECONNECT is already set
(the state is not UNINITIALIZED).

This commit introduces several changes that eliminate the problems
above:

- Use ReentrantLock + Condition instead of park() / unpark(), so that
a reconnect signal is never missed, no matter whether the connector is
parked.

- Ensure that the reader and writer threads of one generation (those
created on the same reconnection iteration) trigger reconnection only
once (see the sketch after this list).

- Hold the RECONNECT state for most of the time the connector works
(while acquiring a new socket, connecting, and reading the Tarantool
greeting), and prevent READING/WRITING from being set while RECONNECT
is set.

- Ensure that a new reconnection iteration starts only after the old
reader and writer threads have exited (because we cannot receive a
reconnection signal until we have sent it).

Fixes: #142
Affects: #34, #136
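
A minimal sketch of the once-per-generation trigger, using an invented
ConnectionGeneration helper (the real client may structure this
differently): the reader and writer created on one reconnection
iteration share a flag, so whichever fails first wins and the other's
attempt is a no-op.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class ConnectionGeneration {
    // A fresh instance is created for each reconnection iteration and
    // handed to both the reader and the writer thread of that iteration.
    private final AtomicBoolean reconnectTriggered = new AtomicBoolean(false);

    // Returns true for exactly one caller per generation, so the
    // reader and writer of one iteration trigger reconnection once,
    // and threads of an old generation cannot trigger it again.
    boolean tryTriggerReconnect() {
        return reconnectTriggered.compareAndSet(false, true);
    }
}
```
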