Race condition in TarantoolClientImpl #145
Conversation
Force-pushed from 1b54ae0 to 8d8aed9.
Added a 'FIXUP' commit on top of your changes: added comments and XXX notes.
See also comments below.
Can we trigger the situations that are described in the commit message and write test cases for them? It would guarantee that, say, a refactoring of StateHelper will really fix the races (it is hard to verify by eye). AFAIU, all the described cases can be triggered if we pause one of the threads (or several of them?) at a certain point of execution. We use the errinj (error injection) approach in Tarantool itself to test such cases. An errinj is a global variable that can be set from a test and leads to an infinite loop at one certain execution point; the loop exits when the variable is changed back. We also have errinjs that pause a thread for some amount of time, but the timeout approach often leads to test failures under load (say, a parallel test run), so the preferred way is to pause / unpause from the test code explicitly. Errinj code is included only in debug builds, and the testing harness is able to determine whether tests are run against a debug or a release build and skip certain tests for a release build.
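For illustration only, here is a minimal sketch of what an errinj-style pause point could look like on the Java side. The `ErrInj` class and its names are hypothetical; nothing like it exists in the connector or in Tarantool's own test harness.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical errinj-style pause point, compiled into debug/test builds only.
final class ErrInj {
    private static volatile CountDownLatch pausePoint = null;

    // Test side: install a pause at the injection point.
    static void arm() {
        pausePoint = new CountDownLatch(1);
    }

    // Test side: let the paused thread continue explicitly (no timeouts).
    static void release() {
        CountDownLatch latch = pausePoint;
        pausePoint = null;
        if (latch != null) {
            latch.countDown();
        }
    }

    // Production-code side: a no-op unless a test armed the injection point.
    static void maybePause() {
        CountDownLatch latch = pausePoint;
        if (latch != null) {
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

A test would call `ErrInj.arm()`, drive the scenario so the target thread stops inside `maybePause()` at the chosen execution point, perform the concurrent action under test, and then call `ErrInj.release()`, instead of relying on sleeps that fail under load.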
Also, I would highlight in the commit message the specific problem that you observed with LockSupport. Right now it is not obvious that you decided to replace it because of a concrete problem, not just in case.
I thought about how to better represent a connection state and want to share my current vision in a sketchy form. Sorry, it is not very well thought out, more like a stream of consciousness. Consider this 'a bag of ideas' and don't get stuck if some part is unclear (it may simply be wrong).

There are five actors: reader, writer, connector, discovery and service task executor. The last one performs several actions: say, disconnect and then connect, which is needed for discovery. (Also, maybe 'connect' is more like one of the service task executor's actions and the connector should not be considered a separate actor?)

There are generations of a state. Each actor should only access the state (say, the socket) of its own generation, maybe by copying it internally. This is needed for two reasons. First, to ensure that an actor has no way to somehow change the state of another generation: say, even if a writer of a previous generation stalls for a long time, it cannot wake up and write to a socket of the next generation (one that was created after reconnection). Second, to handle messages (FSM method calls) from different actors and, say, not reconnect twice when the reader and the writer both report an error.

The FSM should accept all messages and queue them. Each message should contain a generation. Queueing allows us to consider actions as atomic and reason in terms of 'what to do if X succeeds/fails and then Y arrives'. We need to elaborate all possible combinations. All state that is shared between actors should be accessed only via messages to the FSM (maybe name it a coordinator?). Maybe the state should be held in the FSM / FSM context class?

Each state change should be atomic together with the corresponding actions. So we either have the 'stopped' state or 'connected', not something in between. If a connection attempt fails, we stay in the stopped state (who should trigger the next attempt to connect in that case? A service task?). As I see it, three states are enough: stopped, connected and closed (the terminal one). I am not sure we need a terminal state: we could have 'stopped, but trying to connect' and 'stopped and not trying to connect'. Or maybe it does not make much sense if we create a new generation with its own state for each connect attempt? Not sure.

We should decide who owns a state / context: say, who is responsible for closing a socket. It seems that an actor cannot be responsible for that, so a coordinator should. There is also the thought that splitting the state between generations looks like half of the work needed to handle several connections within one client. OTOH, I am not sure we need that.

An actor should be allowed to subscribe to certain state changes. A reader and a writer should subscribe only to their own generation, while discovery exists across generations, so maybe it should subscribe to messages regardless of generation. So when a writer sends a 'write error' message to the coordinator, the coordinator creates a service task that sends the reader a 'going to disconnect' message, then closes the socket and changes the state to 'stopped'. Something like that. We need to elaborate which outgoing messages actors should subscribe to. Should it be only state changes? It seems not, because we want to notify the reader about an error in the writer before closing the socket, yes? We also need to take care of propagating an error message to a user, see also #30.

We need to plan that work and decide whether it is worth doing in the scope of fixing the races or postponing it as a refactoring task.
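To make the 'generations' idea a bit more concrete, here is a rough, non-authoritative sketch under the assumptions above. `Coordinator`, `Message` and the message kinds are invented names for illustration, not a proposal for the actual class layout.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a coordinator that processes generation-tagged messages.
final class Coordinator implements Runnable {

    // A message always carries the generation of the actor that sent it.
    static final class Message {
        final int generation;
        final String kind; // e.g. "io-error", "greeting-read", ...

        Message(int generation, String kind) {
            this.generation = generation;
            this.kind = kind;
        }
    }

    private final BlockingQueue<Message> inbox = new LinkedBlockingQueue<>();
    private int currentGeneration = 1;

    void send(Message message) {
        inbox.offer(message);
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Message message = inbox.take();
                // A stalled reader/writer of an old generation cannot affect
                // the new connection: its messages are simply dropped.
                if (message.generation != currentGeneration) {
                    continue;
                }
                if ("io-error".equals(message.kind)) {
                    // Close the current generation's socket, go to 'stopped'
                    // and start the next generation. A duplicate "io-error"
                    // from the other thread of the same generation is filtered
                    // out by the generation bump below.
                    currentGeneration++;
                    // ... schedule a service task that reconnects ...
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The point of the generation check is that a message from a stalled reader or writer of an old connection is dropped, so it can neither touch the new socket nor trigger a second reconnection.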
Force-pushed from c6b0320 to 217646d.
Looks okay.
Force-pushed from 953da96 to 1bb1f73.
Here I propose changes that make awaiting reconnection look more 'native' to the surrounding code, by using CountDownLatches instead of ReentrantLock + Condition. I am not sure whether the change is good from a performance point of view. Maybe it is not. So, please, consider this patch as an RFC. I have stripped the comments about synchronization tricks, because I thought about rewriting them to be clearer (at least for me), but that attempt failed. I'll start a discussion in PR #145 about them.
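For context, a minimal sketch of the CountDownLatch approach being discussed; `ConnectionGate` and its methods are made-up names, not the actual fields of TarantoolClientImpl.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: awaiting (re)connection with a CountDownLatch.
final class ConnectionGate {
    // Replaced atomically on every reconnection iteration.
    private volatile CountDownLatch connected = new CountDownLatch(1);

    // Threads that must wait until the client is connected again.
    boolean awaitConnected(long timeout, TimeUnit unit) throws InterruptedException {
        return connected.await(timeout, unit);
    }

    // Connector thread: the new socket is ready, release all waiters.
    void markConnected() {
        connected.countDown();
    }

    // Connection lost: future waiters block on a fresh latch.
    void markDisconnected() {
        connected = new CountDownLatch(1);
    }
}
```

Since a CountDownLatch cannot be reset, each disconnect has to swap in a fresh latch, which is part of why it is debatable whether this is cleaner than ReentrantLock + Condition.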
Hi! I have proposed some changes in Totktonada/gh-142-race-condition-proposal (80607e2). Here I just replace ReentrantLock + Condition with CountDownLatches to make the code look more consistent. I am not sure it is a good change, so, please, be critical. I also have a point to discuss. I tried to understand the cases you described in the commit messages and found it hard. I propose to describe them in a bit more formal way: provide sketchy pseudocode with thread actions and use it in the commit message and in the code comments.
Now we can provide a formal case:
Here I failed to make the threads get stuck and everything looks okay. I assume here that park / unpark (or Condition, or CountDownLatch) works as expected. Can you please provide the cases you described in the commit message in that way? Right now I am not sure we should move the RECONNECT state drop.
Hm, it seems the proposal doesn't really work: https://travis-ci.org/tarantool/tarantool-java/jobs/519217538
Let me extend your notation a bit with IO-thread generations, say, a G-suffix: G1 stands for the first generation. Case 1 (assume the client is alive)
Case 2 (assume the client is alive)
Thanks for the cases! I want to propose another wording for them. I cited your original commit message and propose, below, some wording that I think reads more clearly.
A state of a client is a set of the following flags: {READING, WRITING, RECONNECT, CLOSED}. Let's name the state when no flags are set UNINITIALIZED. A reader thread sets READING, performs reading until an error or an interruption, drops READING and tries to trigger reconnection (if the state allows, see below). A writer does quite the same, but with the WRITING flag. The key point here is that a reconnection is triggered from a reader/writer thread and only when certain conditions are met. The prerequisite to set RECONNECT and signal (unpark) a connector thread is that the client has the UNINITIALIZED state. There are several problems here:
- Say, a reader stalls a bit after dropping READING, then a writer drops WRITING and triggers reconnection. Then the reader wakes up and sets RECONNECT again.
- Calling unpark() N times for a connector thread while it is alive does not lead to skipping the next N park() calls, so the problem above is not just about an extra reconnection, it also leads to the connector thread getting stuck.
- Say, a reader stalls just before setting READING. A writer is hit by an IO error and triggers reconnection (sets RECONNECT, unparks the connector). Then the reader wakes up and sets the READING+RECONNECT state, which disallows the connector thread to proceed further (it expects pure RECONNECT). Even when the reader drops READING, it will not wake up (unpark) the connector thread, because RECONNECT was already set (the state is not UNINITIALIZED).
This commit introduces several changes that eliminate the problems above:
- Use ReentrantLock + Condition instead of park() / unpark() to never miss signals to reconnect, no matter whether the connector is parked.
- Ensure the reader and the writer threads of one generation (that are created on the same reconnection iteration) trigger reconnection only once.
- Hold the RECONNECT state most of the time the connector works (while acquiring a new socket, connecting and reading the Tarantool greeting) and prevent setting READING/WRITING while RECONNECT is set.
These changes do not guarantee that a stalled reader/writer thread from a previous connection iteration will not try to read/write something from/into a new socket and will not try to change the state, but they greatly decrease the probability of this situation. @nicktorwald What do you think? Does it make things clearer? Sorry that I spent so much of your time on that.
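As an illustration of the first change (ReentrantLock + Condition instead of park() / unpark()), here is a minimal, hypothetical sketch of a reconnect signal that cannot be missed; the names are invented and this is not the actual patch.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: a reconnect request recorded under a lock, so it is never
// lost regardless of whether the connector thread is already waiting.
final class ReconnectSignal {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition requested = lock.newCondition();
    private boolean reconnectRequested = false;

    // Reader/writer side: unlike a bare unpark(), the flag survives until consumed.
    void request() {
        lock.lock();
        try {
            reconnectRequested = true;
            requested.signal();
        } finally {
            lock.unlock();
        }
    }

    // Connector side: waits in a loop, so spurious wakeups are harmless.
    void awaitRequest() throws InterruptedException {
        lock.lock();
        try {
            while (!reconnectRequested) {
                requested.await();
            }
            reconnectRequested = false;
        } finally {
            lock.unlock();
        }
    }
}
```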
A bit more about comments.
Force-pushed from 7a3b334 to 08d39c6.
Now I understand it is not so: we guarantee that a new reconnection iteration will not start until both the old reader and the old writer exit. So the cited sentence is not correct. I would also expand the 'changes that eliminate the problems above' list with the following:
- Ensure a new reconnection iteration will start only after the old reader and old writer threads exit (because we cannot receive a reconnection signal until we send it).
Force-pushed from 08d39c6 to bdbaac8.
@nicktorwald Please file an issue about the connection state handling refactoring once this PR is merged.
A state of a client is a set of the following flags: {READING, WRITING, RECONNECT, CLOSED}. Let's name the state when no flags are set UNINITIALIZED. A reader thread sets READING, performs reading until an error or an interruption, drops READING and tries to trigger reconnection (if the state allows, see below). A writer does quite the same, but with the WRITING flag. The key point here is that a reconnection is triggered from a reader/writer thread and only when certain conditions are met. The prerequisite to set RECONNECT and signal (unpark) a connector thread is that the client has the UNINITIALIZED state.

There are several problems here:
- Say, a reader stalls a bit after dropping READING, then a writer drops WRITING and triggers reconnection. Then the reader wakes up and sets RECONNECT again.
- Calling unpark() N times for a connector thread while it is alive does not lead to skipping the next N park() calls, so the problem above is not just about an extra reconnection, it also leads to the connector thread getting stuck.
- Say, a reader stalls just before setting READING. A writer is hit by an IO error and triggers reconnection (sets RECONNECT, unparks the connector). Then the reader wakes up and sets the READING+RECONNECT state, which disallows the connector thread to proceed further (it expects pure RECONNECT). Even when the reader drops READING, it will not wake up (unpark) the connector thread, because RECONNECT was already set (the state is not UNINITIALIZED).

This commit introduces several changes that eliminate the problems above:
- Use ReentrantLock + Condition instead of park() / unpark() to never miss signals to reconnect, no matter whether the connector is parked.
- Ensure the reader and the writer threads of one generation (that are created on the same reconnection iteration) trigger reconnection only once.
- Hold the RECONNECT state most of the time the connector works (while acquiring a new socket, connecting and reading the Tarantool greeting) and prevent setting READING/WRITING while RECONNECT is set.
- Ensure a new reconnection iteration will start only after the old reader and old writer threads exit (because we cannot receive a reconnection signal until we send it).

Fixes: #142
Affects: #34, #136
Force-pushed from bdbaac8 to cc36724.
Pushed in 2415808
Avoid a possible race between reading, writing and reconnecting
threads when a reconnection process is started.
It might happen that a lagged thread (reading or writing) resets the
state to RECONNECT after the reconnecting thread has already started
and set the state to 0. As a result, all subsequent attempts to
reconnect would never happen. Now the reconnecting thread holds the
state as long as it is required.
Avoid another possible race between reading and writing threads when
they are started during the reconnection process.
It might happen that one of the threads crashes while it is starting
and another, slightly lagged, thread sets its flag. This could lead to
the reconnecting thread seeing a RECONNECT + R/W state instead of pure
RECONNECT. Again, this case broke all subsequent reconnection
attempts. Now the reading and writing threads take into account
whether the RECONNECT state is already set.
Avoid using the LockSupport class to suspend and wake up a thread.
LockSupport is more of an internal component for building high-level
blocking primitives, and it is not recommended to use this class
directly. It has been replaced with a ReentrantLock + Condition pair,
which is based on LockSupport but uses it properly inside.
Fixes: #142
Affects: #34, #136
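For illustration, here is a minimal sketch of the rule that reading and writing threads must not set their flags while RECONNECT is held. The flag values and the `ClientState` class are invented for the example; this is not the actual StateHelper implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a CAS-based flag state where READING/WRITING cannot be
// acquired while RECONNECT (or CLOSED) is set.
final class ClientState {
    static final int READING = 1;
    static final int WRITING = 2;
    static final int RECONNECT = 4;
    static final int CLOSED = 8;

    private final AtomicInteger state = new AtomicInteger(0);

    // A reader/writer thread may only add its flag while no reconnection
    // (and no close) is in progress.
    boolean acquire(int flag) {
        while (true) {
            int current = state.get();
            if ((current & (RECONNECT | CLOSED)) != 0) {
                return false; // a lagged thread must not join the new iteration
            }
            if (state.compareAndSet(current, current | flag)) {
                return true;
            }
        }
    }

    void release(int flag) {
        state.updateAndGet(current -> current & ~flag);
    }
}
```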