# Double locking, deadlocking, GIL

[TOC]

## Introduction

### Overview

In concurrent programming with locks, *deadlocks* can arise when more than one
mutex is locked at the same time, and careful attention has to be paid to lock
ordering to avoid this. Here we will look at a common situation that occurs in
native extensions for CPython written in C++.

### Deadlocks

A deadlock can occur when more than one thread attempts to lock more than one
mutex, and two of the threads lock two of the mutexes in different orders. For
example, consider mutexes `mu1` and `mu2`, and threads T1 and T2, executing:

|    | T1                  | T2                  |
|--- | ------------------- | ------------------- |
| 1  | `mu1.lock()`{.good} | `mu2.lock()`{.good} |
| 2  | `mu2.lock()`{.bad}  | `mu1.lock()`{.bad}  |
| 3  | `/* work */`        | `/* work */`        |
| 4  | `mu2.unlock()`      | `mu1.unlock()`      |
| 5  | `mu1.unlock()`      | `mu2.unlock()`      |

Now if T1 manages to lock `mu1` and T2 manages to lock `mu2` (as indicated in
green), then both threads will block while trying to lock the respective other
mutex (as indicated in red), but they are also unable to release the mutex that
they have locked (step 5).
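
The pattern is easy to reproduce. As a minimal sketch with the standard
library (the thread functions are made up for illustration; `mu1` and `mu2`
are as in the table):

```c++
#include <mutex>
#include <thread>

std::mutex mu1, mu2;

void Thread1() {
  std::lock_guard<std::mutex> lock1(mu1);  // step 1
  std::lock_guard<std::mutex> lock2(mu2);  // step 2: may block forever
  /* work */
}

void Thread2() {
  std::lock_guard<std::mutex> lock1(mu2);  // step 1
  std::lock_guard<std::mutex> lock2(mu1);  // step 2: may block forever
  /* work */
}

int main() {
  std::thread t1(Thread1), t2(Thread2);
  t1.join();
  t2.join();  // with unlucky timing, neither join ever returns
}
```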

**The problem** is that it is possible for one thread to attempt to lock `mu1`
and then `mu2`, and for another thread to attempt to lock `mu2` and then `mu1`.
Note that it does not matter if either mutex is unlocked at any intermediate
point; what matters is only the order of any attempt to *lock* the mutexes. For
example, the following, more complex series of operations is just as prone to
deadlock:

|    | T1                  | T2                  |
|--- | ------------------- | ------------------- |
| 1  | `mu1.lock()`{.good} | `mu1.lock()`{.good} |
| 2  | waiting for T2      | `mu2.lock()`{.good} |
| 3  | waiting for T2      | `/* work */`        |
| 4  | waiting for T2      | `mu1.unlock()`      |
| 5  | `mu2.lock()`{.bad}  | `/* work */`        |
| 6  | `/* work */`        | `mu1.lock()`{.bad}  |
| 7  | `/* work */`        | `/* work */`        |
| 8  | `mu2.unlock()`      | `mu1.unlock()`      |
| 9  | `mu1.unlock()`      | `mu2.unlock()`      |

When the mutexes involved in a locking sequence are known at compile-time, then
avoiding deadlocks is &ldquo;merely&rdquo; a matter of arranging the lock
operations carefully so as to only occur in one single, fixed order. However, it
is also possible for mutexes to only be determined at runtime. A typical example
of this is a database where each row has its own mutex. An operation that
modifies two rows in a single transaction (e.g. &ldquo;transferring an amount
from one account to another&rdquo;) must lock two row mutexes, but the locking
order cannot be established at compile time. In this case, a dynamic
&ldquo;deadlock avoidance algorithm&rdquo; is needed. (In C++, `std::lock`
provides such an algorithm. An algorithm might use a non-blocking `try_lock`
operation on a mutex, which can either succeed or fail to lock the mutex, but
returns without blocking.)
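
In C++17, `std::scoped_lock` wraps this deadlock-avoidance algorithm. As a
minimal sketch (the `Account` struct and `Transfer` function are made up for
illustration), locking two runtime-determined mutexes could look like this:

```c++
#include <mutex>

struct Account {
  std::mutex mu;
  long balance = 0;
};

// Locks both account mutexes via std::scoped_lock, which applies a
// deadlock-avoidance algorithm (as if by std::lock), so the call is safe
// regardless of the order in which callers pass the accounts.
// Assumes `from` and `to` are distinct accounts.
void Transfer(Account& from, Account& to, long amount) {
  std::scoped_lock lock(from.mu, to.mu);
  from.balance -= amount;
  to.balance += amount;
}
```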

Conceptually, one could also consider it a deadlock if _the same_ thread
attempts to lock a mutex that it has already locked (e.g. when some locked
operation accidentally recurses into itself): `mu.lock();`{.good}
`mu.lock();`{.bad} However, this is a slightly separate issue: Typical mutexes
are either of the _recursive_ or the _non-recursive_ kind. A recursive mutex
allows repeated locking and requires balanced unlocking. A non-recursive mutex
can be implemented more efficiently, but for efficiency reasons it does not
actually guarantee a deadlock on a second lock attempt. Instead, the API simply
forbids such use, making it a precondition that the thread not already hold the
mutex, with undefined behaviour on violation.
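
For example, with the standard library types (a minimal sketch; the function
name is made up for illustration):

```c++
#include <mutex>

std::recursive_mutex rec_mu;

// A recursive mutex may be locked again by the thread that already holds it;
// every lock must be balanced by an unlock (done here by the nested
// lock_guard objects as they go out of scope).
void LockedOperation(int depth) {
  std::lock_guard<std::recursive_mutex> lock(rec_mu);
  if (depth > 0) LockedOperation(depth - 1);  // re-locking rec_mu is fine
  // With a plain std::mutex instead, this recursion would violate the
  // precondition that the calling thread does not already hold the mutex,
  // which is undefined behaviour.
}
```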

### &ldquo;Once&rdquo; initialization

A common programming problem is to have an operation happen precisely once, even
if requested concurrently. While it is clear that we need to track in some
shared state somewhere whether the operation has already happened, it is worth
noting that this state only ever transitions, once, from `false` to `true`. This
is considerably simpler than a general shared state that can change values
arbitrarily. Next, we also need a mechanism for all but one thread to block
until the initialization has completed, which we can provide with a mutex. The
simplest solution just always locks the mutex:

```c++
// The "once" mechanism:
constinit absl::Mutex mu(absl::kConstInit);
constinit bool init_done = false;

// The operation of interest:
void f();

void InitOnceNaive() {
  absl::MutexLock lock(&mu);
  if (!init_done) {
    f();
    init_done = true;
  }
}
```

This works, but the efficiency-minded reader will observe that once the
operation has completed, all future lock contention on the mutex is
unnecessary. This leads to the (in)famous &ldquo;double-locking&rdquo;
algorithm, which was historically hard to write correctly. The idea is to check
the boolean *before* locking the mutex, and avoid locking if the operation has
already completed. However, accessing shared state concurrently when at least
one access is a write is prone to causing a data race and needs to be done
according to an appropriate concurrent programming model. In C++ we use atomic
variables:

```c++
// The "once" mechanism:
constinit absl::Mutex mu(absl::kConstInit);
constinit std::atomic<bool> init_done = false;

// The operation of interest:
void f();

void InitOnceWithFastPath() {
  if (!init_done.load(std::memory_order_acquire)) {
    absl::MutexLock lock(&mu);
    if (!init_done.load(std::memory_order_relaxed)) {
      f();
      init_done.store(true, std::memory_order_release);
    }
  }
}
```

Checking the flag now happens without holding the mutex lock, and if the
operation has already completed, we return immediately. After locking the mutex,
we need to check the flag again, since multiple threads can reach this point.

*Atomic details.* Since the atomic flag variable is accessed concurrently, we
have to think about the memory order of the accesses. There are two separate
cases: The first, outer check outside the mutex lock, and the second, inner
check under the lock. The outer check and the flag update form an
acquire/release pair: *if* the load sees the value `true` (which must have been
written by the store operation), then it also sees everything that happened
before the store, namely the operation `f()`. By contrast, the inner check can
use relaxed memory ordering, since in that case the mutex operations provide the
necessary ordering: if the inner load sees the value `true`, it happened after
the `lock()`, which happened after the `unlock()`, which happened after the
store.

The C++ standard library, and Abseil, provide a ready-made solution of this
algorithm called `std::call_once`/`absl::call_once`. (The interface is the same,
but the Abseil implementation is possibly better.)

```c++
// The "once" mechanism:
constinit absl::once_flag init_flag;

// The operation of interest:
void f();

void InitOnceWithCallOnce() {
  absl::call_once(init_flag, f);
}
```

Even though conceptually this is performing the same algorithm, this
implementation has some considerable advantages: The `once_flag` type is a
small, trivial, integer-like type and is trivially destructible. Not only does
it take up less space than a mutex, it also generates less code since it does
not have to run a destructor, which would need to be added to the program's
global destructor list.
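
For completeness, the same pattern using only the standard library (a minimal
sketch; `InitOnceWithStdCallOnce` simply mirrors the Abseil version above):

```c++
#include <mutex>

// The "once" mechanism:
std::once_flag init_flag;

// The operation of interest:
void f();

void InitOnceWithStdCallOnce() {
  // Runs f() exactly once; all other concurrent callers block until it is done.
  std::call_once(init_flag, f);
}
```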

The final clou comes with the C++ semantics of a `static` variable declared at
block scope: According to [[stmt.dcl]](https://eel.is/c++draft/stmt.dcl#3):

> Dynamic initialization of a block variable with static storage duration or
> thread storage duration is performed the first time control passes through its
> declaration; such a variable is considered initialized upon the completion of
> its initialization. [...] If control enters the declaration concurrently while
> the variable is being initialized, the concurrent execution shall wait for
> completion of the initialization.

This is saying that the initialization of a local, `static` variable precisely
has the &ldquo;once&rdquo; semantics that we have been discussing. We can
therefore write the above example as follows:

```c++
// The operation of interest:
void f();

void InitOnceWithStatic() {
  static int unused = (f(), 0);
}
```

This approach is by far the simplest and easiest, but the big difference is that
the mutex (or mutex-like object) in this implementation is no longer visible or
in the user&rsquo;s control. This is perfectly fine if the initializer is
simple, but if the initializer itself attempts to lock any other mutex
(including by initializing another static variable!), then we have no control
over the lock ordering!
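
As a contrived sketch of the hazard (the names `ExpensiveSetup`, `UseWidget`,
and `other_mu` are made up for illustration), consider an initializer that
itself blocks on another lock:

```c++
#include <mutex>

std::mutex other_mu;

int ExpensiveSetup() {
  // Runs while the hidden guard mutex of the static below is held, so the
  // lock order here is: guard mutex first, other_mu second.
  std::lock_guard<std::mutex> lock(other_mu);
  return 42;
}

void UseWidget() {
  // The guard mutex is locked before ExpensiveSetup() is called.
  static int value = ExpensiveSetup();
  (void)value;
}
```

Any other thread that locks `other_mu` and only then calls `UseWidget()` takes
the two mutexes in the opposite order, which is exactly the deadlock-prone
pattern from the introduction.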

Finally, you may have noticed the `constinit`s around the earlier code. Both
`constinit` and `constexpr` specifiers on a declaration mean that the variable
is *constant-initialized*, which means that no initialization is performed at
runtime (the initial value is already known at compile time). This in turn means
that a static variable guard mutex may not be needed, and static initialization
never blocks. The difference between the two is that a `constexpr`-specified
variable is also `const`, and a variable cannot be `constexpr` if it has a
non-trivial destructor. Such a destructor also means that the guard mutex is
needed after all, since the destructor must be registered to run at exit,
conditionally on initialization having happened.
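
A short illustration of the difference (a sketch; `Logger` is a made-up type
with a constexpr constructor but a non-trivial destructor):

```c++
struct Logger {
  constexpr Logger() {}
  ~Logger();  // non-trivial, non-constexpr destructor
};

constexpr int kAnswer = 42;  // constant-initialized, and also const
constinit int counter = 0;   // constant-initialized, but still mutable

// constexpr Logger logger;  // ill-formed: the destructor is not trivial/constexpr
constinit Logger logger;     // OK: the initialization is constant; the
                             // destructor simply runs at program exit

void Bump() { ++counter; }   // fine: constinit does not imply const
// void Wrong() { ++kAnswer; }  // ill-formed: constexpr variables are const
```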

## Python, CPython, GIL

With CPython, a Python program can call into native code. To this end, the
native code registers callback functions with the Python runtime via the CPython
API. In order to ensure that the internal state of the Python runtime remains
consistent, there is a single, shared mutex called the &ldquo;global interpreter
lock&rdquo;, or GIL for short. Upon entry of one of the user-provided callback
functions, the GIL is locked (or &ldquo;held&rdquo;), so that no other mutations
of the Python runtime state can occur until the native callback returns.

Many native extensions do not interact with the Python runtime for at least some
part of their work, and so it is common for native extensions to _release_ the
GIL, do some work, and then reacquire the GIL before returning. Similarly, when
code is generally not holding the GIL but needs to interact with the runtime
briefly, it will first reacquire the GIL. The GIL is reentrant, and constructs
that acquire and subsequently release the GIL are common, and often do not worry
about whether the GIL is already held.
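
The two patterns look roughly like this (a sketch; `ExpensiveComputation` and
`ReportProgress` are placeholders, not part of any real API):

```c++
#include <Python.h>

void ExpensiveComputation();  // placeholder; does not touch the Python runtime

// Called with the GIL held: release it around work that does not touch the
// Python runtime, then reacquire it before returning.
PyObject* DoWork(PyObject* self, PyObject* args) {
  Py_BEGIN_ALLOW_THREADS  // releases the GIL
  ExpensiveComputation();
  Py_END_ALLOW_THREADS    // reacquires the GIL
  Py_RETURN_NONE;
}

// Called from code that does not (necessarily) hold the GIL: briefly acquire
// it to touch the runtime, then restore the previous state.
void ReportProgress(double fraction) {
  PyGILState_STATE s = PyGILState_Ensure();  // acquires the GIL (safe even if
                                             // this thread already holds it)
  PyObject* value = PyFloat_FromDouble(fraction);
  // ... hand `value` to some Python-side callback ...
  Py_XDECREF(value);
  PyGILState_Release(s);
}
```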

If the native code is written in C++ and contains local, `static` variables,
then we are now dealing with at least _two_ mutexes: the static variable guard
mutex, and the GIL from CPython.

A common problem in such code is an operation with &ldquo;only once&rdquo;
semantics that also ends up requiring the GIL to be held at some point. As per
the above description of &ldquo;once&rdquo;-style techniques, one might find a
static variable:

```c++
// CPython callback, assumes that the GIL is held on entry.
PyObject* InvokeWidget(PyObject* self) {
  static PyObject* impl = CreateWidget();
  return PyObject_CallOneArg(impl, self);
}
```

This seems reasonable, but bear in mind that there are two mutexes (the "guard
mutex" and "the GIL"), and we must think about the lock order. Otherwise, if the
callback is called from multiple threads, a deadlock may ensue.

Let us consider what we can see here: On entry, the GIL is already locked, and
we are locking the guard mutex. This is one lock order. Inside the initializer
`CreateWidget`, with both mutexes already locked, the function can freely access
the Python runtime.

However, it is entirely possible that `CreateWidget` will want to release the
GIL at one point and reacquire it later:

```c++
// Assumes that the GIL is held on entry.
// Ensures that the GIL is held on exit.
PyObject* CreateWidget() {
  // ...
  Py_BEGIN_ALLOW_THREADS // releases GIL
  // expensive work, not accessing the Python runtime
  Py_END_ALLOW_THREADS // acquires GIL, #!
  // ...
  return result;
}
```

Now we have a second lock order: the guard mutex is locked, and then the GIL is
locked (at `#!`). To see how this deadlocks, consider two threads T1 and T2 on
which the runtime attempts to call `InvokeWidget`. T1 locks the GIL and
proceeds, locking the guard mutex and calling `CreateWidget`; T2 is blocked
waiting for the GIL. Then T1 releases the GIL to do &ldquo;expensive
work&rdquo;, and T2 awakes and locks the GIL. Now T2 is blocked trying to
acquire the guard mutex, but T1 is blocked reacquiring the GIL (at `#!`).

In other words: if we want to support &ldquo;once-called&rdquo; functions that
can arbitrarily release and reacquire the GIL, as is very common, then the only
lock order that we can ensure is: guard mutex first, GIL second.

To implement this, we must rewrite our code. Naively, we could always release
the GIL before reaching a `static` variable with a blocking initializer:

```c++
// CPython callback, assumes that the GIL is held on entry.
PyObject* InvokeWidget(PyObject* self) {
  Py_BEGIN_ALLOW_THREADS // releases GIL
  static PyObject* impl = CreateWidget();
  Py_END_ALLOW_THREADS // acquires GIL

  return PyObject_CallOneArg(impl, self);
}
```

But similar to the `InitOnceNaive` example above, this code cycles the GIL
(possibly descheduling the thread) even when the static variable has already
been initialized. If we want to avoid this, we need to abandon the use of a
static variable, since we do not control the guard mutex well enough. Instead,
we use an operation whose mutex locking is under our control, such as
`call_once`. For example:

```c++
// CPython callback, assumes that the GIL is held on entry.
PyObject* InvokeWidget(PyObject* self) {
  static constinit PyObject* impl = nullptr;
  static constinit std::atomic<bool> init_done = false;
  static constinit absl::once_flag init_flag;

  if (!init_done.load(std::memory_order_acquire)) {
    Py_BEGIN_ALLOW_THREADS // releases GIL
    absl::call_once(init_flag, [&]() {
      PyGILState_STATE s = PyGILState_Ensure(); // acquires GIL
      impl = CreateWidget();
      PyGILState_Release(s); // releases GIL
      init_done.store(true, std::memory_order_release);
    });
    Py_END_ALLOW_THREADS // acquires GIL
  }

  return PyObject_CallOneArg(impl, self);
}
```

The lock order is now always guard mutex first, GIL second. Unfortunately we
have to duplicate the &ldquo;double-checked done flag&rdquo;, effectively
leading to triple checking, because the flag state inside the `absl::once_flag`
is not accessible to the user. In other words, we cannot ask `init_flag` whether
it has been used yet.

However, we can perform one last, minor optimisation: since we assume that the
GIL is held on entry, and again when the initializing operation returns, the GIL
actually serializes access to our done flag variable, which therefore does not
need to be atomic. (The difference to the previous, atomic code may be small,
depending on the architecture. For example, on x86-64, acquire/release on a bool
is nearly free ([demo](https://godbolt.org/z/P9vYWf4fE)).)

```c++
// CPython callback, assumes that the GIL is held on entry, and indeed anywhere
// directly in this function (i.e. the GIL can be released inside CreateWidget,
// but must be reacquired when that call returns).
PyObject* InvokeWidget(PyObject* self) {
  static constinit PyObject* impl = nullptr;
  static constinit bool init_done = false; // guarded by GIL
  static constinit absl::once_flag init_flag;

  if (!init_done) {
    Py_BEGIN_ALLOW_THREADS // releases GIL
    // (multiple threads may enter here)
    absl::call_once(init_flag, [&]() {
      // (only one thread enters here)
      PyGILState_STATE s = PyGILState_Ensure(); // acquires GIL
      impl = CreateWidget();
      init_done = true; // (GIL is held)
      PyGILState_Release(s); // releases GIL
    });

    Py_END_ALLOW_THREADS // acquires GIL
  }

  return PyObject_CallOneArg(impl, self);
}
```

## Debugging tips

* Build with symbols.
* <kbd>Ctrl</kbd>-<kbd>C</kbd> sends `SIGINT`, <kbd>Ctrl</kbd>-<kbd>\\</kbd>
  sends `SIGQUIT`. Both have their uses.
* Useful `gdb` commands:
  * `py-bt` prints a Python backtrace if you are in a Python frame.
  * `thread apply all bt 10` prints the top-10 frames for each thread. A
    full backtrace can be prohibitively expensive, and the top few frames
    are often good enough.
  * `p PyGILState_Check()` shows whether a thread is holding the GIL. For
    all threads, run `thread apply all p PyGILState_Check()` to find out
    which thread is holding the GIL.
* The `static` variable guard mutex is accessed with functions like
  `__cxa_guard_acquire` (though this depends on ABI details and can vary).
  The guard mutex itself contains information about which thread is
  currently holding it.

## Links

* Article on
  [double-checked locking](https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/)
* [The Deadlock Empire](https://deadlockempire.github.io/), hands-on exercises
  to construct deadlocks
