Skip to content

Data race between compare_generic and insert_combined_dict under free-threading #132641

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
vfdev-5 opened this issue Apr 17, 2025 · 4 comments · Fixed by #133787
Closed

Data race between compare_generic and insert_combined_dict under free-threading #132641

vfdev-5 opened this issue Apr 17, 2025 · 4 comments · Fixed by #133787
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-free-threading type-bug An unexpected behavior, bug, or error

Comments

@vfdev-5
Copy link
Contributor

vfdev-5 commented Apr 17, 2025

Bug report

Bug description:

I built main branch and observed the following races under free-threading in cpython 3.14 (Python 3.14.0a7+ experimental free-threading build (heads/main:e42bda94411, Apr 17 2025, 14:08:39) [Clang 18.1.3 (1ubuntu1)])

Race 1
WARNING: ThreadSanitizer: data race (pid=33872)
  Read of size 8 at 0x7fffc22d7ad8 by thread T5 (mutexes: read M0):
    #0 compare_generic /project/cpython/Objects/dictobject.c:1107:13 (python3.14+0x26edbc) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
    #1 do_lookup /project/cpython/Objects/dictobject.c:1013:23 (python3.14+0x26edbc)
    #2 dictkeys_generic_lookup /project/cpython/Objects/dictobject.c:1132:12 (python3.14+0x26edbc)
    #3 _Py_dict_lookup /project/cpython/Objects/dictobject.c:1298:14 (python3.14+0x26edbc)
    #4 _PyDict_GetItemRef_KnownHash_LockHeld /project/cpython/Objects/dictobject.c:2330:21 (python3.14+0x272a14) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
   ...

  Previous atomic write of size 8 at 0x7fffc22d7ad8 by thread T6 (mutexes: read M0):
    #0 _Py_atomic_store_ptr_release /project/cpython/./Include/cpython/pyatomic_gcc.h:565:3 (python3.14+0x284d66) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
    #1 insert_combined_dict /project/cpython/Objects/dictobject.c:1748:9 (python3.14+0x284d66)
    #2 insertdict /project/cpython/Objects/dictobject.c:1854:13 (python3.14+0x274254) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
    #3 _PyDict_SetItem_KnownHash_LockHeld /project/cpython/Objects/dictobject.c:2656:12 (python3.14+0x273be9) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
    #4 _PyDict_SetItem_KnownHash /project/cpython/Objects/dictobject.c:2673:11 (python3.14+0x274635) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
  ...
Race 2
WARNING: ThreadSanitizer: data race (pid=33872)
  Read of size 8 at 0x7fffc22d7ad0 by thread T5 (mutexes: read M0):
    #0 compare_generic /project/cpython/Objects/dictobject.c:1110:13 (python3.14+0x26edd5) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
    #1 do_lookup /project/cpython/Objects/dictobject.c:1013:23 (python3.14+0x26edd5)
    #2 dictkeys_generic_lookup /project/cpython/Objects/dictobject.c:1132:12 (python3.14+0x26edd5)
    #3 _Py_dict_lookup /project/cpython/Objects/dictobject.c:1298:14 (python3.14+0x26edd5)
    #4 _PyDict_GetItemRef_KnownHash_LockHeld /project/cpython/Objects/dictobject.c:2330:21 (python3.14+0x272a14) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
   ...

  Previous atomic write of size 8 at 0x7fffc22d7ad0 by thread T6 (mutexes: read M0):
    #0 insert_combined_dict /project/cpython/Objects/dictobject.c (python3.14+0x284d8c) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
    #1 insertdict /project/cpython/Objects/dictobject.c:1854:13 (python3.14+0x274254) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
    #2 _PyDict_SetItem_KnownHash_LockHeld /project/cpython/Objects/dictobject.c:2656:12 (python3.14+0x273be9) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
    #3 _PyDict_SetItem_KnownHash /project/cpython/Objects/dictobject.c:2673:11 (python3.14+0x274635) (BuildId: d491981d76fd0b67bcb5e01a978d07da019800b8)
    ...

Full report: https://gist.github.com/vfdev-5/cbb9189043737d023b755191b62951cf

cc @hawkinsp

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

Linked PRs

@vfdev-5 vfdev-5 added the type-bug An unexpected behavior, bug, or error label Apr 17, 2025
@picnixz picnixz added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-free-threading labels Apr 17, 2025
@thomashammett
Copy link

I'd like to work on this issue and submit a PR to fix it. Is anyone already working on this, or may I take it? Thank you.

@vstinner
Copy link
Member

I built main branch and observed the following races under free-threading in cpython 3.14

Which code did you run to trigger these data races?

@vfdev-5
Copy link
Contributor Author

vfdev-5 commented Apr 23, 2025

It is from JAX TSAN CI job running cpu tests suite. I can try to get a small reproducer.

@hawkinsp
Copy link
Contributor

hawkinsp commented May 9, 2025

We are still seeing this frequently in our CI.

I'm trying to come up with a reproducer for this and so far haven't succeeded. However, reading the code, I'm curious about the following:

In bounded_lru_cache_get_lock_held we hold a critical section of self, i.e., the lru cache object itself:

Py_BEGIN_CRITICAL_SECTION(self);

We then use that critical section to call _PyDict_GetItemRef_KnownHash_LockHeld with no other locking
int res = _PyDict_GetItemRef_KnownHash_LockHeld((PyDictObject *)self->cache, key_, hash_,
. In other words, the read part of the cache seems to assume that the critical section on the lru_cache object itself protects the dictionary state.

However on the write side of the cache, we start by acquiring a critical section on self as well (

result = bounded_lru_cache_update_lock_held(self, result, key, hash);
). However in some code paths we call _PyDict_SetItem_KnownHash (e.g.,
if (_PyDict_SetItem_KnownHash(self->cache, key, (PyObject *)link,
), which acquires a critical section on the dictionary object (
Py_BEGIN_CRITICAL_SECTION(op);
), which I think we should think of as implicitly releasing the outer lock.

In other words, we are confused about which critical section is protecting the dictionary state. Is it the critical section on the lru cache itself, or it the critical section on the dict it contains?

hawkinsp added a commit to hawkinsp/cpython that referenced this issue May 9, 2025
…ection use.

The bounded_lru_cache code was using a critical section on the lru cache
object to protect dictionary accesses in some code paths, but using the
critical section on the dictionary itself to protect accesses in other
code paths. This led to races since not all threads agreed on which lock
they needed to be holding.

Instead, always use a critical section on the underlying dictionary,
rather than the lru cache object itself.

Fixes python#132641
hawkinsp added a commit to hawkinsp/cpython that referenced this issue May 9, 2025
…ection use.

The bounded_lru_cache code was using a critical section on the lru cache
object to protect dictionary accesses in some code paths, but using the
critical section on the dictionary itself to protect accesses in other
code paths. This led to races since not all threads agreed on which lock
they needed to be holding.

Instead, always use a critical section on the underlying dictionary,
rather than the lru cache object itself.

Fixes python#132641
hawkinsp added a commit to hawkinsp/cpython that referenced this issue May 9, 2025
…ection use.

The bounded_lru_cache code was using a critical section on the lru cache
object to protect dictionary accesses in some code paths, but using the
critical section on the dictionary itself to protect accesses in other
code paths. This led to races since not all threads agreed on which lock
they needed to be holding.

Instead, always use a critical section on the underlying dictionary,
rather than the lru cache object itself.

Fixes python#132641
hawkinsp added a commit to hawkinsp/cpython that referenced this issue May 12, 2025
…ection use.

The bounded_lru_cache code was using a critical section on the lru cache
object to protect dictionary accesses in some code paths, but using the
critical section on the dictionary itself to protect accesses in other
code paths. This led to races since not all threads agreed on which lock
they needed to be holding.

Instead, always use a critical section on the underlying dictionary,
rather than the lru cache object itself.

Fixes python#132641
hawkinsp added a commit to hawkinsp/cpython that referenced this issue May 12, 2025
…ection use.

The bounded_lru_cache code was using a critical section on the lru cache
object to protect dictionary accesses in some code paths, but using the
critical section on the dictionary itself to protect accesses in other
code paths. This led to races since not all threads agreed on which lock
they needed to be holding.

Instead, always use a critical section on the underlying dictionary,
rather than the lru cache object itself.

Fixes python#132641
hawkinsp added a commit to hawkinsp/cpython that referenced this issue May 13, 2025
…ection use.

The bounded_lru_cache code was using a critical section on the lru cache
object to protect dictionary accesses in some code paths, but using the
critical section on the dictionary itself to protect accesses in other
code paths. This led to races since not all threads agreed on which lock
they needed to be holding.

Instead, always use a critical section on the underlying dictionary,
rather than the lru cache object itself.

Fixes python#132641
hawkinsp added a commit to hawkinsp/cpython that referenced this issue May 13, 2025
…ection use.

The bounded_lru_cache code was using a critical section on the lru cache
object to protect dictionary accesses in some code paths, but using the
critical section on the dictionary itself to protect accesses in other
code paths. This led to races since not all threads agreed on which lock
they needed to be holding.

Instead, always use a critical section on the underlying dictionary,
rather than the lru cache object itself.

Fixes python#132641
hawkinsp added a commit to hawkinsp/cpython that referenced this issue May 13, 2025
…ection use.

The bounded_lru_cache code was using a critical section on the lru cache
object to protect dictionary accesses in some code paths, but using the
critical section on the dictionary itself to protect accesses in other
code paths. This led to races since not all threads agreed on which lock
they needed to be holding.

Instead, always use a critical section on the underlying dictionary,
rather than the lru cache object itself.

Fixes python#132641
hawkinsp added a commit to hawkinsp/cpython that referenced this issue May 13, 2025
…ection use.

The bounded_lru_cache code was using a critical section on the lru cache
object to protect dictionary accesses in some code paths, but using the
critical section on the dictionary itself to protect accesses in other
code paths. This led to races since not all threads agreed on which lock
they needed to be holding.

Instead, always use a critical section on the underlying dictionary,
rather than the lru cache object itself.

Fixes python#132641
kumaraditya303 pushed a commit that referenced this issue May 13, 2025
Fix race in `lru_cache` by acquiring critical section on the cache object itself and call the lock held variant of dict functions to modify the underlying dict.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 13, 2025
…GH-133787)

Fix race in `lru_cache` by acquiring critical section on the cache object itself and call the lock held variant of dict functions to modify the underlying dict.
(cherry picked from commit 9ad0c7b)

Co-authored-by: Peter Hawkins <[email protected]>
kumaraditya303 pushed a commit that referenced this issue May 14, 2025
…3787) (#133979)

gh-132641: fix race in `lru_cache` under free-threading (GH-133787)

Fix race in `lru_cache` by acquiring critical section on the cache object itself and call the lock held variant of dict functions to modify the underlying dict.
(cherry picked from commit 9ad0c7b)

Co-authored-by: Peter Hawkins <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-free-threading type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants