Skip to content

Crash when generator frame proxies outlive their generator #125723

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ncoghlan opened this issue Oct 19, 2024 · 12 comments
Open

Crash when generator frame proxies outlive their generator #125723

ncoghlan opened this issue Oct 19, 2024 · 12 comments
Labels
3.13 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@ncoghlan
Copy link
Contributor

ncoghlan commented Oct 19, 2024

Crash report

What happened?

Checking some frame proxy behaviour at the interactive prompt, I encountered the following crash:

$ PYTHON_BASIC_REPL=1 python3.13
Python 3.13.0 (main, Oct  8 2024, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> def g():
...     a = 1
...     yield locals(), sys._getframe().f_locals
...
>>> snapshot, live_locals = next(g())
>>> print(snapshot); print(live_locals)
{'a': 1}
{'a': b'print'}
>>> snapshot, live_locals = next(g())
>>> print(snapshot); print(live_locals)
{'a': 1}
{'a': 'input_trans_stack'}
>>> snapshot, live_locals = next(g())
>>> print(snapshot); print(live_locals)
{'a': 1}
Segmentation fault (core dumped)

(Crash was initially seen in the new REPL, the above reproduction in the basic REPL showed it wasn't specific to the new REPL).

Subsequent investigation suggests that the problem relates to the frame proxy outliving the eval loop that created it, as the following script was able to reliably reproduce the crash (as shown below):

import sys
def g():
    a = 1
    yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
    exec("snapshot, live_locals = next(g())", locals=ns)
    print(ns)
$ python3.13 ../_misc/gen_locals_exec.py
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': '_pickle.cpython-313-x86_64-linux-gnu.so'}}
Segmentation fault (core dumped)

Changing the code to explicitly keep the generator object alive:

import sys
def g():
    a = 1
    yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
    gen = g()
    exec("snapshot, live_locals = next(gen)", locals=ns)
    print(ns)

is sufficient to eliminate the crash:

$ python3.13 ../_misc/gen_locals_exec.py
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}

The crash is still eliminated, even when gen is created via exec rather than creating it in the main eval loop:

import sys
def g():
    a = 1
    yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
    exec("gen = g()", locals=ns)
    exec("snapshot, live_locals = next(gen)", locals=ns)
    print(ns)
$ python3.13 ../_misc/gen_locals_exec.py
{'gen': <generator object g at 0x7f8da7d81b40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81e40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81c00>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81b40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81e40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81c00>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81b40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81e40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81c00>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81b40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}

CPython versions tested on:

3.13, CPython main branch

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

No response

Linked PRs

@ncoghlan ncoghlan added the type-crash A hard crash of the interpreter, possibly with a core dump label Oct 19, 2024
@ncoghlan
Copy link
Contributor Author

ncoghlan commented Oct 19, 2024

@gaogaotiantian @markshannon Genuinely strange interaction here when frame proxies for generator objects outlive both their generator and their eval loop. Hard to hit in normal code, but pretty easy to hit at an interactive prompt (assuming you're messing about with locals proxies interactively for some reason).

I assume it relates to something going wrong with the handoff of frame state from the eval loop to the generator object (given the garbage data being emitted, maybe the frame proxy pointer is still referencing the no longer valid loop frame?) (Note: it's literally been years since I looked at that code, so I don't actually know how it currently works. This assumption is only based on a vague memory of how it used to work around 3.10 or so)

@ZeroIntensity ZeroIntensity added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Oct 19, 2024
@gpshead gpshead added the 3.13 bugs and security fixes label Oct 19, 2024
@ncoghlan
Copy link
Contributor Author

ncoghlan commented Oct 20, 2024

Definitely looks to be some kind of refcounting issue when the proxy is the only thing still referencing the underlying frame:

>>> import sys
... def g():
...     a = 1
...     yield sys._getframe()
...
>>> gen = g(); gen_frame = next(gen); gen_locals = gen_frame.f_locals
>>> gen_locals
{'a': 1}
>>> del gen, gen_frame; gen_locals
{}
>>> gen = g(); gen_frame = next(gen); gen_locals = gen_frame.f_locals
>>> gen_locals
{'a': 1}
>>> del gen, gen_frame; gen_locals
{'a': ' '}

The initial ref creation in

((PyFrameLocalsProxyObject*)self)->frame = (PyFrameObject*)Py_NewRef(frame);
uses Py_NewRef, which seems fine.

But this ownership check in genobject.c looks highly suspicious now:

frame->frame_obj->f_frame->owner == FRAME_OWNED_BY_GENERATOR);

I initially thought that when sys._getframe() is called inside the generator, the frame ownership would have been flipped over to FRAME_OWNED_BY_FRAME_OBJECT, but reading the code suggests that isn't the case (see edit below). Further experimentation also suggests this as a plausible source of the problem:

>>> gen = g(); gen_frame = next(gen); gen_locals = gen_frame.f_locals
>>> del gen
>>> gen_frame.f_locals
Segmentation fault (core dumped)

Assuming that analysis is correct, I think at the very least, both sys._getframe and framelocalsproxy_new need to ensure that the frame data is owned by either the C stack or the frame object, since it isn't safe for the generator or the thread object to clear the frame anymore. The take_ownership function can be used to transfer the data management to normal refcounting rules if necessary.

Leaving the data ownership alone for data owned by the C stack is based on take_ownership asserting that the frame isn't still owned by the C stack. However, I'm pretty sure that case isn't being handled correctly either, I just don't understand the data management intricacies well enough to make a concrete suggestion for improving matters.

Edit: I found the line I had missed that makes this work in the general case. Creating a Python frame object inherently grants ownership of the frame data to that linked frame object:

f->f_frame->owner = FRAME_OWNED_BY_FRAME_OBJECT;

So it looks like mere code introspection isn't going to be enough to track this one down, it's going to take experimentation on a local build.

The fact f_code behaves itself in the following example also makes it more likely this is a problem specific to the frame locals proxy rather than relating to a general frame data ownership regression:

>>> gen = g()
>>> gen_frame = next(gen)
>>> gen_frame is gen.gi_frame
True
>>> gen_frame.f_code
<code object g at 0x7f00713cf110, file "<python-input-28>", line 1>
>>> del gen
>>> gen_frame.f_code
<code object g at 0x7f00713cf110, file "<python-input-28>", line 1>
>>> gen_frame.f_locals
{'a': 'n'}

@ncoghlan ncoghlan changed the title Crash when generator frame proxies outlive their eval loop Crash when generator frame proxies outlive their generator Oct 20, 2024
@gaogaotiantian
Copy link
Member

@ncoghlan are you still investigating or you need help on this? I don't have any clue at this point so I'll need to dig into the code base as well. If this is too much for you I can take over. However, if you can figure it out and fix it, I'd appreciate it :)

@Zheaoli
Copy link
Contributor

Zheaoli commented Oct 22, 2024

I can not reproduce it on main branch but I can reproduce it on 3.13 branch. I think I can find more clue about the root cause.

@Zheaoli
Copy link
Contributor

Zheaoli commented Oct 22, 2024

Fun fact, after bisect, I found #119209 may fixed this issue in some circumstance. When I run the test code

import sys
def g():
    a = 1
    yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
    exec("snapshot, live_locals = next(g())", locals=ns)
    print(ns)

e9875ec Would be correct and crash in 73ab83b

BTW, I compile the code in ./configure --enable-optimizations

When I try to compile the CPython ./configure --with-pydebug --with-trace-refs --enable-profiling, the test code will crash in both main branch and 3.13 branch.

I think I may need some time to dive more deeper

@ncoghlan
Copy link
Contributor Author

ncoghlan commented Nov 2, 2024

@gaogaotiantian I'm not currently investigating this one (my initial investigation was mostly to work out how the test suite had missed it, which turned out to be the "frame proxy must outlive its creating evaluation loop" criterion for hitting the crash). It would have been nice if the problem had been obvious merely on reading the code, but alas, the cause (whatever it turns out to be) isn't that simple.

@Zheaoli That's definitely an interesting find. It suggests we might be inadvertently creating a reference cycle somewhere, and that commit happened to change exactly when the cycle gets collected. (The locals proxy isn't supposed to be getting involved in any cycles that aren't explicitly created in the Python code, but there's clearly something unexpected going on, so inadvertently created cycles can't be ruled out in advance)

@ncoghlan
Copy link
Contributor Author

ncoghlan commented Nov 2, 2024

Something that occurred to me is whether it might be something the specialising interpreter is doing with the refcount handling (or the generator frame state management), rather than being inherent in the C code itself.

It's the fact that this works:

    exec("gen = g()", locals=ns)
    exec("snapshot, live_locals = next(gen)", locals=ns)
    print(ns)

while this potentially crashes:

    exec("snapshot, live_locals = next(g())", locals=ns)
    print(ns)

that got me thinking in that direction, as the only substantive difference I can see between the two versions is that in the second form, g() never leaves the eval loop's stack (it is fed directly into the subsequent next() call), while in the first form it does (since it is bound to gen in the first eval loop).

If the eval loop ends up clearing a generator frame state it doesn't actually own when the eval loop terminates in the second scenario, that would explain that state being gone when the frame proxy tries to access it.

@ncoghlan
Copy link
Contributor Author

ncoghlan commented Nov 8, 2024

I'm checking to see if this can still be reproduced now that #125038 has (nominally) been fixed in main (079875e) and 3.13 (bcc7227)

@ncoghlan
Copy link
Contributor Author

ncoghlan commented Nov 8, 2024

Alas, the tip of 3.13 still crashes even with the initial set of #125038 changes merged:

acoghlan@TerminalMist:~/devel/cpython$ ./python ../_misc/gen_locals_crash.py
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': '_statistics.cpython-313-x86_64-linux-gnu.so'}}
Segmentation fault (core dumped)
acoghlan@TerminalMist:~/devel/cpython$ ./python --version
Python 3.13.0+
acoghlan@TerminalMist:~/devel/cpython$ git describe
v3.13.0-249-g9ecaee6a551

Current crasher implementation:

import sys
def g():
    a = 1
    yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
    exec("snapshot, live_locals = next(g())", locals=ns)
    print(ns)

@efimov-mikhail
Copy link
Contributor

Actually, I can reproduce this crash on main branch.
It definitely worth it to understand the reason of crash.

@efimov-mikhail
Copy link
Contributor

efimov-mikhail commented Nov 8, 2024

I've tried some variants of repro.
It seems like I've found a slightly simple version:

import sys

g_f_locals = {}
def g():
    global g_f_locals
    g_f_locals = sys._getframe().f_locals
    a = 0
    yield 0

direct_call = True
if direct_call:
    next(g())
else:
    gen = g(); next(gen)

print(g_f_locals)

If direct_call is True, then there is a crash on main branch.
If direct_call is False there is no crash.

efimov-mikhail added a commit to efimov-mikhail/cpython that referenced this issue Nov 18, 2024
sobolevn pushed a commit that referenced this issue Jan 22, 2025
… generator (#126956)

Co-authored-by: Kirill Podoprigora <[email protected]>
Co-authored-by: Alyssa Coghlan <[email protected]>
@efimov-mikhail
Copy link
Contributor

IMHO, this issue can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.13 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump
Projects
None yet
Development

No branches or pull requests

6 participants