-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
gh-106581: Split CALL_PY_EXACT_ARGS
into uops
#107760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this, it definitely doesn't look easy. It's sort of a bummer that we need to special-case this much stuff, but I also don't see a nicer way of handling these issues than what you have here.
A few comments and questions, mostly for my own understanding:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add that assert; then I'll review your PR, and hopefully you can then merge that, and I can handle the merge fallout.
Python/bytecodes.c
Outdated
@@ -2955,18 +2954,35 @@ dummy_func( | |||
PyCodeObject *code = (PyCodeObject *)func->func_code; | |||
DEOPT_IF(code->co_argcount != argcount, CALL); | |||
DEOPT_IF(!_PyThreadState_HasStackSpace(tstate, code->co_framesize), CALL); | |||
} | |||
|
|||
op(_INIT_CALL_PY_EXACT_ARGS, (method, callable, args[oparg] -- new_frame: _PyInterpreterFrame*)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this makes a frame, perhaps rename it to _MAKE_FRAME
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured there will be other uops making frames once we try to split CALL_PY_WITH_DEFAULTS, CALL_NO_KW_ALLOC_AND_ENTER_INIT, as well as BINARY_SUBSCR_GETITEM, LOAD_ATTR_PROPERTY, LOAD_ATTR_GETATTRIBUTE_OVERRIDDEN.
Anyway, uop names are easily changed.
SKIP_OVER(INLINE_CACHE_ENTRIES_CALL); | ||
} | ||
|
||
op(_PUSH_FRAME, (new_frame: _PyInterpreterFrame* -- unused)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since _PUSH_FRAME
is just frame->return_offset = 0; DISPATCH_INLINED(new_frame);
, it would make sense to spell out DISPATCH_INLINED
to clarify which operations that need to be different for tier 1 and tier 2.
Something like:
op(_PUSH_FRAME, (new_frame: _PyInterpreterFrame* -- unused)) {
SAVE_FRAME_STATE(); // Equivalent to frame->prev_instr = next_instr - 1; _PyFrame_SetStackPointer(frame, stack_pointer);
frame->return_offset = 0;
new_frame->previous = frame;
frame = cframe.current_frame = new_frame;
CALL_STAT_INC(inlined_py_calls);
if (_Py_EnterRecursivePy(tstate)) {
goto exit_unwind;
}
START_FRAME(); // Equivalent to next_instr = frame->prev_instr + 1; stack_pointer =
stack_pointer = _PyFrame_GetStackPointer(frame);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example: 2b3e6f2
In which case the code generators needs to know to push the temporary stack values to the real stack before SAVE_FRAME_STATE()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to study this more. A problem is that the Tier 1 and Tier 2 versions of _PUSH_FRAME
are so different. I am working on a mechanism to be able to say
#if TIER_ONE
<code for Tier 1>
#else
<code for Tier 2>
#endif
I'm not sure yet what you mean with your last remark about pushing temp stack values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparing carefully the two versions of DISPATCH_INLINED
(adding frame->return_offset = 0
which precedes it in both cases):
In Tier 1:
frame->return_offset = 0;
assert(tstate->interp->eval_frame == NULL);
_PyFrame_SetStackPointer(frame, stack_pointer);
frame->prev_instr = next_instr - 1;
(NEW_FRAME)->previous = frame;
frame = cframe.current_frame = (NEW_FRAME);
CALL_STAT_INC(inlined_py_calls);
goto start_frame;
In Tier 2:
frame->return_offset = 0;
assert(tstate->interp->eval_frame == NULL);
_PyFrame_SetStackPointer(frame, stack_pointer);
frame->prev_instr -= 1;
(NEW_FRAME)->previous = frame;
frame = tstate->cframe->current_frame = (NEW_FRAME);
CALL_STAT_INC(inlined_py_calls);
stack_pointer = _PyFrame_GetStackPointer(frame);
ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;
Diff:
@@ -1,8 +1,9 @@
frame->return_offset = 0;
assert(tstate->interp->eval_frame == NULL);
_PyFrame_SetStackPointer(frame, stack_pointer);
- frame->prev_instr = next_instr - 1;
+ frame->prev_instr -= 1;
(NEW_FRAME)->previous = frame;
- frame = cframe.current_frame = (NEW_FRAME);
+ frame = tstate->cframe->current_frame = (NEW_FRAME);
CALL_STAT_INC(inlined_py_calls);
- goto start_frame;
+ stack_pointer = _PyFrame_GetStackPointer(frame);
+ ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;
Note that the Tier 2 version must be preceded by a SAVE_IP
, which does the equivalent of frame->prev_instr = next_instr
. If we had a Tier 1 version of SAVE_IP
we could include it in the macro definition:
macro(CALL_PY_EXACT_ARGS) =
unused/1 + // Skip over the counter
_CHECK_PEP_523 +
_CHECK_FUNCTION_EXACT_ARGS +
_CHECK_STACK_SPACE +
_INIT_CALL_PY_EXACT_ARGS +
SAVE_IP + // <-------------- added
_PUSH_FRAME;
which would reduce the special-casing in the code generator a bit (it would still need to do something special for SAVE_IP
to ensure that its oparg
has the right value, different from the oparg
of the macro (which is the argument count). This would take care of the first diff chunk (what to assign to frame->prev_inst
), but it would still be pretty fragile. (Like my current version, it would entice the optimizer to incorrectly try to remove the SAVE_IP
uop.)
The second diff chunk relates to how we set cframe.current_frame
-- in Tier 2 we must access this through the tstate
.
The third and final diff chunk relates to really start using the new frame. In Tier 1, this must actually do the following:
- Check recursion
- Load
stack_pointer
- Load
next_instr
- Dispatch to the next opcode.
This is done by the code at start_frame
.
In Tier 2 there is no start_frame
label (the only uop that can go to a label is EXIT_TRACE
, and of course DEOPT_IF
and ERROR_IF
also jump). So we load stack_frame
here. There is no direct equivalent to next_instr
, but we have to set ip_offset
, which SAVE_IP
adds to its oparg
to get the prev_instr
value. (This variable is a cache for frame->code->co_code_adaptive
, to save some memory loads, so whenever frame
changes we must update it.)
(More later.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another thing though, and I think that is what Mark meant. In Tier 1 the code generation for macros is special-cased for _PUSH_FRAME
so that both the stack adjustment and the next_instr
adjustment are emitted before the _PUSH_FRAME
opcode. This is done so that the flushing of these variables to the frame in the DISPATCH_INLINED
macro flush the correct values.
But this is really ugly and unprincipled, and the logic is much hairier than the other special cases for _PUSH_FRAME
. One of Mark's ideas here is to make this special case look for uops using the SAVE_FRAME_STATE
macro rather than for the specific uop _PUSH_FRAME
. But detecting when to trigger the special case is only part of the problem -- IMO the worse problem is that the special case itself is so ugly:
dispatch_inlined_special_case = False
if mgr is managers[-1] and mgr.instr.always_exits.startswith("DISPATCH_INLINED") and mgr.instr.name == "_PUSH_FRAME":
dispatch_inlined_special_case = True
temp = mgr.final_offset.clone()
temp.deeper(StackEffect(UNUSED)) # Hack
out.stack_adjust(temp.deep, temp.high)
# Use clone() since adjust_inverse() mutates final_offset
mgr.adjust_inverse(mgr.final_offset.clone())
if cache_adjust:
out.emit(f"next_instr += {cache_adjust};")
The last 4 lines here, starting with # Use clone()
, occur further down too, for the normal case (after the final uop). I don't even recall why the temp.deeper()
call is needed!
I'll mull this over some more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have addressed this. @markshannon Please have another look. Assuming the tests look okay I'll un-draft this.
Made this back into a draft; I need to (a) wait for Brandt's gh-107788, then (b) redo the split and tooling changes using Mark's ideas. |
The |
This is only the first step for doing `CALL` in Tier 2. The next step involves tracing into the called code object. After that we'll have to do the remaining `CALL` specialization. Finally we'll have to tweak various things like `KW_NAMES`, and possibly move the `NULL` (for method calls) *above* the callable. But those are things for future PRs. Note: this moves setting `frame->return_offset` directly in front of `DISPATCH_INLINED()`, to make it easier to move it into `_PUSH_FRAME`.
Closing and re-opening to retrigger CLA checks. Sorry for the noise. |
Instead, the special case is an opcode using SAVE_FRAME_STATE(). Introducing #if TIER_ONE and #if TIER_TWO so we can implement _PUSH_FRAME differently for both tiers.
Instead, we special-case SAVE_IP: - Its Tier 2 expansion sets oparg to the instruction offset - In Tier 1 it is a no-op (and skipped if present in a macro)
@markshannon I was hoping you'd review this. I added Unless you'd rather review #107925, which includes this (and #107793, which is the intermediate stage). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm uneasy about the introduction of the TIER_ONE
and TIER_TWO
macros.
It is a principle of the overall design that there is a single source of truth for the semantics of bytecodes.
It might appear that I'm being dogmatic, but the need for something like those macros often indicates an underlying problem that should be fixed independently.
In this case the problem is the cframe
. Loading and saving the IP needs to handled specially anyway and saving and loading the SP should be the same for both interpreters (but will need to be handled specially by the copy-and-patch compiler, so should be its own micro-op).
It is pushes the frame that differs. Removing cframe
will fix that.
The cframe
only exists as a performance hack to minimize the impact of tracing prior to PEP 669.
Python/executor.c
Outdated
@@ -30,6 +30,14 @@ | |||
#undef ENABLE_SPECIALIZATION | |||
#define ENABLE_SPECIALIZATION 0 | |||
|
|||
#undef SAVE_FRAME_STATE | |||
#define SAVE_FRAME_STATE() \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than a macro, I think the code generator needs to understand this.
Given that SAVE_FRAME_STATE
is basically SAVE_CURRENT_IP
followed by _PyFrame_SetStackPointer(frame, stack_pointer);
we could convert it to two micro-ops: SAVE_CURRENT_IP
and SAVE_SP
.
In general, we want to avoid macros in the generated C code.
The generated code can be explicit and verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, there are already many macros (and static inline functions) in the generated code. The generator recognizes the presence of SAVE_FRAME_STATE()
, but it doesn't expand it -- the C preprocessor can do that for us more easily. Currently we only do the expansion in the generator for things whose expansion requires information that only the generator has (like the stack adjustment for ERROR_IF
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are proposing that the macro expansion for CALL_PY_EXACT_ARGS
become
macro(CALL_PY_EXACT_ARGS) =
unused/1 + // Skip over the counter
_CHECK_PEP_523 +
_CHECK_FUNCTION_EXACT_ARGS +
_CHECK_STACK_SPACE +
_INIT_CALL_PY_EXACT_ARGS +
SAVE_IP + // Tier 2 only; special-cased oparg
SAVE_CURRENT_IP + // <------------------- Addition
_PUSH_FRAME;
where SAVE_CURRENT_IP
is something like this:
op(SAVE_CURRENT_IP, (--)) {
#if TIER_ONE
frame->prev_instr = next_instr - 1;
#endif
#if TIER_TWO
frame->prev_instr--;
#endif
}
Or we could special-case its expansion in the generator, potayto-potato. But it has to differ between tiers because in Tier 1 it must store next_instr whereas in Tier 2 it must rely on the preceding SAVE_IP to set frame->prev_instr
. (Ideally at some point in the future we won't need the prev_instr--
yet, but that's a tricky change.)
The _PyFrame_SetStackPointer(frame, stack_pointer);
call should be moved back into _PUSH_FRAME
(at the point where I currently call SAVE_FRAME_STATE
).
If I can get this to work I'll apply it and merge the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did get this working (see 05af848), and will test and benchmark it before merging it.
Note that there are still some #if TIER_ONE
and #if TIER_TWO
sections, but they are unavoidable.
Python/ceval_macros.h
Outdated
@@ -103,11 +103,16 @@ | |||
DISPATCH_GOTO(); \ | |||
} | |||
|
|||
#define SAVE_FRAME_STATE() \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below about splitting this into SAVE_CURRENT_IP; SAVE_SP
Benchmark: 1.00x faster: https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20230816-3.13.0a0-05af848 IOW it doesn't slow CALL_PY_EXACT_ARGS down, which is all I care about. |
|
This finishes the work begun in gh-107760. When, while projecting a superblock, we encounter a call to a short, simple function, the superblock will now enter the function using `_PUSH_FRAME`, continue through it, and leave it using `_POP_FRAME`, and then continue through the original code. Multiple frame pushes and pops are even possible. It is also possible to stop appending to the superblock in the middle of a called function, when running out of space or encountering an unsupported bytecode.
This is only the first step for doing
CALL
in Tier 2. The next step involves tracing into the called code object. After that we'll have to do the remainingCALL
specialization. Finally we'll have to tweak various things likeKW_NAMES
, and possibly move theNULL
(for method calls) above the callable (that's 107788). But those are things for future PRs.Note: this moves setting
frame->return_offset
directly in front ofDISPATCH_INLINED()
, to make it easier to move it into_PUSH_FRAME
.CALL
's stack #107788_Py_EnterRecursivePy