From 2fb97e3e1b9e1630b6adfbb9069eca4ce51488d5 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Fri, 22 Nov 2024 20:03:56 +0000 Subject: [PATCH 01/12] gh-119786: add intro to adaptive.md. Update index headings --- InternalDocs/README.md | 6 ++++-- InternalDocs/adaptive.md | 31 +++++++++++++++++++++++++------ InternalDocs/interpreter.md | 2 +- InternalDocs/tier2.md | 3 +++ 4 files changed, 33 insertions(+), 9 deletions(-) create mode 100644 InternalDocs/tier2.md diff --git a/InternalDocs/README.md b/InternalDocs/README.md index 8cdd06d189f362..e085d592b34a08 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -34,9 +34,11 @@ Runtime Objects Program Execution --- -- [The Interpreter](interpreter.md) +- [The Basic Interpreter](interpreter.md) -- [Adaptive Instruction Families](adaptive.md) +- [The Specializing Interpreter](adaptive.md) + +- [The Tier 2 Interpreter (coming soon)](tier2.md) - [Garbage Collector Design](garbage_collector.md) diff --git a/InternalDocs/adaptive.md b/InternalDocs/adaptive.md index 7cfa8e52310460..657d32eca3ee99 100644 --- a/InternalDocs/adaptive.md +++ b/InternalDocs/adaptive.md @@ -1,12 +1,31 @@ +# The specializing interpreter + +The specializing interpreter, which was introduced in +[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by +rewriting the bytecode based on runtime information. This is done by replacing +a generic instruction by a faster version that works for the case that this +program encounters. Each specializable instruction is responsible for rewriting +itself, using its [inline caches](interpreter.md#inline-cache-entries) for +bookkeeping. + +When a [`CodeObject`](code_objects.md) is created, the function +`_PyCode_Quicken()` from [`Python/specialize.c`](../Python/specialize.c) is +called to initialize the caches of all adaptive instructions. When an +adaptive instruction executes, it may attempt to specialize itself, +depending on the argument and the contents of its cache. 
This is done +by calling one of the `_Py_Specialize_XXX` functions in +[`Python/specialize.c`](../Python/specialize.c). + +The specialized instructions are responsible for checking that the special-case +assumptions still apply, and de-optimizing back to the generic version if not. + # Adding or extending a family of adaptive instructions. ## Families of instructions -The core part of [PEP 659](https://peps.python.org/pep-0659/) -(specializing adaptive interpreter) is the families of -instructions that perform the adaptive specialization. - -A family of instructions has the following fundamental properties: +A *family* of instructions consists of an adaptive instruction along with the +specialized instruction that it can be replaced by. +It has the following fundamental properties: * It corresponds to a single instruction in the code generated by the bytecode compiler. @@ -139,7 +158,7 @@ to eliminate the branches. ### Maintaining stats -Finally, take care that stats are gather correctly. +Finally, take care that stats are gathered correctly. After the last `DEOPT_IF` has passed, a hit should be recorded with `STAT_INC(BASE_INSTRUCTION, hit)`. After an optimization has been deferred in the adaptive instruction, diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md index ab149e43471072..b2d036870b0581 100644 --- a/InternalDocs/interpreter.md +++ b/InternalDocs/interpreter.md @@ -135,7 +135,7 @@ Although they are represented by code units, cache entries do not conform to the `opcode` / `oparg` format. If an instruction has an inline cache, the layout of its cache is described by -a `struct` definition in (`pycore_code.h`)[../Include/internal/pycore_code.h]. +a `struct` definition in [`pycore_code.h`](../Include/internal/pycore_code.h). This allows us to access the cache by casting `next_instr` to a pointer to this `struct`. The size of such a `struct` must be independent of the machine architecture, word size and alignment requirements. 
For a 32-bit field, the `struct` should use `_Py_CODEUNIT field[2]`. diff --git a/InternalDocs/tier2.md b/InternalDocs/tier2.md new file mode 100644 index 00000000000000..5799ec0280f151 --- /dev/null +++ b/InternalDocs/tier2.md @@ -0,0 +1,3 @@ +# The Tier 2 Interpreter + +Coming soon. From 24148714e28c8d1750f56ba4faa4f045e2755e84 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Fri, 22 Nov 2024 23:14:30 +0000 Subject: [PATCH 02/12] add content to tier2 --- InternalDocs/README.md | 2 +- InternalDocs/tier2.md | 66 +++++++++++++++++++++++++++++++++++++++++- 2 files changed, 66 insertions(+), 2 deletions(-) diff --git a/InternalDocs/README.md b/InternalDocs/README.md index e085d592b34a08..b4e2089c820b69 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -38,7 +38,7 @@ Program Execution - [The Specializing Interpreter](adaptive.md) -- [The Tier 2 Interpreter (coming soon)](tier2.md) +- [The Tier 2 Interpreter](tier2.md) - [Garbage Collector Design](garbage_collector.md) diff --git a/InternalDocs/tier2.md b/InternalDocs/tier2.md index 5799ec0280f151..255ece3e1bddc8 100644 --- a/InternalDocs/tier2.md +++ b/InternalDocs/tier2.md @@ -1,3 +1,67 @@ # The Tier 2 Interpreter -Coming soon. +The [basic interpreter](interpreter.md), also referred to as the `tier 1` +interpreter, consists of a main loop that executes the bytecode instructions +generated by the [bytecode compiler](compiler.md) and their +[specializations](adaptive.md). Runtime optimization in tier 1 can only be +done for one instruction at a time. The `tier 2` interpreter is based on a +mechanism to replace an entire sequence of bytecode instructions, and this +enables optimizations that span multiple instructions. 
+ +## The Optimizer and Executors + +The program begins running in tier 1, until a `JUMP_BACKWARD` instruction +determines that it is `hot` because the counter in its +[inline cache](interpreter.md#inline-cache-entries) indicates that is +executed more than some threshold number of times. It then calls the +function `_PyOptimizer_Optimize()` in +[`Python/optimizer.c`](../Python/optimizer.c), passing it the current +[frame](frames.md) and instruction pointer. `_PyOptimizer_Optimize()` +constructs an object of type +[`_PyExecutorObject`](Include/internal/pycore_optimizer.h) which implements +an optimized version of the instruction trace beginning at this jump. + +The optimizer determines where the trace ends, and the executor is set up +to either return to `tier 1` and resume execution, or transfer control +to another executor (see `_PyExitData` in Include/internal/pycore_optimizer.h). + +The executor is stored on the [`code object`](code_objects.md) of the frame, +in the `co_executors` field which is an array of executors. The start +instruction of the trace (the `JUMP_BACKWARD`) is replaced by an +`ENTER_EXECUTOR` instruction whose `oparg` is equal to the index of the +executor in `co_executors`. + +## The uop optimizer + +The optimizer that `_PyOptimizer_Optimize()` runs is configurable +via the `_Py_SetTier2Optimizer()` function (this is used in test +via `_testinternalcapi.set_optimizer()`.) + +The tier 2 optimizer, `_PyUOpOptimizer_Type`, is defined in +[`Python/optimizer.c`](../Python/optimizer.c). It translates +an instruction trace into a sequence of micro-ops by replacing +each bytecode by an equivalent sequence of micro-ops +(see `_PyOpcode_macro_expansion` in +[pycore_opcode_metadata.h](../Include/internal/pycore_opcode_metadata.h) +which is generated from [`Python/bytecodes.c`](../Python/bytecodes.c)). +The micro-op sequence is then optimized by +`_Py_uop_analyze_and_optimize` in +[`Python/optimizer_analysis.c`](../Python/optimizer_analysis.c). 
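The expansion step described above can be pictured with a small Python model. The table and opcode names below are invented for illustration; the real mapping is the generated `_PyOpcode_macro_expansion` table:

```python
# Illustrative model of translating a bytecode trace into micro-ops.
# The expansion table here is hypothetical; CPython's real table is
# _PyOpcode_macro_expansion, generated from Python/bytecodes.c.
MACRO_EXPANSION = {
    "LOAD_ATTR_SLOT": ["_GUARD_TYPE_VERSION", "_LOAD_ATTR_SLOT"],
    "STORE_FAST": ["_STORE_FAST"],
}

def expand_trace(trace):
    """Replace each bytecode by its equivalent sequence of micro-ops."""
    uops = []
    for opcode in trace:
        # An instruction without an expansion entry cannot be traced;
        # the real optimizer would end the trace there instead.
        uops.extend(MACRO_EXPANSION[opcode])
    return uops
```

A guard micro-op preceding the operation micro-op, as in the `LOAD_ATTR_SLOT` entry above, is what later lets the optimizer remove redundant checks across instruction boundaries.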
+ +## Running a uop executor + +After a tier 1 `JUMP_BACKWARD` instruction invokes the uop optimizer +to create a tier 2 uop executor, it transfers control to this executor +via the `GOTO_TIER_TWO` macro, which jumps to `tier2_dispatch:` in +[`Python/ceval.c`](../Python/ceval.c), where there is a loops that +executes the micro-ops which are defined in +[`Python/executor_cases.c.h`](../Python/executor_cases.c.h). +This loop exits when an `_EXIT_TRACE` or `_DEOPT` uop is reached. + +## Invalidating Executors + +In addition to being stored on the code object, each executor is also +inserted into a list of all executors which is stored in the interpreter +state's `executor_list_head` field. This list is used when it is necessary +to invalidate executors because values that their construction depended +on may have changed. From e040267dfd3a2312305c60980a7fa96c9c3c87f1 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Sat, 23 Nov 2024 16:00:48 +0000 Subject: [PATCH 03/12] review comments --- InternalDocs/tier2.md | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/InternalDocs/tier2.md b/InternalDocs/tier2.md index 255ece3e1bddc8..52ad6b39ec6db5 100644 --- a/InternalDocs/tier2.md +++ b/InternalDocs/tier2.md @@ -13,8 +13,9 @@ enables optimizations that span multiple instructions. The program begins running in tier 1, until a `JUMP_BACKWARD` instruction determines that it is `hot` because the counter in its [inline cache](interpreter.md#inline-cache-entries) indicates that is -executed more than some threshold number of times. It then calls the -function `_PyOptimizer_Optimize()` in +executed more than some threshold number of times (see +[`backoff_counter_triggers`](../Include/internal/pycore_backoff.h)). +It then calls the function `_PyOptimizer_Optimize()` in [`Python/optimizer.c`](../Python/optimizer.c), passing it the current [frame](frames.md) and instruction pointer. 
`_PyOptimizer_Optimize()`
constructs an object of type
[`_PyExecutorObject`](Include/internal/pycore_optimizer.h) which implements
an optimized version of the instruction trace beginning at this jump.
@@ -46,17 +47,23 @@ each bytecode by an equivalent sequence of micro-ops
 which is generated from [`Python/bytecodes.c`](../Python/bytecodes.c)).
 The micro-op sequence is then optimized by
 `_Py_uop_analyze_and_optimize` in
-[`Python/optimizer_analysis.c`](../Python/optimizer_analysis.c).
+[`Python/optimizer_analysis.c`](../Python/optimizer_analysis.c)
+and a `_PyUOpExecutor_Type` is created to contain it.

-## Running a uop executor
+## Running a uop executor on the tier 2 interpreter

 After a tier 1 `JUMP_BACKWARD` instruction invokes the uop optimizer
 to create a tier 2 uop executor, it transfers control to this executor
-via the `GOTO_TIER_TWO` macro, which jumps to `tier2_dispatch:` in
-[`Python/ceval.c`](../Python/ceval.c), where there is a loops that
+via the `GOTO_TIER_TWO` macro.
+
+When tier 2 is enabled but the JIT is not (python was configured with
+[`--enable-experimental-jit=interpreter`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)),
+the executor jumps to `tier2_dispatch:` in
+[`Python/ceval.c`](../Python/ceval.c), where there is a loop that
 executes the micro-ops which are defined in
 [`Python/executor_cases.c.h`](../Python/executor_cases.c.h).
-This loop exits when an `_EXIT_TRACE` or `_DEOPT` uop is reached.
+This loop exits when an `_EXIT_TRACE` or `_DEOPT` uop is reached,
+and execution returns to the tier 1 interpreter.
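The shape of that dispatch loop can be sketched in a few lines of Python. This is a deliberate simplification: the real loop in `ceval.c` dispatches on uop opcodes and also manages the frame and evaluation stack:

```python
# Simplified model of the tier 2 dispatch loop: execute micro-ops until
# an exit uop hands control back to the tier 1 interpreter.
def run_trace(uops, execute):
    for uop in uops:
        if uop in ("_EXIT_TRACE", "_DEOPT"):
            return uop  # the reason control returns to tier 1
        execute(uop)  # run the micro-op's body (elided in this sketch)
    raise AssertionError("a trace must end with an exit micro-op")

executed = []
reason = run_trace(["_LOAD_FAST", "_STORE_FAST", "_EXIT_TRACE"], executed.append)
```

In this model, as in the real loop, a well-formed trace always terminates through an exit micro-op rather than by falling off the end.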
## Invalidating Executors

In addition to being stored on the code object, each executor is also
inserted into a list of all executors which is stored in the interpreter
state's `executor_list_head` field. This list is used when it is necessary
to invalidate executors because values that their construction depended
on may have changed. From e5802d7a55a4aac6a1a2813112c93836377fabd0 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Thu, 28 Nov 2024 02:05:12 +0000 Subject: [PATCH 04/12] add jit --- InternalDocs/README.md |  2 +- InternalDocs/tier2.md  | 54 ++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 53 insertions(+), 3 deletions(-) diff --git a/InternalDocs/README.md b/InternalDocs/README.md index b4e2089c820b69..c3877f705caa18 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -38,7 +38,7 @@ Program Execution
- [The Specializing Interpreter](adaptive.md)

-- [The Tier 2 Interpreter](tier2.md)
+- [The Tier 2 Interpreter and JIT](tier2.md)

- [Garbage Collector Design](garbage_collector.md)

diff --git a/InternalDocs/tier2.md b/InternalDocs/tier2.md index 52ad6b39ec6db5..f80b0b5a344e85 100644 --- a/InternalDocs/tier2.md +++ b/InternalDocs/tier2.md @@ -60,8 +60,12 @@ When tier 2 is enabled but the JIT is not (python was configured with
 [`--enable-experimental-jit=interpreter`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)),
 the executor jumps to `tier2_dispatch:` in
 [`Python/ceval.c`](../Python/ceval.c), where there is a loop that
-executes the micro-ops which are defined in
-[`Python/executor_cases.c.h`](../Python/executor_cases.c.h).
+executes the micro-ops. The micro-ops are defined in
+[`Python/executor_cases.c.h`](../Python/executor_cases.c.h),
+which is generated by the build script
+[`Tools/cases_generator/tier2_generator.py`](../Tools/cases_generator/tier2_generator.py)
+from the bytecode definitions in
+[`Python/bytecodes.c`](../Python/bytecodes.c).
 This loop exits when an `_EXIT_TRACE` or `_DEOPT` uop is reached,
 and execution returns to the tier 1 interpreter.
@@ -72,3 +76,49 @@ inserted into a list of all executors which is stored in the interpreter
 state's `executor_list_head` field.
This list is used when it is necessary
to invalidate executors because values that their construction depended
on may have changed.
+
+## The JIT
+
+When the jit is enabled (python was configured with
+[`--enable-experimental-jit`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)),
+the uop executor's `jit_code` field is populated with a pointer to a compiled
+C function that implements the executor logic. This function's signature is
+defined by `jit_func` in [`pycore_jit.h`](../Include/internal/pycore_jit.h).
+When the executor is invoked by `ENTER_EXECUTOR`, instead of jumping to
+the uop interpreter at `tier2_dispatch`, the executor runs the function
+that `jit_code` points to. This function returns the instruction pointer
+of the next Tier 1 instruction that needs to execute.
+
+The generation of the jitted functions uses the copy-and-patch technique
+which is described in
+[Haoran Xu's article](https://sillycross.github.io/2023/05/12/2023-05-12/).
+At its core are statically generated `stencils` for the implementation
+of the micro-ops, which are completed with runtime information while
+the jitted code is constructed for an executor by
+[`_PyJIT_Compile`](../Python/jit.c).
+
+The stencils are generated under the build target `regen-jit` by the scripts
+in [`/Tools/jit`](/Tools/jit). These scripts read
+[`Python/executor_cases.c.h`](../Python/executor_cases.c.h) (which is
+generated from [`Python/bytecodes.c`](../Python/bytecodes.c)). For
+each opcode, they construct a `.c` file that contains a function for
+implementing this opcode, with some runtime information injected.
+This is done by replacing `CASE` by the bytecode definition in the
+template file [`Tools/jit/template.c`](../Tools/jit/template.c).
+
+Each of the `.c` files is compiled by LLVM, to produce an object file
+that contains a function that executes the opcode.
These compiled +functions are used to generate the file +[`jit_stencils.h`](../jit_stencils.h), which contains the functions +that the JIT can use to emit code for each of the bytecodes. + +For Python maintainers this means that changes to the bytecodes and +their implementations do not require changes related to the JIT, +because everything the JIT needs is automatically generated from +[`Python/bytecodes.c`](../Python/bytecodes.c) at build time. + +See Also: + +* [Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode](https://arxiv.org/abs/2011.13127) + +* [PyCon 2024: Building a JIT compiler for CPython](https://www.youtube.com/watch?v=kMO3Ju0QCDo) From 0c96eb2511339355687f3bb85a6466461118d753 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Thu, 5 Dec 2024 17:11:23 +0000 Subject: [PATCH 05/12] merge adaptive.md into interpreter.md --- InternalDocs/README.md | 4 +- InternalDocs/code_objects.md | 5 + InternalDocs/compiler.md | 10 -- InternalDocs/interpreter.md | 198 ++++++++++++++++++++++++++++++----- InternalDocs/tier2.md | 8 +- 5 files changed, 184 insertions(+), 41 deletions(-) diff --git a/InternalDocs/README.md b/InternalDocs/README.md index c3877f705caa18..b777d602c3d6bb 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -34,9 +34,7 @@ Runtime Objects Program Execution --- -- [The Basic Interpreter](interpreter.md) - -- [The Specializing Interpreter](adaptive.md) +- [The Bytecode Interpreter](interpreter.md) - [The Tier 2 Interpreter and JIT](tier2.md) diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md index d4e28c6b238b48..ba7cab210fc087 100644 --- a/InternalDocs/code_objects.md +++ b/InternalDocs/code_objects.md @@ -18,6 +18,11 @@ Code objects are typically produced by the bytecode [compiler](compiler.md), although they are often written to disk by one process and read back in by another. 
The disk version of a code object is serialized using the [marshal](https://docs.python.org/dev/library/marshal.html) protocol. +When a [`CodeObject`](code_objects.md) is created, the function +`_PyCode_Quicken()` from [`Python/specialize.c`](../Python/specialize.c) is +called to initialize the caches of all adaptive instructions. This is +required because the on-disk format is a sequence of bytes, and +some of the caches need to be initialized with 16-bit values. Code objects are nominally immutable. Some fields (including `co_code_adaptive` and fields for runtime diff --git a/InternalDocs/compiler.md b/InternalDocs/compiler.md index 9e99f348acbd8f..c257bfd9faf78f 100644 --- a/InternalDocs/compiler.md +++ b/InternalDocs/compiler.md @@ -595,16 +595,6 @@ Objects * [Exception Handling](exception_handling.md): Describes the exception table -Specializing Adaptive Interpreter -================================= - -Adding a specializing, adaptive interpreter to CPython will bring significant -performance improvements. These documents provide more information: - -* [PEP 659: Specializing Adaptive Interpreter](https://peps.python.org/pep-0659/). -* [Adding or extending a family of adaptive instructions](adaptive.md) - - References ========== diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md index b2d036870b0581..ab105a72713f52 100644 --- a/InternalDocs/interpreter.md +++ b/InternalDocs/interpreter.md @@ -1,8 +1,4 @@ -The bytecode interpreter -======================== - -Overview --------- +# The bytecode interpreter This document describes the workings and implementation of the bytecode interpreter, the part of python that executes compiled Python code. Its @@ -47,8 +43,7 @@ simply calls [`_PyEval_EvalFrameDefault()`] to execute the frame. However, as pe `_PyEval_EvalFrameDefault()`. -Instruction decoding --------------------- +## Instruction decoding The first task of the interpreter is to decode the bytecode instructions. 
Bytecode is stored as an array of 16-bit code units (`_Py_CODEUNIT`). @@ -110,8 +105,7 @@ snippet decode a complete instruction: For various reasons we'll get to later (mostly efficiency, given that `EXTENDED_ARG` is rare) the actual code is different. -Jumps -===== +## Jumps Note that when the `switch` statement is reached, `next_instr` (the "instruction offset") already points to the next instruction. @@ -120,15 +114,14 @@ Thus, jump instructions can be implemented by manipulating `next_instr`: - A jump forward (`JUMP_FORWARD`) sets `next_instr += oparg`. - A jump backward sets `next_instr -= oparg`. -Inline cache entries -==================== +## Inline cache entries Some (specialized or specializable) instructions have an associated "inline cache". The inline cache consists of one or more two-byte entries included in the bytecode array as additional words following the `opcode`/`oparg` pair. The size of the inline cache for a particular instruction is fixed by its `opcode`. Moreover, the inline cache size for all instructions in a -[family of specialized/specializable instructions](adaptive.md) +[family of specialized/specializable instructions](#Specialization) (for example, `LOAD_ATTR`, `LOAD_ATTR_SLOT`, `LOAD_ATTR_MODULE`) must all be the same. Cache entries are reserved by the compiler and initialized with zeros. Although they are represented by code units, cache entries do not conform to the @@ -153,8 +146,7 @@ Serializing non-zero cache entries would present a problem because the serializa More information about the use of inline caches can be found in [PEP 659](https://peps.python.org/pep-0659/#ancillary-data). -The evaluation stack --------------------- +## The evaluation stack Most instructions read or write some data in the form of object references (`PyObject *`). 
The CPython bytecode interpreter is a stack machine, meaning that its instructions operate @@ -193,16 +185,14 @@ For example, the following sequence is illegal, because it keeps pushing items o > Do not confuse the evaluation stack with the call stack, which is used to implement calling > and returning from functions. -Error handling --------------- +## Error handling When the implementation of an opcode raises an exception, it jumps to the `exception_unwind` label in [Python/ceval.c](../Python/ceval.c). The exception is then handled as described in the [`exception handling documentation`](exception_handling.md#handling-exceptions). -Python-to-Python calls ----------------------- +## Python-to-Python calls The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes the interpreter calls some C function that calls back into the interpreter. @@ -227,8 +217,7 @@ returns from `_PyEval_EvalFrameDefault()` altogether, to a C caller. A similar check is performed when an unhandled exception occurs. -The call stack --------------- +## The call stack Up through 3.10, the call stack was implemented as a singly-linked list of [frame objects](frames.md). This was expensive because each call would require a @@ -262,8 +251,7 @@ See also the [generators](generators.md) section. -Introducing a new bytecode instruction --------------------------------------- +## Introducing a new bytecode instruction It is occasionally necessary to add a new opcode in order to implement a new feature or change the way that existing features are compiled. @@ -355,6 +342,169 @@ new bytecode properly. Run `make regen-importlib` for updating the bytecode of frozen importlib files. You have to run `make` again after this to recompile the generated C files. +## Specialization + +Bytecode specialization, which was introduced in +[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by +rewriting instructions based on runtime information. 
This is done by replacing
+a generic instruction with a faster version that works for the case that this
+program encounters. Each specializable instruction is responsible for rewriting
+itself, using its [inline caches](#inline-cache-entries) for
+bookkeeping.
+
+When an adaptive instruction executes, it may attempt to specialize itself,
+depending on the argument and the contents of its cache. This is done
+by calling one of the `_Py_Specialize_XXX` functions in
+[`Python/specialize.c`](../Python/specialize.c).
+
+The specialized instructions are responsible for checking that the special-case
+assumptions still apply, and de-optimizing back to the generic version if not.
+
+## Families of instructions
+
+A *family* of instructions consists of an adaptive instruction along with the
+specialized instructions that it can be replaced by.
+It has the following fundamental properties:
+
+* It corresponds to a single instruction in the code
+  generated by the bytecode compiler.
+* It has a single adaptive instruction that records an execution count and,
+  at regular intervals, attempts to specialize itself. If not specializing,
+  it executes the base implementation.
+* It has at least one specialized form of the instruction that is tailored
+  for a particular value or set of values at runtime.
+* All members of the family must have the same number of inline cache entries,
+  to ensure correct execution.
+  Individual family members do not need to use all of the entries,
+  but must skip over any unused entries when executing.
+
+The current implementation also requires the following,
+although these are not fundamental and may change:
+
+* All families use one or more inline cache entries,
+  the first entry is always the counter.
+* All instruction names should start with the name of the adaptive
+  instruction.
+* Specialized forms should have names describing their specialization.
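The counter-driven behaviour described in the properties above can be modelled with a toy Python class. The countdown threshold, the instruction name, and the int-only fast path are all invented for this sketch; the real counters use the backoff scheme defined in `pycore_backoff.h`:

```python
# Toy model of an adaptive instruction: count executions, then try to
# specialize for the operand types actually seen at runtime.
class AdaptiveBinaryAdd:
    THRESHOLD = 3  # hypothetical; real counters use exponential backoff

    def __init__(self):
        self.counter = self.THRESHOLD
        self.specialized = None  # plays the role of the rewritten opcode

    def execute(self, left, right):
        if self.specialized is not None:
            return self.specialized(left, right)  # specialized fast path
        self.counter -= 1
        if self.counter == 0 and type(left) is int and type(right) is int:
            # Specialize for the int case this program actually encounters.
            self.specialized = int.__add__
        return left + right  # generic base implementation
```

A real specialized instruction would also carry guards so it can de-optimize when a non-int operand shows up; that part is omitted here.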
+
+## Example family
+
+The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
+already has an adaptive family that serves as a relatively simple example.
+
+The `LOAD_GLOBAL` instruction performs adaptive specialization,
+calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
+
+There are two specialized instructions in the family, `LOAD_GLOBAL_MODULE`
+which is specialized for global variables in the module, and
+`LOAD_GLOBAL_BUILTIN` which is specialized for builtin variables.
+
+## Performance analysis
+
+The benefit of a specialization can be assessed with the following formula:
+`Tbase/Tadaptive`.
+
+Where `Tbase` is the mean time to execute the base instruction,
+and `Tadaptive` is the mean time to execute the specialized and adaptive forms.
+
+`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
+
+`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
+the number of times that instruction is executed.
+`Tmiss` is the time to process a miss, including de-optimization
+and the time to execute the base instruction.
+
+The ideal situation is where misses are rare and the specialized
+forms are much faster than the base instruction.
+`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`.
+In which case we have `Tadaptive ≈ sum(Ti*Ni)/sum(Ni)`.
+Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
+`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
+we would expect the specialization of `LOAD_GLOBAL` to be profitable.
+
+## Design considerations
+
+While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
+`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
+low for all specialized instructions and `Nmiss` as low as possible.
+
+Keeping `Nmiss` low means that there should be specializations for almost
+all values seen by the base instruction.
Keeping `sum(Ti*Ni)` low means
+keeping `Ti` low which means minimizing branches and dependent memory
+accesses (pointer chasing). These two objectives may be in conflict,
+requiring judgement and experimentation to design the family of instructions.
+
+The size of the inline cache should be as small as possible,
+without impairing performance, to reduce the number of
+`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
+
+### Gathering data
+
+Before choosing how to specialize an instruction, it is important to gather
+some data. What are the patterns of usage of the base instruction?
+Data can best be gathered by instrumenting the interpreter. Since a
+specialization function and adaptive instruction are going to be required,
+instrumentation can most easily be added in the specialization function.
+
+### Choice of specializations
+
+The performance of the specializing adaptive interpreter relies on the
+quality of specialization and keeping the overhead of specialization low.
+
+Specialized instructions must be fast. In order to be fast,
+specialized instructions should be tailored for a particular
+set of values that allows them to:
+
+1. Verify that the incoming value is part of that set with low overhead.
+2. Perform the operation quickly.
+
+This requires that the set of values is chosen such that membership can be
+tested quickly and that membership is sufficient to allow the operation to
+be performed quickly.
+
+For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
+dictionaries that have keys with the expected version.
+
+This can be tested quickly:
+
+* `globals->keys->dk_version == expected_version`
+
+and the operation can be performed quickly:
+
+* `value = entries[cache->index].me_value;`.
+
+Because it is impossible to measure the performance of an instruction without
+also measuring unrelated factors, the assessment of the quality of a
+specialization will require some judgement.
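The guard-then-operate pattern in the two steps above can be modelled in Python. The attribute names echo the C expressions quoted above (`dk_version`, a cached index), but the objects and function are stand-ins invented for this sketch:

```python
# Sketch of the LOAD_GLOBAL_MODULE pattern: guard on a version tag
# recorded at specialization time, then perform a single indexed load.
class Keys:
    def __init__(self, dk_version, entries):
        self.dk_version = dk_version
        self.entries = entries

def load_global_module(keys, cache):
    # Guard: one cheap comparison verifies the specialization still applies.
    if keys.dk_version != cache["version"]:
        return ("deopt", None)  # fall back to the generic instruction
    # Operation: a single dependent memory access, no further branches.
    return ("hit", keys.entries[cache["index"]])

keys = Keys(dk_version=42, entries=["print", "len"])
cache = {"version": 42, "index": 1}
```

The point of the model is the division of labour: membership in the specialized set is established by one comparison, after which the operation needs no further checks.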
+ +As a general rule, specialized instructions should be much faster than the +base instruction. + +### Implementation of specialized instructions + +In general, specialized instructions should be implemented in two parts: + +1. A sequence of guards, each of the form + `DEOPT_IF(guard-condition-is-false, BASE_NAME)`. +2. The operation, which should ideally have no branches and + a minimum number of dependent memory accesses. + +In practice, the parts may overlap, as data required for guards +can be re-used in the operation. + +If there are branches in the operation, then consider further specialization +to eliminate the branches. + +### Maintaining stats + +Finally, take care that stats are gathered correctly. +After the last `DEOPT_IF` has passed, a hit should be recorded with +`STAT_INC(BASE_INSTRUCTION, hit)`. +After an optimization has been deferred in the adaptive instruction, +that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`. + + Additional resources -------------------- diff --git a/InternalDocs/tier2.md b/InternalDocs/tier2.md index f80b0b5a344e85..fb42b91ddc067c 100644 --- a/InternalDocs/tier2.md +++ b/InternalDocs/tier2.md @@ -3,10 +3,10 @@ The [basic interpreter](interpreter.md), also referred to as the `tier 1` interpreter, consists of a main loop that executes the bytecode instructions generated by the [bytecode compiler](compiler.md) and their -[specializations](adaptive.md). Runtime optimization in tier 1 can only be -done for one instruction at a time. The `tier 2` interpreter is based on a -mechanism to replace an entire sequence of bytecode instructions, and this -enables optimizations that span multiple instructions. +[specializations](interpreter.md#Specialization). Runtime optimization in tier 1 +can only be done for one instruction at a time. The `tier 2` interpreter is +based on a mechanism to replace an entire sequence of bytecode instructions, +and this enables optimizations that span multiple instructions. 
## The Optimizer and Executors From 43aa0dfd62f7da5b402f942e483e007fe9cddf6c Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Thu, 5 Dec 2024 17:22:59 +0000 Subject: [PATCH 06/12] cache layouts --- InternalDocs/interpreter.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md index ab105a72713f52..de24e78fcaa307 100644 --- a/InternalDocs/interpreter.md +++ b/InternalDocs/interpreter.md @@ -127,11 +127,13 @@ the same. Cache entries are reserved by the compiler and initialized with zeros Although they are represented by code units, cache entries do not conform to the `opcode` / `oparg` format. -If an instruction has an inline cache, the layout of its cache is described by -a `struct` definition in [`pycore_code.h`](../Include/internal/pycore_code.h). -This allows us to access the cache by casting `next_instr` to a pointer to this `struct`. -The size of such a `struct` must be independent of the machine architecture, word size -and alignment requirements. For a 32-bit field, the `struct` should use `_Py_CODEUNIT field[2]`. +If an instruction has an inline cache, the layout of its cache is described in +the instruction's definition in [`Python/bytecodes.c`](../Python/bytecodes.c). +The structs defined in [`pycore_code.h`](../Include/internal/pycore_code.h) +allow us to access the cache by casting `next_instr` to a pointer to the relevant +`struct`. The size of such a `struct` must be independent of the machine +architecture, word size and alignment requirements. For a 32-bit field, the +`struct` should use `_Py_CODEUNIT field[2]`. The instruction implementation is responsible for advancing `next_instr` past the inline cache. 
For example, if an instruction's inline cache is four bytes (that is, two code units) in size, From e7c3363bfef367fab5956a82c8044787882a0095 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Thu, 5 Dec 2024 18:20:01 +0000 Subject: [PATCH 07/12] tier 2 --> jit --- InternalDocs/README.md | 2 +- InternalDocs/{tier2.md => jit.md} | 74 ++++++++++++++++--------------- 2 files changed, 40 insertions(+), 36 deletions(-) rename InternalDocs/{tier2.md => jit.md} (68%) diff --git a/InternalDocs/README.md b/InternalDocs/README.md index b777d602c3d6bb..794b4f3c6aad42 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -36,7 +36,7 @@ Program Execution - [The Bytecode Interpreter](interpreter.md) -- [The Tier 2 Interpreter and JIT](tier2.md) +- [The JIT](jit.md) - [Garbage Collector Design](garbage_collector.md) diff --git a/InternalDocs/tier2.md b/InternalDocs/jit.md similarity index 68% rename from InternalDocs/tier2.md rename to InternalDocs/jit.md index fb42b91ddc067c..6ac4a26a0e6965 100644 --- a/InternalDocs/tier2.md +++ b/InternalDocs/jit.md @@ -1,18 +1,21 @@ -# The Tier 2 Interpreter - -The [basic interpreter](interpreter.md), also referred to as the `tier 1` -interpreter, consists of a main loop that executes the bytecode instructions -generated by the [bytecode compiler](compiler.md) and their -[specializations](interpreter.md#Specialization). Runtime optimization in tier 1 -can only be done for one instruction at a time. The `tier 2` interpreter is -based on a mechanism to replace an entire sequence of bytecode instructions, +# The JIT + +The [adaptive interpreter](interpreter.md) consists of a main loop that +executes the bytecode instructions generated by the +[bytecode compiler](compiler.md) and their +[specializations](interpreter.md#Specialization). Runtime optimization in +this interpreter can only be done for one instruction at a time. 
The JIT +is based on a mechanism to replace an entire sequence of bytecode instructions, and this enables optimizations that span multiple instructions. +Historically, the adaptive interpreter was referred to as `tier 1` and +the JIT as `tier 2`. You will see remnants of this in the code. + ## The Optimizer and Executors -The program begins running in tier 1, until a `JUMP_BACKWARD` instruction -determines that it is `hot` because the counter in its -[inline cache](interpreter.md#inline-cache-entries) indicates that is +The program begins running on the adaptive interpreter, until a `JUMP_BACKWARD` +instruction determines that it is "hot" because the counter in its +[inline cache](interpreter.md#inline-cache-entries) indicates that it executed more than some threshold number of times (see [`backoff_counter_triggers`](../Include/internal/pycore_backoff.h)). It then calls the function `_PyOptimizer_Optimize()` in @@ -23,8 +26,9 @@ constructs an object of type an optimized version of the instruction trace beginning at this jump. The optimizer determines where the trace ends, and the executor is set up -to either return to `tier 1` and resume execution, or transfer control -to another executor (see `_PyExitData` in Include/internal/pycore_optimizer.h). +to either return to the adaptive interpreter and resume execution, or +transfer control to another executor (see `_PyExitData` in +Include/internal/pycore_optimizer.h). The executor is stored on the [`code object`](code_objects.md) of the frame, in the `co_executors` field which is an array of executors. The start @@ -32,17 +36,17 @@ instruction of the trace (the `JUMP_BACKWARD`) is replaced by an `ENTER_EXECUTOR` instruction whose `oparg` is equal to the index of the executor in `co_executors`. -## The uop optimizer +## The micro-op optimizer -The optimizer that `_PyOptimizer_Optimize()` runs is configurable -via the `_Py_SetTier2Optimizer()` function (this is used in test -via `_testinternalcapi.set_optimizer()`.) 
+The optimizer that `_PyOptimizer_Optimize()` runs is configurable via the +`_Py_SetTier2Optimizer()` function (this is used in test via +`_testinternalcapi.set_optimizer()`.) -The tier 2 optimizer, `_PyUOpOptimizer_Type`, is defined in -[`Python/optimizer.c`](../Python/optimizer.c). It translates -an instruction trace into a sequence of micro-ops by replacing -each bytecode by an equivalent sequence of micro-ops -(see `_PyOpcode_macro_expansion` in +The micro-op optimizer (abbreviated `uop` to approximate `μop`) is defined in +[`Python/optimizer.c`](../Python/optimizer.c) as the type `_PyUOpOptimizer_Type`. +It translates an instruction trace into a sequence of micro-ops by replacing +each bytecode by an equivalent sequence of micro-ops (see +`_PyOpcode_macro_expansion` in [pycore_opcode_metadata.h](../Include/internal/pycore_opcode_metadata.h) which is generated from [`Python/bytecodes.c`](../Python/bytecodes.c)). The micro-op sequence is then optimized by @@ -50,13 +54,13 @@ The micro-op sequence is then optimized by [`Python/optimizer_analysis.c`](../Python/optimizer_analysis.c) and a `_PyUOpExecutor_Type` is created to contain it. -## Running a uop executor on the tier 2 interpreter +## Debugging a uop executor in the JIT interpreter -After a tier 1 `JUMP_BACKWARD` instruction invokes the uop optimizer -to create a tier 2 uop executor, it transfers control to this executor -via the `GOTO_TIER_TWO` macro. +After a `JUMP_BACKWARD` instruction invokes the uop optimizer to create a uop +executor, it transfers control to this executor via the `GOTO_TIER_TWO` macro. 
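The translation step described above — replacing each bytecode in the trace by an equivalent micro-op sequence — can be sketched roughly as follows. The expansion table here is invented for illustration and is not CPython's real `_PyOpcode_macro_expansion` data:

```python
# Toy macro-expansion table: each bytecode maps to its micro-op sequence.
MACRO_EXPANSION = {
    "LOAD_FAST": ["_LOAD_FAST"],
    "BINARY_OP_ADD_INT": ["_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT"],
    "JUMP_BACKWARD": ["_JUMP_TO_TOP"],
}

def expand_trace(bytecode_trace):
    """Translate a bytecode trace into a flat micro-op sequence."""
    uops = []
    for instr in bytecode_trace:
        uops.extend(MACRO_EXPANSION[instr])
    return uops
```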
-When tier 2 is enabled but the JIT is not (python was configured with +When the JIT is configured to run on its interpreter (i.e., python is +configured with [`--enable-experimental-jit=interpreter`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)), the executor jumps to `tier2_dispatch:` in [`Python/ceval.c`](../Python/ceval.c), where there is a loop that @@ -67,19 +71,19 @@ which is generated by the build script from the bytecode definitions in [`Python/bytecodes.c`](../Python/bytecodes.c). This loop exits when an `_EXIT_TRACE` or `_DEOPT` uop is reached, -and execution returns to teh tier 1 interpreter. +and execution returns to the adaptive interpreter. ## Invalidating Executors In addition to being stored on the code object, each executor is also -inserted into a list of all executors which is stored in the interpreter +inserted into a list of all executors, which is stored in the interpreter state's `executor_list_head` field. This list is used when it is necessary -to invalidate executors because values that their construction depended -on may have changed. +to invalidate executors because values they used in their construction may +have changed. ## The JIT -When the jit is enabled (python was configured with +When the full jit is enabled (python was configured with [`--enable-experimental-jit`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit), the uop executor's `jit_code` field is populated with a pointer to a compiled C function that implement the executor logic. This function's signature is @@ -89,7 +93,7 @@ the uop interpreter at `tier2_dispatch`, the executor runs the function that `jit_code` points to. This function returns the instruction pointer of the next Tier 1 instruction that needs to execute. 
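The contract just described — the executor's compiled function runs and hands back the location of the next bytecode instruction — can be modeled in a few lines. Everything here is an invented stand-in for the C-level `jit_code`/`ENTER_EXECUTOR` machinery:

```python
# Toy model: an executor is a callable that does some work on the frame
# and returns the offset at which bytecode execution should resume.
class Executor:
    def __init__(self, jit_code):
        self.jit_code = jit_code  # stand-in for the compiled function

def enter_executor(frame, executor):
    # analogous to ENTER_EXECUTOR handing control to jit_code
    return executor.jit_code(frame)

def jitted_body(frame):
    frame["acc"] += 10   # the work the optimized trace performs
    return 42            # resume the interpreter at offset 42
```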
-The generation of the jitted fuctions uses the copy-and-patch technique +The generation of the jitted functions uses the copy-and-patch technique which is described in [Haoran Xu's article](https://sillycross.github.io/2023/05/12/2023-05-12/). At its core are statically generated `stencils` for the implementation @@ -113,8 +117,8 @@ functions are used to generate the file that the JIT can use to emit code for each of the bytecodes. For Python maintainers this means that changes to the bytecodes and -their implementations do not require changes related to the JIT, -because everything the JIT needs is automatically generated from +their implementations do not require changes related to the stencils, +because everything is automatically generated from [`Python/bytecodes.c`](../Python/bytecodes.c) at build time. See Also: From 4731d94a79b9d061cc35638ce5a6b37b903efef5 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Thu, 5 Dec 2024 20:42:30 +0000 Subject: [PATCH 08/12] remove adaptive.md --- InternalDocs/adaptive.md | 165 --------------------------------------- 1 file changed, 165 deletions(-) delete mode 100644 InternalDocs/adaptive.md diff --git a/InternalDocs/adaptive.md b/InternalDocs/adaptive.md deleted file mode 100644 index 657d32eca3ee99..00000000000000 --- a/InternalDocs/adaptive.md +++ /dev/null @@ -1,165 +0,0 @@ -# The specializing interpreter - -The specializing interpreter, which was introduced in -[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by -rewriting the bytecode based on runtime information. This is done by replacing -a generic instruction by a faster version that works for the case that this -program encounters. Each specializable instruction is responsible for rewriting -itself, using its [inline caches](interpreter.md#inline-cache-entries) for -bookkeeping. 
- -When a [`CodeObject`](code_objects.md) is created, the function -`_PyCode_Quicken()` from [`Python/specialize.c`](../Python/specialize.c) is -called to initialize the caches of all adaptive instructions. When an -adaptive instruction executes, it may attempt to specialize itself, -depending on the argument and the contents of its cache. This is done -by calling one of the `_Py_Specialize_XXX` functions in -[`Python/specialize.c`](../Python/specialize.c). - -The specialized instructions are responsible for checking that the special-case -assumptions still apply, and de-optimizing back to the generic version if not. - -# Adding or extending a family of adaptive instructions. - -## Families of instructions - -A *family* of instructions consists of an adaptive instruction along with the -specialized instruction that it can be replaced by. -It has the following fundamental properties: - -* It corresponds to a single instruction in the code - generated by the bytecode compiler. -* It has a single adaptive instruction that records an execution count and, - at regular intervals, attempts to specialize itself. If not specializing, - it executes the base implementation. -* It has at least one specialized form of the instruction that is tailored - for a particular value or set of values at runtime. -* All members of the family must have the same number of inline cache entries, - to ensure correct execution. - Individual family members do not need to use all of the entries, - but must skip over any unused entries when executing. - -The current implementation also requires the following, -although these are not fundamental and may change: - -* All families use one or more inline cache entries, - the first entry is always the counter. -* All instruction names should start with the name of the adaptive - instruction. -* Specialized forms should have names describing their specialization. 
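The counter-driven lifecycle described by these properties can be sketched as a toy model. The threshold and names are invented (real instructions use exponential-backoff counters), but the shape — count, specialize, guard, de-optimize — is the one the text describes:

```python
class AdaptiveAdd:
    """Toy adaptive 'add' instruction: it counts executions, specializes
    for the operand type it keeps seeing, and de-optimizes when the
    specialized guard fails."""
    THRESHOLD = 8  # invented; real instructions use backoff counters

    def __init__(self):
        self.counter = self.THRESHOLD
        self.seen_type = None  # plays the role of the inline cache

    def execute(self, a, b):
        if self.seen_type is not None:       # specialized form
            if type(a) is self.seen_type and type(b) is self.seen_type:
                return a + b                 # guards passed: fast path
            self.seen_type = None            # guard failed: de-optimize
        self.counter -= 1
        if self.counter == 0:                # time to try specializing
            self.counter = self.THRESHOLD
            if type(a) is type(b):
                self.seen_type = type(a)
        return a + b                         # generic path
```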
- -## Example family - -The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c)) -already has an adaptive family that serves as a relatively simple example. - -The `LOAD_GLOBAL` instruction performs adaptive specialization, -calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero. - -There are two specialized instructions in the family, `LOAD_GLOBAL_MODULE` -which is specialized for global variables in the module, and -`LOAD_GLOBAL_BUILTIN` which is specialized for builtin variables. - -## Performance analysis - -The benefit of a specialization can be assessed with the following formula: -`Tbase/Tadaptive`. - -Where `Tbase` is the mean time to execute the base instruction, -and `Tadaptive` is the mean time to execute the specialized and adaptive forms. - -`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)` - -`Ti` is the time to execute the `i`th instruction in the family and `Ni` is -the number of times that instruction is executed. -`Tmiss` is the time to process a miss, including de-optimzation -and the time to execute the base instruction. - -The ideal situation is where misses are rare and the specialized -forms are much faster than the base instruction. -`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`. -In which case we have `Tadaptive ≈ sum(Ti*Ni)`. -Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and -`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction, -we would expect the specialization of `LOAD_GLOBAL` to be profitable. - -## Design considerations - -While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and -`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti` -low for all specialized instructions and `Nmiss` as low as possible. - -Keeping `Nmiss` low means that there should be specializations for almost -all values seen by the base instruction. 
Keeping `sum(Ti*Ni)` low means -keeping `Ti` low which means minimizing branches and dependent memory -accesses (pointer chasing). These two objectives may be in conflict, -requiring judgement and experimentation to design the family of instructions. - -The size of the inline cache should as small as possible, -without impairing performance, to reduce the number of -`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache. - -### Gathering data - -Before choosing how to specialize an instruction, it is important to gather -some data. What are the patterns of usage of the base instruction? -Data can best be gathered by instrumenting the interpreter. Since a -specialization function and adaptive instruction are going to be required, -instrumentation can most easily be added in the specialization function. - -### Choice of specializations - -The performance of the specializing adaptive interpreter relies on the -quality of specialization and keeping the overhead of specialization low. - -Specialized instructions must be fast. In order to be fast, -specialized instructions should be tailored for a particular -set of values that allows them to: - -1. Verify that incoming value is part of that set with low overhead. -2. Perform the operation quickly. - -This requires that the set of values is chosen such that membership can be -tested quickly and that membership is sufficient to allow the operation to -performed quickly. - -For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()` -dictionaries that have a keys with the expected version. - -This can be tested quickly: - -* `globals->keys->dk_version == expected_version` - -and the operation can be performed quickly: - -* `value = entries[cache->index].me_value;`. - -Because it is impossible to measure the performance of an instruction without -also measuring unrelated factors, the assessment of the quality of a -specialization will require some judgement. 
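The `LOAD_GLOBAL_MODULE` checks quoted above — a cheap version comparison guarding a cached index — can be mimicked with a toy namespace. The `version` bookkeeping here is invented for illustration; CPython tracks dict-key versions internally:

```python
class Namespace:
    """Toy dict-like namespace whose version changes on every mutation."""
    def __init__(self):
        self.version = 0
        self.slots = []  # (name, value) pairs

    def store(self, name, value):
        self.version += 1
        self.slots.append((name, value))

def specialize_load(ns, name):
    """Build the 'inline cache': a version tag plus a slot index."""
    index = next(i for i, (n, _) in enumerate(ns.slots) if n == name)
    return {"version": ns.version, "index": index}

def load_global(ns, cache):
    if cache["version"] == ns.version:      # the cheap guard
        return ns.slots[cache["index"]][1]  # the fast operation
    raise LookupError("deopt: fall back to a generic lookup")
```

The guard is a single integer comparison and the operation a single indexed load, which is what makes the specialization profitable.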
- -As a general rule, specialized instructions should be much faster than the -base instruction. - -### Implementation of specialized instructions - -In general, specialized instructions should be implemented in two parts: - -1. A sequence of guards, each of the form - `DEOPT_IF(guard-condition-is-false, BASE_NAME)`. -2. The operation, which should ideally have no branches and - a minimum number of dependent memory accesses. - -In practice, the parts may overlap, as data required for guards -can be re-used in the operation. - -If there are branches in the operation, then consider further specialization -to eliminate the branches. - -### Maintaining stats - -Finally, take care that stats are gathered correctly. -After the last `DEOPT_IF` has passed, a hit should be recorded with -`STAT_INC(BASE_INSTRUCTION, hit)`. -After an optimization has been deferred in the adaptive instruction, -that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`. From c4b288b2a4be7640058c698cacdb99bef690a16f Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Thu, 5 Dec 2024 21:30:08 +0000 Subject: [PATCH 09/12] mike's comments --- InternalDocs/code_objects.md | 10 +++++----- InternalDocs/interpreter.md | 2 +- InternalDocs/jit.md | 8 ++++---- 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md index ba7cab210fc087..a91a7043c1b8d4 100644 --- a/InternalDocs/code_objects.md +++ b/InternalDocs/code_objects.md @@ -18,11 +18,11 @@ Code objects are typically produced by the bytecode [compiler](compiler.md), although they are often written to disk by one process and read back in by another. The disk version of a code object is serialized using the [marshal](https://docs.python.org/dev/library/marshal.html) protocol. -When a [`CodeObject`](code_objects.md) is created, the function -`_PyCode_Quicken()` from [`Python/specialize.c`](../Python/specialize.c) is -called to initialize the caches of all adaptive instructions. 
This is -required because the on-disk format is a sequence of bytes, and -some of the caches need to be initialized with 16-bit values. +When a `CodeObject` is created, the function `_PyCode_Quicken()` from +[`Python/specialize.c`](../Python/specialize.c) is called to initialize +the caches of all adaptive instructions. This is required because the +on-disk format is a sequence of bytes, and some of the caches need to be +initialized with 16-bit values. Code objects are nominally immutable. Some fields (including `co_code_adaptive` and fields for runtime diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md index de24e78fcaa307..fa4a54fdc54fac 100644 --- a/InternalDocs/interpreter.md +++ b/InternalDocs/interpreter.md @@ -366,7 +366,7 @@ assumptions still apply, and de-optimizing back to the generic version if not. ## Families of instructions A *family* of instructions consists of an adaptive instruction along with the -specialized instruction that it can be replaced by. +specialized instructions that it can be replaced by. It has the following fundamental properties: * It corresponds to a single instruction in the code diff --git a/InternalDocs/jit.md b/InternalDocs/jit.md index 6ac4a26a0e6965..8e577393b0302c 100644 --- a/InternalDocs/jit.md +++ b/InternalDocs/jit.md @@ -86,7 +86,7 @@ have changed. When the full jit is enabled (python was configured with [`--enable-experimental-jit`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit), the uop executor's `jit_code` field is populated with a pointer to a compiled -C function that implement the executor logic. This function's signature is +C function that implements the executor logic. This function's signature is defined by `jit_func` in [`pycore_jit.h`](Include/internal/pycore_jit.h). 
When the executor is invoked by `ENTER_EXECUTOR`, instead of jumping to the uop interpreter at `tier2_dispatch`, the executor runs the function @@ -101,8 +101,8 @@ of the micro ops, which are completed with runtime information while the jitted code is constructed for an executor by [`_PyJIT_Compile`](../Python/jit.c). -The stencils are generated under the build target `regen-jit` by the scripts -in [`/Tools/jit`](/Tools/jit). This script reads +The stencils are generated at build time under the Makefile target `regen-jit` +by the scripts in [`/Tools/jit`](/Tools/jit). This script reads [`Python/executor_cases.c.h`](../Python/executor_cases.c.h) (which is generated from [`Python/bytecodes.c`](../Python/bytecodes.c)). For each opcode, it constructs a `.c` file that contains a function for @@ -110,7 +110,7 @@ implementing this opcode, with some runtime information injected. This is done by replacing `CASE` by the bytecode definition in the template file [`Tools/jit/template.c`](../Tools/jit/template.c). -Each of the `.c` file is compiled by LLVM, to produce an object file +Each of the `.c` files is compiled by LLVM, to produce an object file that contains a function that executes the opcode. 
These compiled functions are used to generate the file [`jit_stencils.h`](../jit_stencils.h), which contains the functions From aced06a9dca0188b0148c5382ee95b0e6411748c Mon Sep 17 00:00:00 2001 From: Irit Katriel <1055913+iritkatriel@users.noreply.github.com> Date: Fri, 6 Dec 2024 14:40:36 +0000 Subject: [PATCH 10/12] Apply suggestions from code review Co-authored-by: Mark Shannon --- InternalDocs/jit.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/InternalDocs/jit.md b/InternalDocs/jit.md index 8e577393b0302c..e1308806a96f06 100644 --- a/InternalDocs/jit.md +++ b/InternalDocs/jit.md @@ -42,7 +42,7 @@ The optimizer that `_PyOptimizer_Optimize()` runs is configurable via the `_Py_SetTier2Optimizer()` function (this is used in test via `_testinternalcapi.set_optimizer()`.) -The micro-op optimizer (abbreviated `uop` to approximate `μop`) is defined in +The micro-op (abbreviated `uop` to approximate `μop`) optimizer is defined in [`Python/optimizer.c`](../Python/optimizer.c) as the type `_PyUOpOptimizer_Type`. It translates an instruction trace into a sequence of micro-ops by replacing each bytecode by an equivalent sequence of micro-ops (see @@ -52,9 +52,9 @@ which is generated from [`Python/bytecodes.c`](../Python/bytecodes.c)). The micro-op sequence is then optimized by `_Py_uop_analyze_and_optimize` in [`Python/optimizer_analysis.c`](../Python/optimizer_analysis.c) -and a `_PyUOpExecutor_Type` is created to contain it. +and an instance of `_PyUOpExecutor_Type` is created to contain it. -## Debugging a uop executor in the JIT interpreter +## The JIT interpreter After a `JUMP_BACKWARD` instruction invokes the uop optimizer to create a uop executor, it transfers control to this executor via the `GOTO_TIER_TWO` macro. 
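The hand-off described above ends in a loop that dispatches micro-ops one at a time until an exit uop returns control to the bytecode interpreter. A toy dispatch loop might look like this (the uop set and trace format are invented, apart from the `_EXIT_TRACE`/`_DEOPT` names mentioned in the text):

```python
# Toy micro-op dispatch loop: run uops until an exit uop hands control
# back to the bytecode interpreter at some resume offset.
def run_trace(trace, stack):
    for op, arg in trace:
        if op == "_LOAD_CONST":
            stack.append(arg)
        elif op == "_ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op in ("_EXIT_TRACE", "_DEOPT"):
            return arg  # offset where the bytecode interpreter resumes
    raise RuntimeError("a trace must end with an exit micro-op")
```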
From bc8a50bf7eed9f0b49e729ffd5981dda2186a32b Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Fri, 6 Dec 2024 15:14:04 +0000 Subject: [PATCH 11/12] interpreter jit is for debugging --- InternalDocs/jit.md | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/InternalDocs/jit.md b/InternalDocs/jit.md index e1308806a96f06..5b87226b073a7c 100644 --- a/InternalDocs/jit.md +++ b/InternalDocs/jit.md @@ -59,18 +59,24 @@ and an instance of `_PyUOpExecutor_Type` is created to contain it. After a `JUMP_BACKWARD` instruction invokes the uop optimizer to create a uop executor, it transfers control to this executor via the `GOTO_TIER_TWO` macro. -When the JIT is configured to run on its interpreter (i.e., python is -configured with -[`--enable-experimental-jit=interpreter`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)), -the executor jumps to `tier2_dispatch:` in +CPython implements two executors. Here we describe the JIT interpreter, +which is the simpler of them and is therefore useful for debugging and analyzing +the uops generation and optimization stages. To run it, we configure the +JIT to run on its interpreter (i.e., python is configured with +[`--enable-experimental-jit=interpreter`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)). + +When invoked, the executor jumps to the `tier2_dispatch:` label in [`Python/ceval.c`](../Python/ceval.c), where there is a loop that -executes the micro-ops. The micro-ops are are defined in -[`Python/executor_cases.c.h`](../Python/executor_cases.c.h), +executes the micro-ops. The body of this loop is a switch statement over +the uops IDs, reselmbling the one used in the adaptive interpreter. 
+
+The switch implementing the uops is in [`Python/executor_cases.c.h`](../Python/executor_cases.c.h),
 which is generated by the build script
 [`Tools/cases_generator/tier2_generator.py`](../Tools/cases_generator/tier2_generator.py)
 from the bytecode definitions in
 [`Python/bytecodes.c`](../Python/bytecodes.c).
-This loop exits when an `_EXIT_TRACE` or `_DEOPT` uop is reached,
+
+When an `_EXIT_TRACE` or `_DEOPT` uop is reached, the uop interpreter exits
 and execution returns to the adaptive interpreter.
 
 ## Invalidating Executors
From 3bbf6ee453e353ce310577c70debdf651b0e4b31 Mon Sep 17 00:00:00 2001
From: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>
Date: Fri, 6 Dec 2024 16:28:27 +0000
Subject: [PATCH 12/12] Update InternalDocs/jit.md

Co-authored-by: Mark Shannon
---
 InternalDocs/jit.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/InternalDocs/jit.md b/InternalDocs/jit.md
index 5b87226b073a7c..1e9f385d5f87fa 100644
--- a/InternalDocs/jit.md
+++ b/InternalDocs/jit.md
@@ -68,7 +68,7 @@ JIT to run on its interpreter (i.e., python is configured with
 When invoked, the executor jumps to the `tier2_dispatch:` label in
 [`Python/ceval.c`](../Python/ceval.c), where there is a loop that
 executes the micro-ops. The body of this loop is a switch statement over
-the uops IDs, reselmbling the one used in the adaptive interpreter.
+the uops IDs, resembling the one used in the adaptive interpreter.
 
 The switch implementing the uops is in [`Python/executor_cases.c.h`](../Python/executor_cases.c.h),
 which is generated by the build script