@@ -8,49 +8,50 @@
 
 This is used to optimize dict and attribute lookups, among other things.
 
-Python uses three different mechanisms to intern strings:
+Python uses two different mechanisms to intern strings: singletons and
+dynamic interning.
 
-- Singleton strings marked in C source with `_Py_STR` and `_Py_ID` macros.
-  These are statically allocated, and collected using `make regen-global-objects`
-  (`Tools/build/generate_global_objects.py`), which generates code
-  for declaration, initialization and finalization.
+## Singletons
 
-  The difference between the two kinds is not important. (A `_Py_ID` string is
-  a valid C name, with which we can refer to it; a `_Py_STR` may e.g. contain
-  non-identifier characters, so it needs a separate C-compatible name.)
+The 256 possible one-character latin-1 strings, which can be retrieved with
+`_Py_LATIN1_CHR(c)`, are stored in statically allocated arrays,
+`_PyRuntime.static_objects.strings.ascii` and
+`_PyRuntime.static_objects.strings.latin1`.
 
-  The empty string is in this category (as `_Py_STR(empty)`).
+Longer singleton strings are marked in C source with `_Py_ID` (if the string
+is a valid C identifier fragment) or `_Py_STR` (if it needs a separate
+C-compatible name.)
+These are also stored in statically allocated arrays.
+They are collected from CPython sources using `make regen-global-objects`
+(`Tools/build/generate_global_objects.py`), which generates code
+for declaration, initialization and finalization.
 
-  These singletons are interned in a runtime-global lookup table,
-  `_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`),
-  at runtime initialization.
+The empty string is one of the singletons: `_Py_STR(empty)`.
 
-- The 256 possible one-character latin-1 strings are singletons,
-  which can be retrieved with `_Py_LATIN1_CHR(c)`, are stored in runtime-global
-  arrays, `_PyRuntime.static_objects.strings.ascii` and
-  `_PyRuntime.static_objects.strings.latin1`.
+The three sets of singletons (`_Py_LATIN1_CHR`, `_Py_ID`, `_Py_STR`)
+are disjoint.
+If you have such a singleton, it (and no other copy) will be interned.
 
-  These are NOT interned at startup in the normal build.
+These singletons are interned in a runtime-global lookup table,
+`_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`),
+at runtime initialization, and immutable until it's torn down
+at runtime finalization.
+It is shared across threads and interpreters without any synchronization.
 
-  Interning a one-char latin-1 string will always intern the corresponding
-  singleton.
 
-- All other strings are allocated dynamically, and have their
-  `_PyUnicode_STATE(s).statically_allocated` flag set to zero.
-  When interned, such strings are added to an interpreter-wide dict,
-  `PyInterpreterState.cached_objects.interned_strings`.
+## Dynamically allocated strings
 
-  The key and value of each entry in this dict reference the same object.
+All other strings are allocated dynamically, and have their
+`_PyUnicode_STATE(s).statically_allocated` flag set to zero.
+When interned, such strings are added to an interpreter-wide dict,
+`PyInterpreterState.cached_objects.interned_strings`.
 
-The three sets of singletons (`_Py_STR`, `_Py_ID`, `_Py_LATIN1_CHR`)
-are disjoint.
-If you have such a singleton, it (and no other copy) will be interned.
+The key and value of each entry in this dict reference the same object.
 
 
 ## Immortality and reference counting
 
-Invariant: Every immortal string is interned, *except* the one-char latin-1
-singletons (which might but might not be interned).
+Invariant: Every immortal string is interned.
 
 In practice, this means that you must not use `_Py_SetImmortal` on
 a string. (If you know it's already immortal, don't immortalize it;
@@ -113,8 +114,5 @@ The valid transitions between these states are:
 Using `_PyUnicode_InternStatic` on these is an error; the other cases
 don't change the state.
 
-- One-char latin-1 singletons can be interned (0 -> 3) using any interning
-  function; after that the functions don't change the state.
-
-- Other statically allocated strings are interned (0 -> 3) at runtime init;
+- Singletons are interned (0 -> 3) at runtime init;
   after that all interning functions don't change the state.
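Reviewer note: the dynamic-interning path this change documents can be observed from Python through `sys.intern`. The following is only a minimal sketch, not part of the patch. The variable names are illustrative, and whether the two strings are distinct objects before interning is a CPython implementation detail, so that value is recorded but not relied on.

```python
import sys

# Build two equal strings at runtime so they are not merged as
# compile-time constants (an assumption about typical CPython behavior).
parts = ["string", "interning", "demo"]
a = "_".join(parts) + "!"
b = "_".join(parts) + "!"

# Implementation detail: usually False on CPython, but not guaranteed.
same_before = a is b

# sys.intern returns the canonical interned object for an equal string,
# so both calls must return the very same object.
ia = sys.intern(a)
ib = sys.intern(b)
same_after = ia is ib

print("identical before interning:", same_before)
print("identical after interning:", same_after)
```

Once both strings have gone through `sys.intern`, equality checks between them can be settled by an identity comparison, which is the lookup optimization the documented text refers to.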