Commit 4092510

[3.13] gh-122291: Intern latin-1 one-byte strings at startup (GH-122303) (GH-122347)
(cherry picked from commit bb09ba6) Co-authored-by: Petr Viktorin <[email protected]>
1 parent 6b9a5af commit 4092510

2 files changed: 40 additions and 62 deletions
InternalDocs/string_interning.md

Lines changed: 31 additions & 35 deletions
@@ -8,51 +8,50 @@
 
 This is used to optimize dict and attribute lookups, among other things.
 
-Python uses three different mechanisms to intern strings:
+Python uses two different mechanisms to intern strings: singletons and
+dynamic interning.
 
-- Singleton strings marked in C source with `_Py_STR` and `_Py_ID` macros.
-These are statically allocated, and collected using `make regen-global-objects`
-(`Tools/build/generate_global_objects.py`), which generates code
-for declaration, initialization and finalization.
+## Singletons
 
-The difference between the two kinds is not important. (A `_Py_ID` string is
-a valid C name, with which we can refer to it; a `_Py_STR` may e.g. contain
-non-identifier characters, so it needs a separate C-compatible name.)
+The 256 possible one-character latin-1 strings, which can be retrieved with
+`_Py_LATIN1_CHR(c)`, are stored in statically allocated arrays,
+`_PyRuntime.static_objects.strings.ascii` and
+`_PyRuntime.static_objects.strings.latin1`.
 
-The empty string is in this category (as `_Py_STR(empty)`).
+Longer singleton strings are marked in C source with `_Py_ID` (if the string
+is a valid C identifier fragment) or `_Py_STR` (if it needs a separate
+C-compatible name.)
+These are also stored in statically allocated arrays.
+They are collected from CPython sources using `make regen-global-objects`
+(`Tools/build/generate_global_objects.py`), which generates code
+for declaration, initialization and finalization.
 
-These singletons are interned in a runtime-global lookup table,
-`_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`),
-at runtime initialization.
+The empty string is one of the singletons: `_Py_STR(empty)`.
 
-- The 256 possible one-character latin-1 strings are singletons,
-which can be retrieved with `_Py_LATIN1_CHR(c)`, are stored in runtime-global
-arrays, `_PyRuntime.static_objects.strings.ascii` and
-`_PyRuntime.static_objects.strings.latin1`.
+The three sets of singletons (`_Py_LATIN1_CHR`, `_Py_ID`, `_Py_STR`)
+are disjoint.
+If you have such a singleton, it (and no other copy) will be interned.
 
-These are NOT interned at startup in the normal build.
-In the free-threaded build, they are; this avoids modifying the
-global lookup table after threads are started.
+These singletons are interned in a runtime-global lookup table,
+`_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`),
+at runtime initialization, and immutable until it's torn down
+at runtime finalization.
+It is shared across threads and interpreters without any synchronization.
 
-Interning a one-char latin-1 string will always intern the corresponding
-singleton.
 
-- All other strings are allocated dynamically, and have their
-`_PyUnicode_STATE(s).statically_allocated` flag set to zero.
-When interned, such strings are added to an interpreter-wide dict,
-`PyInterpreterState.cached_objects.interned_strings`.
+## Dynamically allocated strings
 
-The key and value of each entry in this dict reference the same object.
+All other strings are allocated dynamically, and have their
+`_PyUnicode_STATE(s).statically_allocated` flag set to zero.
+When interned, such strings are added to an interpreter-wide dict,
+`PyInterpreterState.cached_objects.interned_strings`.
 
-The three sets of singletons (`_Py_STR`, `_Py_ID`, `_Py_LATIN1_CHR`)
-are disjoint.
-If you have such a singleton, it (and no other copy) will be interned.
+The key and value of each entry in this dict reference the same object.
 
 
 ## Immortality and reference counting
 
-Invariant: Every immortal string is interned, *except* the one-char latin-1
-singletons (which might but might not be interned).
+Invariant: Every immortal string is interned.
 
 In practice, this means that you must not use `_Py_SetImmortal` on
 a string. (If you know it's already immortal, don't immortalize it;
@@ -115,8 +114,5 @@ The valid transitions between these states are:
 Using `_PyUnicode_InternStatic` on these is an error; the other cases
 don't change the state.
 
-- One-char latin-1 singletons can be interned (0 -> 3) using any interning
-function; after that the functions don't change the state.
-
-- Other statically allocated strings are interned (0 -> 3) at runtime init;
+- Singletons are interned (0 -> 3) at runtime init;
 after that all interning functions don't change the state.
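
The two mechanisms described in the updated document can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not CPython code: `toy_table`, `toy_find`, `toy_runtime_init` and `toy_intern` are hypothetical names, plain C strings stand in for `PyObject *`, and a linear scan stands in for the real hash tables. It shows a global table that is filled once at startup and only read afterwards, plus a per-interpreter table in which each entry is both key and value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: every name below is hypothetical, not CPython's API. */
#define MAX_ENTRIES 64

typedef struct {
    const char *items[MAX_ENTRIES];  /* each entry is both key and value */
    size_t count;
} toy_table;

/* Stands in for the runtime-global table of singletons. */
static toy_table toy_global_table;

/* Return the stored pointer whose contents equal `s`, or NULL. */
static const char *toy_find(const toy_table *t, const char *s)
{
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->items[i], s) == 0) {
            return t->items[i];
        }
    }
    return NULL;
}

/* Fill the global table once, before any dynamic interning happens. */
static void toy_runtime_init(void)
{
    static const char *const singletons[] = {"", "a", "__init__"};
    toy_global_table.count = 0;
    for (size_t i = 0; i < sizeof(singletons) / sizeof(singletons[0]); i++) {
        toy_global_table.items[toy_global_table.count++] = singletons[i];
    }
}

/* Intern `s`: consult the global table first, then the per-interpreter one. */
static const char *toy_intern(toy_table *interp_table, const char *s)
{
    const char *hit = toy_find(&toy_global_table, s);
    if (hit != NULL) {
        return hit;   /* the singleton wins; the global table is only read */
    }
    hit = toy_find(interp_table, s);
    if (hit != NULL) {
        return hit;   /* an equal string was interned earlier */
    }
    assert(interp_table->count < MAX_ENTRIES);
    interp_table->items[interp_table->count++] = s;
    return s;         /* `s` itself becomes the canonical copy */
}
```

In this model, a caller runs `toy_runtime_init()` once and then calls `toy_intern()` with its own `toy_table`. Because the global table is never written after initialization, reads from it would need no locking, which mirrors the "shared across threads and interpreters without any synchronization" statement in the document.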

Objects/unicodeobject.c

Lines changed: 9 additions & 27 deletions
@@ -320,22 +320,20 @@ init_global_interned_strings(PyInterpreterState *interp)
         return _PyStatus_ERR("failed to create global interned dict");
     }
 
-    /* Intern statically allocated string identifiers and deepfreeze strings.
+    /* Intern statically allocated string identifiers, deepfreeze strings,
+     * and one-byte latin-1 strings.
      * This must be done before any module initialization so that statically
      * allocated string identifiers are used instead of heap allocated strings.
      * Deepfreeze uses the interned identifiers if present to save space
      * else generates them and they are interned to speed up dict lookups.
     */
     _PyUnicode_InitStaticStrings(interp);
 
-#ifdef Py_GIL_DISABLED
-// In the free-threaded build, intern the 1-byte strings as well
     for (int i = 0; i < 256; i++) {
         PyObject *s = LATIN1(i);
         _PyUnicode_InternStatic(interp, &s);
         assert(s == LATIN1(i));
     }
-#endif
 #ifdef Py_DEBUG
     assert(_PyUnicode_CheckConsistency(&_Py_STR(empty), 1));
 
@@ -15051,26 +15049,14 @@ intern_static(PyInterpreterState *interp, PyObject *s /* stolen */)
     assert(s != NULL);
     assert(_PyUnicode_CHECK(s));
     assert(_PyUnicode_STATE(s).statically_allocated);
-
-    switch (PyUnicode_CHECK_INTERNED(s)) {
-        case SSTATE_NOT_INTERNED:
-            break;
-        case SSTATE_INTERNED_IMMORTAL_STATIC:
-            return s;
-        default:
-            Py_FatalError("_PyUnicode_InternStatic called on wrong string");
-    }
+    assert(!PyUnicode_CHECK_INTERNED(s));
 
 #ifdef Py_DEBUG
     /* We must not add process-global interned string if there's already a
      * per-interpreter interned_dict, which might contain duplicates.
-     * Except "short string" singletons: those are special-cased. */
+     */
     PyObject *interned = get_interned_dict(interp);
-    assert(interned == NULL || unicode_is_singleton(s));
-#ifdef Py_GIL_DISABLED
-    // In the free-threaded build, don't allow even the short strings.
     assert(interned == NULL);
-#endif
 #endif
 
     /* Look in the global cache first. */
@@ -15142,11 +15128,6 @@ intern_common(PyInterpreterState *interp, PyObject *s /* stolen */,
         return s;
     }
 
-    /* Handle statically allocated strings. */
-    if (_PyUnicode_STATE(s).statically_allocated) {
-        return intern_static(interp, s);
-    }
-
     /* Is it already interned? */
     switch (PyUnicode_CHECK_INTERNED(s)) {
         case SSTATE_NOT_INTERNED:
@@ -15163,6 +15144,9 @@ intern_common(PyInterpreterState *interp, PyObject *s /* stolen */,
             return s;
     }
 
+    /* Statically allocated strings must be already interned. */
+    assert(!_PyUnicode_STATE(s).statically_allocated);
+
 #if Py_GIL_DISABLED
     /* In the free-threaded build, all interned strings are immortal */
     immortalize = 1;
@@ -15173,13 +15157,11 @@ intern_common(PyInterpreterState *interp, PyObject *s /* stolen */,
         immortalize = 1;
     }
 
-    /* if it's a short string, get the singleton -- and intern it */
+    /* if it's a short string, get the singleton */
     if (PyUnicode_GET_LENGTH(s) == 1 &&
             PyUnicode_KIND(s) == PyUnicode_1BYTE_KIND) {
         PyObject *r = LATIN1(*(unsigned char*)PyUnicode_DATA(s));
-        if (!PyUnicode_CHECK_INTERNED(r)) {
-            r = intern_static(interp, r);
-        }
+        assert(PyUnicode_CHECK_INTERNED(r));
         Py_DECREF(s);
         return r;
     }
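
The control flow that `intern_common` follows after this change can be modelled in isolation. The sketch below is an illustration under stated assumptions rather than the CPython implementation: `sketch_str`, `sketch_latin1`, `sketch_init_latin1` and `sketch_intern` are hypothetical names, reference counting and the interned dict are left out, and only the two state values mentioned in the document (0 and 3) are modelled. It shows that once every one-byte singleton is interned at startup, the short-string path only needs to look up the singleton and assert its state.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; all names here are hypothetical stand-ins. */
enum sketch_state {
    SKETCH_NOT_INTERNED = 0,
    SKETCH_INTERNED_IMMORTAL_STATIC = 3
};

typedef struct {
    unsigned char data[2];      /* one byte plus a NUL terminator */
    size_t length;
    int statically_allocated;
    enum sketch_state interned;
} sketch_str;

/* Stands in for the statically allocated one-byte singleton arrays. */
static sketch_str sketch_latin1[256];

/* Startup: build and intern all 256 one-byte strings (0 -> 3). */
static void sketch_init_latin1(void)
{
    for (int i = 0; i < 256; i++) {
        sketch_str *s = &sketch_latin1[i];
        s->data[0] = (unsigned char)i;
        s->data[1] = '\0';
        s->length = 1;
        s->statically_allocated = 1;
        s->interned = SKETCH_INTERNED_IMMORTAL_STATIC;
    }
}

/* Interning after startup: a one-byte string resolves to its singleton,
 * which must already be interned, so nothing is modified on this path. */
static sketch_str *sketch_intern(sketch_str *s)
{
    if (s->interned != SKETCH_NOT_INTERNED) {
        return s;               /* already interned: state is unchanged */
    }
    /* Statically allocated strings must be already interned. */
    assert(!s->statically_allocated);
    if (s->length == 1) {
        sketch_str *r = &sketch_latin1[s->data[0]];
        assert(r->interned == SKETCH_INTERNED_IMMORTAL_STATIC);
        return r;
    }
    return s;                   /* longer strings: dict insertion omitted */
}
```

A caller would run `sketch_init_latin1()` once and then pass any dynamically created one-byte string to `sketch_intern()`, which returns the matching entry of `sketch_latin1`. The ordering of the checks follows the patched function: already-interned strings return early, so a statically allocated string reaching the later assertion would indicate a missed startup step.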
