Commit 1d4a531

[3.12] pythongh-122291: Intern latin-1 one-byte strings at startup (pythonGH-122303)
1 parent 18b9ade commit 1d4a531

2 files changed, 47 additions and 55 deletions
InternalDocs/string_interning.md

Lines changed: 31 additions & 33 deletions
@@ -8,49 +8,50 @@

 This is used to optimize dict and attribute lookups, among other things.

-Python uses three different mechanisms to intern strings:
+Python uses two different mechanisms to intern strings: singletons and
+dynamic interning.

-- Singleton strings marked in C source with `_Py_STR` and `_Py_ID` macros.
-  These are statically allocated, and collected using `make regen-global-objects`
-  (`Tools/build/generate_global_objects.py`), which generates code
-  for declaration, initialization and finalization.
+## Singletons

-  The difference between the two kinds is not important. (A `_Py_ID` string is
-  a valid C name, with which we can refer to it; a `_Py_STR` may e.g. contain
-  non-identifier characters, so it needs a separate C-compatible name.)
+The 256 possible one-character latin-1 strings, which can be retrieved with
+`_Py_LATIN1_CHR(c)`, are stored in statically allocated arrays,
+`_PyRuntime.static_objects.strings.ascii` and
+`_PyRuntime.static_objects.strings.latin1`.

-  The empty string is in this category (as `_Py_STR(empty)`).
+Longer singleton strings are marked in C source with `_Py_ID` (if the string
+is a valid C identifier fragment) or `_Py_STR` (if it needs a separate
+C-compatible name.)
+These are also stored in statically allocated arrays.
+They are collected from CPython sources using `make regen-global-objects`
+(`Tools/build/generate_global_objects.py`), which generates code
+for declaration, initialization and finalization.

-  These singletons are interned in a runtime-global lookup table,
-  `_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`),
-  at runtime initialization.
+The empty string is one of the singletons: `_Py_STR(empty)`.

-- The 256 possible one-character latin-1 strings are singletons,
-  which can be retrieved with `_Py_LATIN1_CHR(c)`, are stored in runtime-global
-  arrays, `_PyRuntime.static_objects.strings.ascii` and
-  `_PyRuntime.static_objects.strings.latin1`.
+The three sets of singletons (`_Py_LATIN1_CHR`, `_Py_ID`, `_Py_STR`)
+are disjoint.
+If you have such a singleton, it (and no other copy) will be interned.

-  These are NOT interned at startup in the normal build.
+These singletons are interned in a runtime-global lookup table,
+`_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`),
+at runtime initialization, and immutable until it's torn down
+at runtime finalization.
+It is shared across threads and interpreters without any synchronization.

-  Interning a one-char latin-1 string will always intern the corresponding
-  singleton.

-- All other strings are allocated dynamically, and have their
-  `_PyUnicode_STATE(s).statically_allocated` flag set to zero.
-  When interned, such strings are added to an interpreter-wide dict,
-  `PyInterpreterState.cached_objects.interned_strings`.
+## Dynamically allocated strings

-  The key and value of each entry in this dict reference the same object.
+All other strings are allocated dynamically, and have their
+`_PyUnicode_STATE(s).statically_allocated` flag set to zero.
+When interned, such strings are added to an interpreter-wide dict,
+`PyInterpreterState.cached_objects.interned_strings`.

-The three sets of singletons (`_Py_STR`, `_Py_ID`, `_Py_LATIN1_CHR`)
-are disjoint.
-If you have such a singleton, it (and no other copy) will be interned.
+The key and value of each entry in this dict reference the same object.


 ## Immortality and reference counting

-Invariant: Every immortal string is interned, *except* the one-char latin-1
-singletons (which might but might not be interned).
+Invariant: Every immortal string is interned.

 In practice, this means that you must not use `_Py_SetImmortal` on
 a string. (If you know it's already immortal, don't immortalize it;
@@ -113,8 +114,5 @@ The valid transitions between these states are:
   Using `_PyUnicode_InternStatic` on these is an error; the other cases
   don't change the state.

-- One-char latin-1 singletons can be interned (0 -> 3) using any interning
-  function; after that the functions don't change the state.
-
-- Other statically allocated strings are interned (0 -> 3) at runtime init;
+- Singletons are interned (0 -> 3) at runtime init;
   after that all interning functions don't change the state.
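
Note: the startup behaviour this documentation change describes can be modelled in a few lines of standalone C. This is a toy sketch only, not CPython code: `ToyStr`, `toy_latin1`, `toy_interned`, `toy_runtime_init` and `toy_latin1_chr` are hypothetical names, and a boolean `interned` flag stands in for the 0 -> 3 state transition. It only illustrates the new invariant that every one-character latin-1 singleton is already interned once initialization has run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a string object; this is NOT the CPython layout. */
typedef struct {
    unsigned char ch;          /* the single latin-1 code point */
    bool statically_allocated; /* mirrors the flag named in the document */
    bool interned;             /* false = state 0, true = state 3 */
} ToyStr;

/* One statically allocated singleton per latin-1 code point. */
static ToyStr toy_latin1[256];

/* Toy "runtime-global lookup table", indexed by code point. */
static ToyStr *toy_interned[256];

/* Intern every singleton exactly once, before anything can ask for one. */
static void toy_runtime_init(void)
{
    for (int i = 0; i < 256; i++) {
        ToyStr *s = &toy_latin1[i];
        s->ch = (unsigned char)i;
        s->statically_allocated = true;
        assert(!s->interned);   /* interning a singleton twice is a bug */
        toy_interned[i] = s;
        s->interned = true;
    }
}

/* Return the singleton for a code point; after init it is always interned. */
static ToyStr *toy_latin1_chr(unsigned char c)
{
    ToyStr *s = &toy_latin1[c];
    assert(s->interned);
    return s;
}
```

Calling `toy_runtime_init()` once and then `toy_latin1_chr('a')` returns the same pointer every time, which is the property the rewritten "Singletons" section relies on.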

Objects/unicodeobject.c

Lines changed: 16 additions & 22 deletions
@@ -329,14 +329,20 @@ init_global_interned_strings(PyInterpreterState *interp)
         return _PyStatus_ERR("failed to create global interned dict");
     }

-    /* Intern statically allocated string identifiers and deepfreeze strings.
+    /* Intern statically allocated string identifiers, deepfreeze strings,
+     * and one-byte latin-1 strings.
      * This must be done before any module initialization so that statically
      * allocated string identifiers are used instead of heap allocated strings.
      * Deepfreeze uses the interned identifiers if present to save space
      * else generates them and they are interned to speed up dict lookups.
      */
     _PyUnicode_InitStaticStrings(interp);

+    for (int i = 0; i < 256; i++) {
+        PyObject *s = LATIN1(i);
+        _PyUnicode_InternStatic(interp, &s);
+        assert(s == LATIN1(i));
+    }
 #ifdef Py_DEBUG
     assert(_PyUnicode_CheckConsistency(&_Py_STR(empty), 1));


@@ -14889,23 +14895,14 @@ intern_static(PyInterpreterState *interp, PyObject *s /* stolen */)
     assert(s != NULL);
     assert(_PyUnicode_CHECK(s));
     assert(_PyUnicode_STATE(s).statically_allocated);
-    assert(_Py_IsImmortal(s));
-
-    switch (PyUnicode_CHECK_INTERNED(s)) {
-        case SSTATE_NOT_INTERNED:
-            break;
-        case SSTATE_INTERNED_IMMORTAL_STATIC:
-            return s;
-        default:
-            Py_FatalError("_PyUnicode_InternStatic called on wrong string");
-    }
+    assert(!PyUnicode_CHECK_INTERNED(s));

 #ifdef Py_DEBUG
     /* We must not add process-global interned string if there's already a
      * per-interpreter interned_dict, which might contain duplicates.
-     * Except "short string" singletons: those are special-cased. */
+     */
     PyObject *interned = get_interned_dict(interp);
-    assert(interned == NULL || unicode_is_singleton(s));
+    assert(interned == NULL);
 #endif

     /* Look in the global cache first. */
/* Look in the global cache first. */
@@ -14977,11 +14974,6 @@ intern_common(PyInterpreterState *interp, PyObject *s /* stolen */,
         return s;
     }

-    /* Handle statically allocated strings. */
-    if (_PyUnicode_STATE(s).statically_allocated) {
-        return intern_static(interp, s);
-    }
-
     /* Is it already interned? */
     switch (PyUnicode_CHECK_INTERNED(s)) {
         case SSTATE_NOT_INTERNED:
@@ -14998,18 +14990,20 @@ intern_common(PyInterpreterState *interp, PyObject *s /* stolen */,
             return s;
     }

+    if (_PyUnicode_STATE(s).statically_allocated) {
+        return intern_static(interp, s);
+    }
+
     /* If it's already immortal, intern it as such */
     if (_Py_IsImmortal(s)) {
         immortalize = 1;
     }

-    /* if it's a short string, get the singleton -- and intern it */
+    /* if it's a short string, get the singleton */
     if (PyUnicode_GET_LENGTH(s) == 1 &&
                 PyUnicode_KIND(s) == PyUnicode_1BYTE_KIND) {
         PyObject *r = LATIN1(*(unsigned char*)PyUnicode_DATA(s));
-        if (!PyUnicode_CHECK_INTERNED(r)) {
-            r = intern_static(interp, r);
-        }
+        assert(PyUnicode_CHECK_INTERNED(r));
         Py_DECREF(s);
         return r;
     }
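
Note: the reordered checks in `intern_common` can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the CPython implementation: `ToyStr`, `toy_singletons`, `toy_init`, `toy_new` and `toy_intern` are hypothetical names, the model only covers one-character strings, and `free` stands in for `Py_DECREF` on the stolen reference. It shows the order used after this commit: check "already interned" first, then swap a heap-allocated short string for its pre-interned singleton.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy one-character string object; NOT the CPython layout. */
typedef struct {
    unsigned char ch;
    bool statically_allocated;
    bool interned;
} ToyStr;

/* Statically allocated singletons, one per latin-1 code point. */
static ToyStr toy_singletons[256];

/* Model of runtime init: every singleton starts out interned. */
static void toy_init(void)
{
    for (int i = 0; i < 256; i++) {
        toy_singletons[i].ch = (unsigned char)i;
        toy_singletons[i].statically_allocated = true;
        toy_singletons[i].interned = true;
    }
}

/* Allocate a non-interned heap copy; returns NULL if malloc fails. */
static ToyStr *toy_new(unsigned char c)
{
    ToyStr *s = malloc(sizeof *s);
    if (s != NULL) {
        s->ch = c;
        s->statically_allocated = false;
        s->interned = false;
    }
    return s;
}

/* Takes ownership of s and returns the interned string for its character. */
static ToyStr *toy_intern(ToyStr *s)
{
    /* Is it already interned? Singletons always take this path after init. */
    if (s->interned) {
        return s;
    }
    /* In this model a static string that is not interned means init was skipped. */
    assert(!s->statically_allocated);

    /* Short string: hand back the singleton, which must already be interned. */
    ToyStr *r = &toy_singletons[s->ch];
    assert(r->interned);
    free(s);   /* drop the stolen reference to the heap copy */
    return r;
}
```

In this model, interning a fresh heap copy of `'x'` returns `&toy_singletons['x']`, and interning that result again returns it unchanged.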
