Commit 6f3024f

vstinner authored and serhiy-storchaka committed
pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate
  doc containing null bytes.

Co-authored-by: Serhiy Storchaka <[email protected]>
1 parent ab3fe5f commit 6f3024f

8 files changed: +49 -25 lines changed


Doc/c-api/unicode.rst

Lines changed: 8 additions & 0 deletions
@@ -992,11 +992,19 @@ These are the UTF-8 codec APIs:
 
    As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
 
+   Raise an exception if the *unicode* string contains embedded null
+   characters. To accept embedded null characters and truncate on purpose
+   at the first null byte, ``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be
+   used instead.
+
    .. versionadded:: 3.3
 
    .. versionchanged:: 3.7
       The return type is now ``const char *`` rather of ``char *``.
 
+   .. versionchanged:: 3.13
+      Raise an exception if the string contains embedded null characters.
+
 
 UTF-32 Codecs
 """""""""""""

Doc/whatsnew/3.13.rst

Lines changed: 6 additions & 0 deletions
@@ -1109,6 +1109,12 @@ Porting to Python 3.13
   are now undefined by ``<Python.h>``.
   (Contributed by Victor Stinner in :gh:`85283`.)
 
+* The :c:func:`PyUnicode_AsUTF8` function now raises an exception if the string
+  contains embedded null characters. To accept embedded null characters and
+  truncate on purpose at the first null byte,
+  ``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be used instead.
+  (Contributed by Victor Stinner in :gh:`111089`.)
+
 Deprecated
 ----------
 

Include/cpython/unicodeobject.h

Lines changed: 10 additions & 10 deletions
@@ -442,18 +442,18 @@ PyAPI_FUNC(PyObject*) PyUnicode_FromKindAndData(
 
 /* --- Manage the default encoding ---------------------------------------- */
 
-/* Returns a pointer to the default encoding (UTF-8) of the
-   Unicode object unicode.
-
-   Like PyUnicode_AsUTF8AndSize(), this also caches the UTF-8 representation
-   in the unicodeobject.
-
-   Use of this API is DEPRECATED since no size information can be
-   extracted from the returned data.
-*/
-
+// Returns a pointer to the default encoding (UTF-8) of the
+// Unicode object unicode.
+//
+// Raise an exception if the string contains embedded null characters.
+// Use PyUnicode_AsUTF8AndSize() to accept embedded null characters.
+//
+// This function caches the UTF-8 encoded string in the Unicode object
+// and subsequent calls will return the same string. The memory is released
+// when the Unicode object is deallocated.
 PyAPI_FUNC(const char *) PyUnicode_AsUTF8(PyObject *unicode);
 
+
 /* === Characters Type APIs =============================================== */
 
 /* These should not be used directly. Use the Py_UNICODE_IS* and

Include/unicodeobject.h

Lines changed: 9 additions & 11 deletions
@@ -443,17 +443,15 @@ PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
     PyObject *unicode           /* Unicode object */
     );
 
-/* Returns a pointer to the default encoding (UTF-8) of the
-   Unicode object unicode and the size of the encoded representation
-   in bytes stored in *size.
-
-   In case of an error, no *size is set.
-
-   This function caches the UTF-8 encoded string in the unicodeobject
-   and subsequent calls will return the same string. The memory is released
-   when the unicodeobject is deallocated.
-*/
-
+// Returns a pointer to the default encoding (UTF-8) of the
+// Unicode object unicode and the size of the encoded representation
+// in bytes stored in `*size` (if size is not NULL).
+//
+// On error, `*size` is set to 0 (if size is not NULL).
+//
+// This function caches the UTF-8 encoded string in the Unicode object
+// and subsequent calls will return the same string. The memory is released
+// when the Unicode object is deallocated.
 #if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030A0000
 PyAPI_FUNC(const char *) PyUnicode_AsUTF8AndSize(
     PyObject *unicode,

Lib/test/test_capi/test_unicode.py

Lines changed: 4 additions & 1 deletion
@@ -882,7 +882,10 @@ def test_asutf8(self):
         self.assertEqual(unicode_asutf8('abc', 4), b'abc\0')
         self.assertEqual(unicode_asutf8('абв', 7), b'\xd0\xb0\xd0\xb1\xd0\xb2\0')
         self.assertEqual(unicode_asutf8('\U0001f600', 5), b'\xf0\x9f\x98\x80\0')
-        self.assertEqual(unicode_asutf8('abc\0def', 8), b'abc\0def\0')
+
+        # disallow embedded null characters
+        self.assertRaises(ValueError, unicode_asutf8, 'abc\0', 0)
+        self.assertRaises(ValueError, unicode_asutf8, 'abc\0def', 0)
 
         self.assertRaises(UnicodeEncodeError, unicode_asutf8, '\ud8ff', 0)
         self.assertRaises(TypeError, unicode_asutf8, b'abc', 0)
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+The :c:func:`PyUnicode_AsUTF8` function now raises an exception if the
+string contains embedded null characters. Patch by Victor Stinner.

Objects/typeobject.c

Lines changed: 3 additions & 2 deletions
@@ -3499,13 +3499,14 @@ type_new_set_doc(PyTypeObject *type)
         return 0;
     }
 
-    const char *doc_str = PyUnicode_AsUTF8(doc);
+    Py_ssize_t doc_size;
+    const char *doc_str = PyUnicode_AsUTF8AndSize(doc, &doc_size);
     if (doc_str == NULL) {
         return -1;
     }
 
     // Silently truncate the docstring if it contains a null byte
-    Py_ssize_t size = strlen(doc_str) + 1;
+    Py_ssize_t size = doc_size + 1;
     char *tp_doc = (char *)PyObject_Malloc(size);
     if (tp_doc == NULL) {
         PyErr_NoMemory();
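
The hunk above sizes the tp_doc buffer from the encoded length rather than from strlen(). A rough standalone sketch of what that implies, assuming the buffer is then filled with a plain byte copy (the copy itself is outside the hunk shown). `copy_utf8` is a hypothetical helper, not a CPython function, and malloc() stands in for PyObject_Malloc():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of CPython): duplicate `size` bytes of
   `src` together with the null terminator that follows them.
   Returns NULL if the allocation fails; the caller frees the result
   with free(). */
static char *
copy_utf8(const char *src, size_t size)
{
    assert(src != NULL);
    char *dst = malloc(size + 1);
    if (dst == NULL) {
        return NULL;
    }
    /* Copies every byte, including any embedded null bytes. */
    memcpy(dst, src, size + 1);
    return dst;
}
```

Under that assumption the whole encoded string is stored, but any consumer that reads the copy as a C string still stops at the first null byte, which is the "silent truncation" the code comment refers to.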

Objects/unicodeobject.c

Lines changed: 7 additions & 1 deletion
@@ -3837,7 +3837,13 @@ PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *psize)
 const char *
 PyUnicode_AsUTF8(PyObject *unicode)
 {
-    return PyUnicode_AsUTF8AndSize(unicode, NULL);
+    Py_ssize_t size;
+    const char *utf8 = PyUnicode_AsUTF8AndSize(unicode, &size);
+    if (utf8 != NULL && strlen(utf8) != (size_t)size) {
+        PyErr_SetString(PyExc_ValueError, "embedded null character");
+        return NULL;
+    }
+    return utf8;
 }
 
 /*
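
The check added to PyUnicode_AsUTF8() works because strlen() stops at the first null byte, while the size reported by PyUnicode_AsUTF8AndSize() covers the whole encoded string. A minimal standalone sketch of that comparison, written without the CPython API so it can be compiled on its own. `has_embedded_nul` is a hypothetical name used only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a CPython function): returns 1 if the buffer
   of `size` bytes contains a null byte before its end, 0 otherwise.
   The buffer must be followed by a terminating null byte, as the UTF-8
   buffers returned by PyUnicode_AsUTF8AndSize() are. */
static int
has_embedded_nul(const char *buf, size_t size)
{
    assert(buf != NULL);
    /* strlen() stops at the first null byte, so it differs from `size`
       exactly when a null byte sits inside the data. */
    return strlen(buf) != size;
}
```

For example, `has_embedded_nul("abc", 3)` is 0, while `has_embedded_nul("abc\0def", 7)` is 1. A trailing null character inside the data, as in `has_embedded_nul("abc\0", 4)`, also counts as embedded, which matches the 'abc\0' case in the updated test.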
