Skip to content

Docs: C API: Clarify what happens when null bytes are passed to PyUnicode_AsUTF8 #127458

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jan 20, 2025

Conversation

ZeroIntensity
Copy link
Member

@ZeroIntensity ZeroIntensity commented Nov 30, 2024

Per discussion from a few days ago, PyUnicode_AsUTF8 is missing documentation for the embedded null character gotcha. There was some pushback about marking it as a warning, so I've left it as a note for now.


📚 Documentation preview 📚: https://cpython-previews--127458.org.readthedocs.build/

Copy link
Contributor

@StanFromIreland StanFromIreland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small change to wording

@TeamSpen210
Copy link

I'm not sure if it's good to say that the function "doesn't handle" null bytes - the bytes are there and accessible, the problem is that strlen() isn't able to handle them. Maybe it'd be better to say because strings can contain null bytes, strlen() should not be used, you should fetch that from the object?

@ZeroIntensity
Copy link
Member Author

Yeah, "doesn't handle" as in doesn't raise an error or strip them. I don't think it's worth changing the wording, but I'll let others weigh in.

@ZeroIntensity ZeroIntensity changed the title Docs: Clarify what happens when null bytes are passed to PyUnicode_AsUTF8 Docs: C API: Clarify what happens when null bytes are passed to PyUnicode_AsUTF8 Dec 16, 2024
`null bytes <https://en.wikipedia.org/wiki/Null_character>`_ embedded within
*unicode*. As a result, strings containing null bytes will remain in the returned
string, which some C functions might interpret as the end of the string, leading to
truncation. When handling user input, it is recommended to use :c:func:`PyUnicode_AsUTF8AndSize`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that "user input" is useful here. I would prefer to say something like: "If truncation is an issue, ..."

@@ -1035,6 +1035,15 @@ These are the UTF-8 codec APIs:

As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.

.. warning::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A warning is a strong signal. Maybe a note is enough?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had it as a note, but I think a warning is what we want here. I only discovered this quirk of PyUnicode_AsUTF8 because it caused a crash in the _interpreters module. Things that can potentially cause security issues should probably get a warning, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that notes about security are usually documented in red as warnings.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ZeroIntensity
Copy link
Member Author

Friendly ping @vstinner :)

@vstinner vstinner merged commit e792f4b into python:main Jan 20, 2025
29 checks passed
@vstinner
Copy link
Member

Merged, thank you.

@ZeroIntensity
Copy link
Member Author

How do we feel about backporting?

@ZeroIntensity ZeroIntensity deleted the asutf8-null-chars branch January 20, 2025 15:55
@vstinner vstinner added needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes labels Jan 20, 2025
@miss-islington-app
Copy link

Thanks @ZeroIntensity for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

@miss-islington-app
Copy link

Thanks @ZeroIntensity for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jan 20, 2025
…code_AsUTF8` (pythonGH-127458)

(cherry picked from commit e792f4b)

Co-authored-by: Peter Bierma <[email protected]>
Co-authored-by: Stan U. <[email protected]>
Co-authored-by: Tomas R. <[email protected]>
Co-authored-by: Victor Stinner <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Jan 20, 2025

GH-129080 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Jan 20, 2025
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jan 20, 2025
…code_AsUTF8` (pythonGH-127458)

(cherry picked from commit e792f4b)

Co-authored-by: Peter Bierma <[email protected]>
Co-authored-by: Stan U. <[email protected]>
Co-authored-by: Tomas R. <[email protected]>
Co-authored-by: Victor Stinner <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Jan 20, 2025

GH-129081 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Jan 20, 2025
@vstinner
Copy link
Member

How do we feel about backporting?

I backported the change to 3.12 and 3.13 branches.

vstinner added a commit that referenced this pull request Jan 20, 2025
… `PyUnicode_AsUTF8` (GH-127458) (#129080)

Docs C API: Clarify what happens when null bytes are passed to `PyUnicode_AsUTF8` (GH-127458)
(cherry picked from commit e792f4b)

Co-authored-by: Peter Bierma <[email protected]>
Co-authored-by: Stan U. <[email protected]>
Co-authored-by: Tomas R. <[email protected]>
Co-authored-by: Victor Stinner <[email protected]>
vstinner added a commit that referenced this pull request Jan 20, 2025
… `PyUnicode_AsUTF8` (GH-127458) (#129081)

Docs C API: Clarify what happens when null bytes are passed to `PyUnicode_AsUTF8` (GH-127458)
(cherry picked from commit e792f4b)

Co-authored-by: Peter Bierma <[email protected]>
Co-authored-by: Stan U. <[email protected]>
Co-authored-by: Tomas R. <[email protected]>
Co-authored-by: Victor Stinner <[email protected]>
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 21, 2025
…code_AsUTF8` (python#127458)

Co-authored-by: Stan U. <[email protected]>
Co-authored-by: Tomas R. <[email protected]>
Co-authored-by: Victor Stinner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants