Skip to content

gh-110913: Fix WindowsConsoleIO chunking of UTF-8 text #111007

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

sorgloomer
Copy link
Contributor

@sorgloomer sorgloomer commented Oct 17, 2023

@ghost
Copy link

ghost commented Oct 17, 2023

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app
Copy link

bedevere-app bot commented Oct 17, 2023

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@sorgloomer sorgloomer force-pushed the issue-110913-windows-console-utf8-boundary branch from b8a5110 to b66a4e4 Compare October 17, 2023 23:43
@bedevere-app
Copy link

bedevere-app bot commented Oct 17, 2023

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@sorgloomer sorgloomer force-pushed the issue-110913-windows-console-utf8-boundary branch from b66a4e4 to 709fb94 Compare October 17, 2023 23:45
@bedevere-app
Copy link

bedevere-app bot commented Oct 17, 2023

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@sorgloomer sorgloomer force-pushed the issue-110913-windows-console-utf8-boundary branch from 709fb94 to 2be4b92 Compare October 17, 2023 23:46
@bedevere-app
Copy link

bedevere-app bot commented Oct 17, 2023

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@sorgloomer sorgloomer force-pushed the issue-110913-windows-console-utf8-boundary branch from 2be4b92 to 2b0bf5b Compare October 18, 2023 00:42
@bedevere-app
Copy link

bedevere-app bot commented Oct 18, 2023

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@zooba zooba added needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes labels Oct 18, 2023
@zooba
Copy link
Member

zooba commented Oct 18, 2023

I think we should have a NEWS entry for this, just to acknowledge that we changed the fix. Something like WindowsConsoleIO now correctly chunks large buffers without splitting up UTF-8 sequences

The change looks fine to me, but I'd like one more ACK before merging in case my head still isn't clear.

@serhiy-storchaka serhiy-storchaka self-requested a review October 19, 2023 21:29
@sorgloomer
Copy link
Contributor Author

Thanks! Added the NEWS entry.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it even possible to write tests for this?

@zooba
Copy link
Member

zooba commented Oct 19, 2023

Is it even possible to write tests for this?

Not without refactoring the entire thing (e.g. to make the actual WriteConsoleW call encoding-unaware and create a specific TextIOWrapper just for this case that does the chunking). Or to make it possible to require a WindowsConsoleIO instance to talk to a file (which then could not use WriteConsoleW I believe, though it may be possible to trick it).

It's a harmless enough issue that I think we're okay. Misunderstanding the UTF-8 format is what the problem was this time, and that's been triple checked now.

@serhiy-storchaka
Copy link
Member

I'm just wondering if it's possible to get invalid UTF-8 here (using "surrogateescape"?). If yes, then we should use more sophisticated way to find the longest valid prefix of invalid sequence to meet recomendations for handling invalid UTF-8. Otherwise LGTM.

@zooba
Copy link
Member

zooba commented Oct 20, 2023

If it's undecodable by MultiByteToWideChar at this point, we'll have raised an error and never entered into this branch (wlen will be 0). And even if it did, it wouldn't be rendered correctly anyway and we aren't really making things worse by rendering it slightly more incorrectly.

Thanks for the review!

@zooba zooba merged commit 11312ea into python:main Oct 20, 2023
@miss-islington-app
Copy link

Thanks @sorgloomer for the PR, and @zooba for merging it 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Oct 20, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Oct 20, 2023
@bedevere-app
Copy link

bedevere-app bot commented Oct 20, 2023

GH-111108 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Oct 20, 2023
@bedevere-app
Copy link

bedevere-app bot commented Oct 20, 2023

GH-111109 is a backport of this pull request to the 3.11 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.11 only security fixes label Oct 20, 2023
zooba pushed a commit that referenced this pull request Oct 20, 2023
(cherry picked from commit 11312ea)

Co-authored-by: Tamás Hegedűs <[email protected]>
zooba pushed a commit that referenced this pull request Oct 20, 2023
(cherry picked from commit 11312ea)

Co-authored-by: Tamás Hegedűs <[email protected]>
Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants