Html Reader Preserve Unicode Whitespace Characters #4106
Merged
Fix #1284, which was closed as stale in 2019, but which I will now reopen. Html Reader converts Unicode whitespace characters in a DOM text node to space. However, Html treats only space, tab, CR, LF, vertical tab, and form-feed as whitespace. Using a regular expression with the `u` (Unicode) modifier causes a number of other characters to be converted to space inappropriately. The issue mentions "ideographic space" in particular, stating that it is used for formatting and should be preserved. "Non-breaking space" is used in the same way and should also be preserved. An exception is made for a text node consisting of a single non-breaking space, since that is used as a placeholder by Html Writer; my guess is that this is why the Unicode modifier was used in the first place.
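The difference between Unicode-aware and ASCII-only whitespace matching can be sketched in a few lines. This is an illustrative Python analogue, not the PHP code changed in this PR: the function names are hypothetical, and Python's `re.ASCII` flag stands in for dropping the `u` modifier. The description does not say what the reader does with a lone non-breaking space, so treating it as a plain space here is an assumption.

```python
import re

NBSP = "\u00a0"               # non-breaking space
IDEOGRAPHIC_SPACE = "\u3000"  # ideographic space, used for formatting

# For str patterns, Python's \s is Unicode-aware by default, so it also
# matches NBSP and ideographic space.
UNICODE_WS = re.compile(r"\s+")

# With re.ASCII, \s matches only space, tab, LF, CR, vertical tab, form-feed.
ASCII_WS = re.compile(r"\s+", re.ASCII)


def collapse_unicode(text: str) -> str:
    """Old-style behaviour: every Unicode whitespace run becomes one space."""
    return UNICODE_WS.sub(" ", text)


def collapse_ascii(text: str) -> str:
    """Sketch of the new behaviour: only ASCII whitespace is collapsed."""
    # Assumption: a text node that is exactly one NBSP is a writer
    # placeholder and is treated as a plain space.
    if text == NBSP:
        return " "
    return ASCII_WS.sub(" ", text)


sample = "a" + IDEOGRAPHIC_SPACE + "b\t\n c" + NBSP + "d"
print(repr(collapse_unicode(sample)))
print(repr(collapse_ascii(sample)))
```

In this sketch the ASCII-only version leaves the ideographic and non-breaking spaces inside the text untouched, while still collapsing the tab, newline, and space run.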