Html Reader Preserve Unicode Whitespace Characters #4106

Merged 1 commit on Jul 30, 2024

Conversation

oleibman
Collaborator

Fix #1284, which was closed as stale in 2019, but which I will now reopen. Html Reader converts *Unicode* whitespace characters in a DOM text node to a space. However, HTML treats only space, tab, CR, LF, vertical tab, and form-feed as whitespace. Using a regular expression with the `u` (Unicode) modifier causes a number of other characters to be converted to a space inappropriately. The issue mentions "ideographic space" in particular, stating that it is used for formatting and should be preserved. "Non-breaking space" is used in the same way and should also be preserved. An exception is made for a text node consisting of a single non-breaking space, since Html Writer uses that as a placeholder; my guess is that this is why the Unicode modifier was used in the first place.
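To illustrate the difference described above, here is a minimal sketch. It is not the PhpSpreadsheet code (which is PHP); it uses Python's `re` module, where matching `\s` with `re.ASCII` is roughly analogous to dropping the `u` modifier. The function names are hypothetical, and mapping a lone non-breaking space to a single space is an assumption about how the placeholder exception behaves.

```python
import re

NBSP = "\u00a0"               # non-breaking space
IDEOGRAPHIC_SPACE = "\u3000"  # ideographic space


def collapse_unicode_whitespace(text: str) -> str:
    """Analogue of the old behaviour: every Unicode whitespace run becomes one space."""
    return re.sub(r"\s+", " ", text)


def collapse_ascii_whitespace(text: str) -> str:
    """Analogue of the new behaviour: only space, tab, CR, LF, VT and FF are collapsed."""
    # Assumption: a node that is exactly one NBSP is a writer placeholder,
    # so it is still turned into a plain space rather than preserved.
    if text == NBSP:
        return " "
    return re.sub(r"\s+", " ", text, flags=re.ASCII)


if __name__ == "__main__":
    sample = "a" + IDEOGRAPHIC_SPACE + "b \t\n c" + NBSP + "d"
    print(repr(collapse_unicode_whitespace(sample)))
    print(repr(collapse_ascii_whitespace(sample)))
```

With the ASCII-only variant, the ideographic space and the embedded non-breaking space survive, while ordinary runs of space, tab, and newline are still collapsed.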

This is:

  • a bugfix
  • a new feature
  • refactoring
  • additional unit tests

Checklist:

  • Changes are covered by unit tests
    • Changes are covered by existing unit tests
    • New unit tests have been added
  • Code style is respected
  • Commit message explains why the change is made (see https://github.com/erlang/otp/wiki/Writing-good-commit-messages)
  • CHANGELOG.md contains a short summary of the change and a link to the pull request if applicable
  • Documentation is updated as necessary

oleibman added this pull request to the merge queue on Jul 30, 2024
Merged via the queue into PHPOffice:master with commit ae2c3ea on Jul 30, 2024
13 checks passed
oleibman deleted the issue1284 branch on November 10, 2024 19:40
Development

Successfully merging this pull request may close these issues.

I think that it must be replaced spaces without the Unicode option when outputting Excel.