Fix PHPOffice#3995. Fix PHPOffice#866. Fix PHPOffice#1681. PHP DOM `loadHTML` defaults to character set ISO-8859-1, but our data is UTF-8. So the Html Reader alters its HTML so that `loadHTML` will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, `loadHTML` uses the correct non-default charset when the charset is specified in a meta tag, or when the HTML starts with a BOM. So it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag.
This will allow us to use:
- UTF-8 files or snippets without BOM, with or without charset
- UTF-8 files with BOM (charset should not be specified and will be ignored if it is)
- UTF-16 files with BOM (charset should not be specified and will be ignored if it is)
- all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets.
We cannot use:
- UTF-16BE or UTF-16LE declared in a meta tag
- UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it)
- charsets unknown to `loadHTML`, or charsets which are not ASCII-compatible (EBCDIC?)
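As a rough illustration of the rule described above (alter non-ASCII characters only when there is no BOM and no charset), the check could look something like the following. This is a sketch, not the code in this commit: the function name `needsNonAsciiRewrite` and the regular expression are made up for the example, and only the UTF-8 and UTF-16 BOMs are considered.

```php
<?php

/**
 * Hypothetical helper (not the library's actual code): decide whether
 * non-ASCII characters should be altered before the HTML is handed to
 * DOMDocument::loadHTML().
 */
function needsNonAsciiRewrite(string $html): bool
{
    // (a) Data starting with a BOM is left alone.
    // Assumption: only UTF-8, UTF-16LE and UTF-16BE BOMs are checked.
    $boms = ["\xEF\xBB\xBF", "\xFF\xFE", "\xFE\xFF"];
    foreach ($boms as $bom) {
        if (strncmp($html, $bom, strlen($bom)) === 0) {
            return false;
        }
    }

    // (b) Data declaring a charset is left alone.
    // Loose match on purpose: it can also hit "charset=" in ordinary text.
    if (preg_match('/\bcharset\s*=\s*["\']?[\w-]+/i', $html) === 1) {
        return false;
    }

    // No BOM and no charset: the data is assumed to be UTF-8 and needs the rewrite.
    return true;
}

var_dump(needsNonAsciiRewrite("<p>caf\xC3\xA9</p>"));                      // bool(true)
var_dump(needsNonAsciiRewrite("\xEF\xBB\xBF<p>caf\xC3\xA9</p>"));          // bool(false)
var_dump(needsNonAsciiRewrite('<meta charset="windows-1252"><p>x</p>'));   // bool(false)
```

When the function returns `false`, the HTML would be passed to `loadHTML` unchanged, relying on the BOM or the declared charset.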
I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.
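To make the imperfection concrete, here is a small sketch comparing a loose search for `charset` anywhere in the data with a slightly stricter one restricted to `<meta ...>` tags. Both function names and both patterns are hypothetical and written only for this example; neither is claimed to be what the Html Reader or XmlScanner actually uses.

```php
<?php

/** Hypothetical loose detector: finds "charset=..." anywhere in the data. */
function looseCharset(string $html): ?string
{
    return preg_match('/\bcharset\s*=\s*["\']?([\w-]+)/i', $html, $m) === 1 ? $m[1] : null;
}

/** Hypothetical stricter detector: only looks inside a <meta ...> tag. */
function metaCharset(string $html): ?string
{
    $pattern = '/<meta\b[^>]*\bcharset\s*=\s*["\']?\s*([\w-]+)/i';

    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// "charset=" appears in body text only, with no meta tag at all.
$text = '<p>Set charset=utf-8 in your editor.</p>';
var_dump(looseCharset($text)); // string(5) "utf-8"  (false positive)
var_dump(metaCharset($text));  // NULL

// A real declaration is found by both.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">';
var_dump(looseCharset($meta)); // string(10) "ISO-8859-1"
var_dump(metaCharset($meta));  // string(10) "ISO-8859-1"
```

Even the stricter pattern is still fooled by a meta tag inside an HTML comment or a script string, which is why getting it perfect would take something close to a real HTML parser.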