HTMLParser raises exception on some inputs #77057
Comments
I noticed that the HTMLParser will raise an exception on some inputs. Here's a minified example:
#!/usr/bin/env python3
import html.parser
html.parser.HTMLParser().feed("<![\n")
However, I actually stumbled upon HTMLParser failing on a real webpage:
Exception of minified example:
Traceback (most recent call last):
File "./foo.py", line 5, in <module>
html.parser.HTMLParser().feed("<![\n")
File "/usr/lib64/python3.6/html/parser.py", line 111, in feed
self.goahead(0)
File "/usr/lib64/python3.6/html/parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "/usr/lib64/python3.6/html/parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "/usr/lib64/python3.6/_markupbase.py", line 149, in parse_marked_section
sectName, j = self._scan_name( i+3, i )
File "/usr/lib64/python3.6/_markupbase.py", line 391, in _scan_name
% rawdata[declstartpos:declstartpos+20])
File "/usr/lib64/python3.6/_markupbase.py", line 34, in error
"subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error() |
The stdlib HTML parser requires correct HTML. To parse broken HTML, as you find in the real world, you need a third-party library like BeautifulSoup. BeautifulSoup is much more complex (about 7-8 times as many LOC) but can handle nearly anything a browser can. I doubt the stdlib will ever compete with BeautifulSoup. |
Actually BeautifulSoup also uses the python html parser in the backend, so it has the same problem. (It can use alternative backends, but the python parser is the default and they also describe it as "lenient", which I would interpret as "it can handle that".) |
The HTMLParser has been updated to handle HTML5 and should never fail parsing a document, so if it raises an error it's probably a bug. |
bpo-34480 is another relevant issue. The HTMLParser class doesn't have an error() method and it doesn't raise any exceptions itself, but its base class still does. I think there is a compatibility problem between the html.parser.HTMLParser and _markupbase.ParserBase classes. See https://bugs.python.org/msg323966 for more details about this. |
There are at least a couple of issues here. The first one is the way the parser handles '<![...'. The linked page contains markup like '<![STAT]-[USER-ACTIVE]!>' and since the parser currently checks for '<![' only, _markupbase.py:parse_marked_section gets called and an error gets incorrectly raised.
The above example should therefore fall into 4), and be treated like a bogus comment. PR-9295 changes parse_html_declaration() to align to the specs and implement path 3), resulting in the webpage being parsed without errors (the invalid markup is considered as a bogus comment). The second issue is about an EOF in the middle of a bogus markup declaration, like in the minified example provided by OP ("<![\n"). In this case the comment should still be emitted ('[\n'), but currently nothing gets emitted. I'll look more into it either tomorrow or later this month and update the PR accordingly (or perhaps I'll open a separate issue). |
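The numbered paths mentioned above appear to correspond to the branches the HTML specification takes after seeing '<!' (its "markup declaration open" state). As a minimal sketch of that decision, and not the code from PR-9295, the branching could look like this. The function name classify_markup_declaration and the in_foreign_content flag are hypothetical and exist only for illustration.

```python
def classify_markup_declaration(rawdata, i, in_foreign_content=False):
    """Classify the construct that starts with '<!' at rawdata[i:].

    Hypothetical helper, written from a reading of the HTML spec's
    "markup declaration open" state; it is not part of html.parser.
    """
    assert rawdata.startswith("<!", i), "expected '<!' at position i"
    rest = rawdata[i + 2:]
    if rest.startswith("--"):
        return "comment"
    # The spec matches DOCTYPE case-insensitively.
    if rest[:7].lower() == "doctype":
        return "doctype"
    # '[CDATA[' is case-sensitive and only opens a CDATA section in
    # foreign (SVG/MathML) content; in HTML content it is a bogus comment.
    if rest.startswith("[CDATA["):
        return "cdata" if in_foreign_content else "bogus comment"
    # Everything else, including '<![STAT]-[USER-ACTIVE]!>', is a bogus comment.
    return "bogus comment"


print(classify_markup_declaration("<![STAT]-[USER-ACTIVE]!>", 0))
```

Under this reading, checking for '<![' alone is too broad: only the exact string '[CDATA[' selects the marked-section branch.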
I get a different error now:
>>> import html.parser
>>> html.parser.HTMLParser().feed("<![\n")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 110, in feed
self.goahead(0)
^^^^^^^^^^^^^^^
File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 178, in goahead
k = self.parse_html_declaration(i)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 263, in parse_html_declaration
return self.parse_marked_section(i)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 144, in parse_marked_section
sectName, j = self._scan_name( i+3, i )
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 390, in _scan_name
raise AssertionError(
^^^^^^^^^^^^^^^^^^^^^
AssertionError: expected name token at '<![\n' |
The error() method was removed in bpo-31844. |
Now the example code raises an AssertionError. Is that intended? I don't think that's any better. I usually wouldn't expect an HTML parser to raise any error if you pass it a string, but instead to do fault-tolerant parsing. And if some inputs are expected to raise exceptions, I think this should at least be properly documented. |
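Until a fixed Python version is available, callers who need fault-tolerant behaviour can guard the call themselves. The following is only a sketch of such a workaround, assuming the two exception types seen in the tracebacks in this thread (NotImplementedError and AssertionError). CommentCollector and tolerant_parse are hypothetical names, not part of the stdlib.

```python
import html.parser


class CommentCollector(html.parser.HTMLParser):
    """Hypothetical helper: records every comment the parser reports."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def tolerant_parse(text):
    """Parse text and return (comments, error); error is None on success."""
    parser = CommentCollector()
    try:
        parser.feed(text)
        parser.close()  # process any data still buffered at end of input
    except (AssertionError, NotImplementedError) as exc:
        # Affected versions raise one of these from _markupbase on '<!['.
        return parser.comments, exc
    return parser.comments, None


comments, error = tolerant_parse("<![\n")
print(comments, error)
```

What gets printed depends on the Python version: affected versions report the exception, while versions that include the fix should return without an error.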
Reopening to discuss what the correct behaviour should be. |
…H-9295) Co-authored-by: Serhiy Storchaka <[email protected]>
…rser (pythonGH-9295) (cherry picked from commit 76c0b01) Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
@ezio-melotti's PR has been merged. Truncated comment, doctype declaration or CDATA section now produce a |
…arser (GH-9295) (GH-133834) (cherry picked from commit 76c0b01) Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…arser (GH-9295) (GH-133833) (cherry picked from commit 76c0b01) Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
CPython fixed python/cpython#77057 and backported this to >= 3.13.4. This recently showed up in Debian unstable when we uploaded the latest 3.13 snapshot.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.