HTMLParser raises exception on some inputs #77057

hannob · 2018-02-19T19:52:16Z

BPO	32876
Nosy	@ezio-melotti, @stevendaprano, @berkerpeksag, @hannob, @iritkatriel
PRs	gh-77057: Fix handling of invalid markup declarations in HTMLParser #9295
Superseder	bpo-31844: HTMLParser: undocumented not implemented method

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/ezio-melotti'
closed_at = None
created_at = <Date 2018-02-19.19:52:16.326>
labels = ['type-bug', 'library', '3.11']
title = 'HTMLParser raises exception on some inputs'
updated_at = <Date 2022-01-14.14:32:47.114>
user = 'https://github.com/hannob'

bugs.python.org fields:

activity = <Date 2022-01-14.14:32:47.114>
actor = 'iritkatriel'
assignee = 'ezio.melotti'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2018-02-19.19:52:16.326>
creator = 'hanno'
dependencies = []
files = []
hgrepos = []
issue_num = 32876
keywords = ['patch']
message_count = 10.0
messages = ['312363', '312379', '312380', '312381', '323971', '325330', '401507', '410559', '410561', '410563']
nosy_count = 5.0
nosy_names = ['ezio.melotti', 'steven.daprano', 'berker.peksag', 'hanno', 'iritkatriel']
pr_nums = ['9295']
priority = 'normal'
resolution = None
stage = 'resolved'
status = 'open'
superseder = '31844'
type = 'behavior'
url = 'https://bugs.python.org/issue32876'
versions = ['Python 3.11']

hannob · 2018-02-19T19:52:16Z

I noticed that the HTMLParser will raise an exception on some inputs.
I'm not sure what the expectations here are, but given that real-world HTML often contains all kinds of broken content I would assume an HTMLParser to always try to parse a document and not be interrupted by an exception if an error occurs.

Here's a minified example:

#!/usr/bin/env python3
import html.parser
html.parser.HTMLParser().feed("<![\n")

However, I actually stumbled upon HTMLParser failing on a real webpage:
https://kafanews.com/

Exception of minified example:

Traceback (most recent call last):
  File "./foo.py", line 5, in <module>
    html.parser.HTMLParser().feed("<![\n")
  File "/usr/lib64/python3.6/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib64/python3.6/html/parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/lib64/python3.6/html/parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/lib64/python3.6/_markupbase.py", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "/usr/lib64/python3.6/_markupbase.py", line 391, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "/usr/lib64/python3.6/_markupbase.py", line 34, in error
    "subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
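
To check whether a given interpreter is affected, the minified example can be wrapped in a small probe. This is only a sketch: `probe_markup_declaration` is a hypothetical helper name, and catching `AssertionError` alongside `NotImplementedError` assumes that other Python versions may fail with a different exception type than the one in the traceback above.

```python
import html.parser


def probe_markup_declaration(markup="<![\n"):
    """Feed `markup` to a fresh HTMLParser and report what happens.

    Hypothetical helper for illustration; not part of the stdlib.
    Returns "ok" if no exception escaped feed(), otherwise the name
    of the exception class that was raised.
    """
    parser = html.parser.HTMLParser()
    try:
        parser.feed(markup)
    except (NotImplementedError, AssertionError) as exc:
        # Assumption: these are the two failure modes worth catching;
        # which one (if any) occurs depends on the Python version.
        return type(exc).__name__
    return "ok"


if __name__ == "__main__":
    print(probe_markup_declaration())
```

The result depends on the interpreter, so no expected output is given here.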

stevendaprano · 2018-02-19T23:02:09Z

The stdlib HTML parser requires correct HTML.

To parse broken HTML, as you find in the real world, you need a third-party library like BeautifulSoup. BeautifulSoup is much more complex (about 7-8 times as many LOC) but can handle nearly anything a browser can.

I doubt the stdlib will ever compete with BeautifulSoup.

hannob · 2018-02-19T23:05:01Z

Actually BeautifulSoup also uses the python html parser in the backend, so it has the same problem. (It can use alternative backends, but the python parser is the default and they also describe it as "lenient", which I would interpret as "it can handle that".)

ezio-melotti · 2018-02-19T23:07:21Z

The HTMLParser has been updated to handle HTML5 and should never fail parsing a document, so if it raises an error it's probably a bug.

berkerpeksag · 2018-08-23T18:31:53Z

bpo-34480 is another relevant issue. The HTMLParser class doesn't have an error() method and it doesn't raise any exceptions itself, but its base class still calls error(). I think there is a compatibility problem between the html.parser.HTMLParser and _markupbase.ParserBase classes. See https://bugs.python.org/msg323966 for more details about this.
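
A quick way to see where error() is defined on a particular interpreter is to inspect both classes directly. This is a minimal sketch: `error_method_status` is a hypothetical helper, `_markupbase` is a private module whose contents may change, and the result will differ between Python versions.

```python
import html.parser
import _markupbase  # private stdlib module; may change between versions


def error_method_status():
    """Report which class defines error() in its own namespace.

    Hypothetical helper for illustration only. vars() is used instead
    of hasattr() so that a method inherited from ParserBase is not
    counted as being defined on HTMLParser.
    """
    return {
        "HTMLParser": "error" in vars(html.parser.HTMLParser),
        "ParserBase": "error" in vars(_markupbase.ParserBase),
    }


if __name__ == "__main__":
    for name, defined in error_method_status().items():
        print(f"{name} defines error(): {defined}")
```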

ezio-melotti · 2018-09-14T07:28:22Z

There are at least a couple of issues here.

The first one is the way the parser handles '<![...'. The linked page contains markup like '<![STAT]-[USER-ACTIVE]!>' and since the parser currently checks for '<![' only, _markupbase.py:parse_marked_section gets called and an error gets incorrectly raised.
However, "8.2.4.42. Markup declaration open state" states that after consuming '<!', there are only 4 valid paths forward:

1) if we have '<!--', it's a comment;
2) if we have '<!doctype', it's a doctype declaration;
3) if we have '<![CDATA[', it's a CDATA section;
4) if it's something else, it's a bogus comment.

The above example should therefore fall into 4), and be treated like a bogus comment.
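
The four paths can be sketched as a standalone classifier. This is an illustration only, not the code in html.parser or in the PR: `classify_markup_declaration` is a hypothetical function, and it ignores the spec's extra condition that CDATA sections are only recognized in foreign content.

```python
def classify_markup_declaration(rawdata, i=0):
    """Classify the markup declaration that starts at rawdata[i].

    Hypothetical helper mirroring the four paths listed above.
    Simplification: CDATA is accepted anywhere, whereas the spec
    only allows it in foreign (non-HTML) content.
    """
    if not rawdata.startswith("<!", i):
        raise ValueError("expected '<!' at position %d" % i)
    if rawdata.startswith("<!--", i):
        return "comment"
    # 'doctype' is matched case-insensitively.
    if rawdata[i:i + 9].lower() == "<!doctype":
        return "doctype"
    # '[CDATA[' is matched case-sensitively.
    if rawdata.startswith("<![CDATA[", i):
        return "cdata"
    # Everything else, including '<![STAT]-[USER-ACTIVE]!>'.
    return "bogus comment"


if __name__ == "__main__":
    for sample in ("<!-- x -->", "<!DOCTYPE html>", "<![CDATA[x]]>",
                   "<![STAT]-[USER-ACTIVE]!>", "<![\n"):
        print(repr(sample), "->", classify_markup_declaration(sample))
```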

PR-9295 changes parse_html_declaration() to align to the specs and implement path 3), resulting in the webpage being parsed without errors (the invalid markup is considered as a bogus comment).

The second issue is about an EOF in the middle of a bogus markup declaration, like in the minified example provided by OP ("<![\n"). In this case the comment should still be emitted ('[\n'), but currently nothing gets emitted. I'll look more into it either tomorrow or later this month and update the PR accordingly (or perhaps I'll open a separate issue).
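
To observe what actually gets emitted for a truncated declaration, a subclass can record the handler calls. This is a sketch under assumptions: `EventRecorder` and `record_events` are hypothetical names, and the events produced for "<![\n" depend on the Python version, so none are asserted here.

```python
import html.parser


class EventRecorder(html.parser.HTMLParser):
    """Hypothetical subclass that records selected handler calls."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_data(self, data):
        self.events.append(("data", data))

    def unknown_decl(self, data):
        self.events.append(("unknown_decl", data))


def record_events(markup):
    """Feed `markup`, signal EOF with close(), and return the events.

    An exception raised by the parser is recorded as an ("error", name)
    event instead of propagating.
    """
    parser = EventRecorder()
    try:
        parser.feed(markup)
        parser.close()  # close() tells the parser that EOF was reached
    except (NotImplementedError, AssertionError) as exc:
        parser.events.append(("error", type(exc).__name__))
    return parser.events


if __name__ == "__main__":
    print(record_events("<![\n"))
```

Calling close() matters here: without it the parser may simply keep the incomplete declaration buffered while waiting for more input.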

iritkatriel · 2021-09-09T18:09:00Z

I get a different error now:

>>> import html.parser
>>> html.parser.HTMLParser().feed("<![\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 110, in feed
    self.goahead(0)
    ^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 178, in goahead
    k = self.parse_html_declaration(i)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 263, in parse_html_declaration
    return self.parse_marked_section(i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 144, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 390, in _scan_name
    raise AssertionError(
    ^^^^^^^^^^^^^^^^^^^^^
AssertionError: expected name token at '<![\n'

iritkatriel · 2022-01-14T14:02:16Z

The error() method was removed in bpo-31844.

hannob · 2022-01-14T14:29:30Z

Now the example code raises an AssertionError(). Is that intended? I don't think that's any better.

I usually wouldn't expect an HTML parser to raise any error when passed a string; I'd expect it to do fault-tolerant parsing instead. And if some inputs are expected to raise exceptions, I think that should at least be properly documented.

iritkatriel · 2022-01-14T14:32:47Z

Reopening to discuss what the correct behaviour should be.

serhiy-storchaka · 2025-05-10T14:47:25Z

@ezio-melotti's PR has been merged.

A truncated comment, doctype declaration, or CDATA section now produces data.

CPython fixed python/cpython#77057 and backported the fix to >= 3.13.4. This recently showed up in Debian unstable when we uploaded the latest 3.13 snapshot.

hannob mannequin added the stdlib (Python modules in the Lib dir) and type-bug (An unexpected behavior, bug, or error) labels Feb 19, 2018

ezio-melotti self-assigned this Feb 26, 2018

ezio-melotti added the 3.7 (EOL) and 3.8 (EOL) labels Sep 14, 2018

iritkatriel closed this as completed Jan 14, 2022

iritkatriel added the 3.11 (only security fixes) label and removed the 3.7 (EOL) and 3.8 (EOL) labels Jan 14, 2022

iritkatriel reopened this Jan 14, 2022

ezio-melotti transferred this issue from another repository Apr 10, 2022

ezio-melotti added this to html.parser issues May 2, 2022

ezio-melotti moved this to Todo in html.parser issues May 2, 2022

bedevere-app bot mentioned this issue Jan 17, 2024

gh-77057: Fix handling of invalid markup declarations in HTMLParser #9295

Merged

This was referenced May 7, 2025

_markupbase.py fails with TypeError on invalid keyword in marked section #81928

Closed

Not Implemented Error in stdLib HTMLParser #82754

Closed

serhiy-storchaka added a commit that referenced this issue May 10, 2025

gh-77057: Fix handling of invalid markup declarations in HTMLParser (GH-9295)

76c0b01

Co-authored-by: Serhiy Storchaka

bedevere-app bot mentioned this issue May 10, 2025

[3.14] gh-77057: Fix handling of invalid markup declarations in HTMLParser (GH-9295) #133833

Merged

bedevere-app bot mentioned this issue May 10, 2025

[3.13] gh-77057: Fix handling of invalid markup declarations in HTMLParser (GH-9295) #133834

Merged

serhiy-storchaka closed this as completed May 10, 2025

github-project-automation bot moved this from Todo to Done in html.parser issues May 10, 2025

serhiy-storchaka added the 3.13 (bugs and security fixes), 3.14 (bugs and security fixes), and 3.15 (new features, bugs and security fixes) labels and removed the 3.11 (only security fixes) label May 10, 2025
