HTMLParser raises exception on some inputs #77057

Closed
hannob mannequin opened this issue Feb 19, 2018 · 11 comments
Assignees
Labels
3.13 bugs and security fixes 3.14 bugs and security fixes 3.15 new features, bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@hannob
Mannequin

hannob mannequin commented Feb 19, 2018

BPO 32876
Nosy @ezio-melotti, @stevendaprano, @berkerpeksag, @hannob, @iritkatriel
PRs
  • gh-77057: Fix handling of invalid markup declarations in HTMLParser #9295
Superseder
  • bpo-31844: HTMLParser: undocumented not implemented method

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = None
    created_at = <Date 2018-02-19.19:52:16.326>
    labels = ['type-bug', 'library', '3.11']
    title = 'HTMLParser raises exception on some inputs'
    updated_at = <Date 2022-01-14.14:32:47.114>
    user = 'https://github.com/hannob'

    bugs.python.org fields:

    activity = <Date 2022-01-14.14:32:47.114>
    actor = 'iritkatriel'
    assignee = 'ezio.melotti'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2018-02-19.19:52:16.326>
    creator = 'hanno'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 32876
    keywords = ['patch']
    message_count = 10.0
    messages = ['312363', '312379', '312380', '312381', '323971', '325330', '401507', '410559', '410561', '410563']
    nosy_count = 5.0
    nosy_names = ['ezio.melotti', 'steven.daprano', 'berker.peksag', 'hanno', 'iritkatriel']
    pr_nums = ['9295']
    priority = 'normal'
    resolution = None
    stage = 'resolved'
    status = 'open'
    superseder = '31844'
    type = 'behavior'
    url = 'https://bugs.python.org/issue32876'
    versions = ['Python 3.11']

    @hannob
    Mannequin Author

    hannob mannequin commented Feb 19, 2018

    I noticed that the HTMLParser will raise an exception on some inputs.
    I'm not sure what the expectations here are, but given that real-world HTML often contains all kinds of broken content I would assume an HTMLParser to always try to parse a document and not be interrupted by an exception if an error occurs.

    Here's a minified example:

    #!/usr/bin/env python3
    import html.parser
    html.parser.HTMLParser().feed("<![\n")

    However, I actually stumbled upon HTMLParser failing on a real webpage:
    https://kafanews.com/

    Exception of minified example:

    Traceback (most recent call last):
      File "./foo.py", line 5, in <module>
        html.parser.HTMLParser().feed("<![\n")
      File "/usr/lib64/python3.6/html/parser.py", line 111, in feed
        self.goahead(0)
      File "/usr/lib64/python3.6/html/parser.py", line 179, in goahead
        k = self.parse_html_declaration(i)
      File "/usr/lib64/python3.6/html/parser.py", line 264, in parse_html_declaration
        return self.parse_marked_section(i)
      File "/usr/lib64/python3.6/_markupbase.py", line 149, in parse_marked_section
        sectName, j = self._scan_name( i+3, i )
      File "/usr/lib64/python3.6/_markupbase.py", line 391, in _scan_name
        % rawdata[declstartpos:declstartpos+20])
      File "/usr/lib64/python3.6/_markupbase.py", line 34, in error
        "subclasses of ParserBase must override error()")
    NotImplementedError: subclasses of ParserBase must override error()

    @hannob hannob mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 19, 2018
    @stevendaprano
    Member

    The stdlib HTML parser requires correct HTML.

    To parse broken HTML, as you find in the real world, you need a third-party library like BeautifulSoup. BeautifulSoup is much more complex (about 7-8 times as many LOC) but can handle nearly anything a browser can.

    I doubt the stdlib will ever compete with BeautifulSoup.

    @hannob
    Mannequin Author

    hannob mannequin commented Feb 19, 2018

    Actually BeautifulSoup also uses the python html parser in the backend, so it has the same problem. (It can use alternative backends, but the python parser is the default and they also describe it as "lenient", which I would interpret as "it can handle that".)

    @ezio-melotti
    Member

    The HTMLParser has been updated to handle HTML5 and should never fail parsing a document, so if it raises an error it's probably a bug.

    @ezio-melotti ezio-melotti self-assigned this Feb 26, 2018
    @berkerpeksag
    Member

    bpo-34480 is another relevant issue. The HTMLParser class doesn't have an error() method and it doesn't raise any exceptions itself, but its base class still does. I think there is a compatibility problem between the html.parser.HTMLParser and _markupbase.ParserBase classes. See https://bugs.python.org/msg323966 for more details about this.

    @ezio-melotti
    Member

    There are at least a couple of issues here.

    The first one is the way the parser handles '<![...'. The linked page contains markup like '<![STAT]-[USER-ACTIVE]!>' and since the parser currently checks for '<![' only, _markupbase.py:parse_marked_section gets called and an error gets incorrectly raised.
    However, "8.2.4.42. Markup declaration open state" states that after consuming '<!', there are only 4 valid paths forward:

    1. if we have '<!--', it's a comment;
    2. if we have '<!doctype', it's a doctype declaration;
    3. if we have '<![CDATA[', it's a CDATA section;
    4. if it's something else, it's a bogus comment;

    The above example should therefore fall into 4), and be treated like a bogus comment.
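
The four paths listed above can be sketched as a small classifier. This is only an illustration of the decision order, not the code from PR-9295; the function name `classify_markup_declaration` and its return labels are hypothetical, and it ignores the spec's extra condition that CDATA sections are only recognized in foreign content.

```python
def classify_markup_declaration(markup: str) -> str:
    """Classify text starting with '<!' along the four paths described above.

    Hypothetical helper for illustration; it is not part of html.parser.
    """
    if not markup.startswith("<!"):
        raise ValueError("expected input starting with '<!'")
    rest = markup[2:]
    if rest.startswith("--"):
        return "comment"
    # The doctype keyword is matched case-insensitively.
    if rest[:7].lower() == "doctype":
        return "doctype"
    # The CDATA keyword is matched case-sensitively.
    if rest.startswith("[CDATA["):
        return "cdata"
    # Anything else, including '<![STAT]-[USER-ACTIVE]!>', is a bogus comment.
    return "bogus comment"
```

With this ordering, both the markup from the linked page and the minified example fall through to the last branch.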

    PR-9295 changes parse_html_declaration() to align to the specs and implement path 3), resulting in the webpage being parsed without errors (the invalid markup is considered as a bogus comment).

    The second issue is about an EOF in the middle of a bogus markup declaration, like in the minified example provided by OP ("<![\n"). In this case the comment should still be emitted ('[\n'), but currently nothing gets emitted. I'll look more into it either tomorrow or later this month and update the PR accordingly (or perhaps I'll open a separate issue).
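
To check what a given interpreter actually emits for such input, one option is a subclass that records the handler calls. The sketch below uses only documented HTMLParser handler methods; the names `EventRecorder` and `record_events` are made up for this example, and the result for "<![\n" depends on the Python version, so no output is shown for it.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record comment, declaration and data events in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def unknown_decl(self, data):
        self.events.append(("unknown_decl", data))

    def handle_data(self, data):
        self.events.append(("data", data))


def record_events(markup):
    """Return (events, error); error is None if parsing did not raise."""
    parser = EventRecorder()
    error = None
    try:
        parser.feed(markup)
        parser.close()  # flush anything still buffered at EOF
    except (AssertionError, NotImplementedError) as exc:
        # The two exception types reported in this thread.
        error = exc
    return parser.events, error
```

Calling `record_events("<![\n")` on an affected version returns the exception in the second slot; on a fixed version the second slot is `None` and the first shows what was emitted.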

    @ezio-melotti ezio-melotti added 3.7 (EOL) end of life 3.8 (EOL) end of life labels Sep 14, 2018
    @iritkatriel
    Member

    I get a different error now:

    >>> import html.parser
    >>> html.parser.HTMLParser().feed("<![\n")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 110, in feed
        self.goahead(0)
        ^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 178, in goahead
        k = self.parse_html_declaration(i)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 263, in parse_html_declaration
        return self.parse_marked_section(i)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 144, in parse_marked_section
        sectName, j = self._scan_name( i+3, i )
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 390, in _scan_name
        raise AssertionError(
        ^^^^^^^^^^^^^^^^^^^^^
    AssertionError: expected name token at '<![\n'

    @iritkatriel
    Member

    The error() method was removed in bpo-31844.

    @hannob
    Mannequin Author

    hannob mannequin commented Jan 14, 2022

    Now the example code raises an AssertionError(). Is that intended? I don't think that's any better.

    I usually wouldn't expect an HTML parser to raise any error if you pass it a string, but instead to do fault tolerant parsing. And if it's expected that some inputs can generate exceptions, at least I think this should be properly documented.
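
Until a fixed version is available, callers who need fault-tolerant behaviour could guard the call themselves. The following is a minimal sketch of that idea, assuming the only exceptions of interest are the two seen in this thread; `safe_feed` is a hypothetical helper, not an html.parser API.

```python
from html.parser import HTMLParser


def safe_feed(parser: HTMLParser, markup: str) -> bool:
    """Feed markup to the parser; return False if the parser raised.

    Catches only the two exception types reported in this issue.
    """
    try:
        parser.feed(markup)
    except (AssertionError, NotImplementedError):
        # Discard the buffered input so later feed() calls do not
        # trip over the same unparsed markup again.
        parser.reset()
        return False
    return True
```

A return value of `False` tells the caller that the document was only partially processed, which is a trade-off: handler calls made before the failure have already happened.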

    @iritkatriel
    Member

    Reopening to discuss what the correct behaviour should be.

    @iritkatriel iritkatriel added 3.11 only security fixes and removed 3.7 (EOL) end of life 3.8 (EOL) end of life labels Jan 14, 2022
    @iritkatriel iritkatriel reopened this Jan 14, 2022
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 10, 2025
    …rser (pythonGH-9295)
    
    (cherry picked from commit 76c0b01)
    
    Co-authored-by: Ezio Melotti <[email protected]>
    Co-authored-by: Serhiy Storchaka <[email protected]>
    @serhiy-storchaka
    Member

    @ezio-melotti's PR has been merged.

    A truncated comment, doctype declaration, or CDATA section now produces data.

    @github-project-automation github-project-automation bot moved this from Todo to Done in html.parser issues May 10, 2025
    @serhiy-storchaka serhiy-storchaka added 3.13 bugs and security fixes 3.14 bugs and security fixes 3.15 new features, bugs and security fixes and removed 3.11 only security fixes labels May 10, 2025
    serhiy-storchaka added a commit that referenced this issue May 10, 2025
    …arser (GH-9295) (GH-133834)
    
    (cherry picked from commit 76c0b01)
    
    Co-authored-by: Ezio Melotti <[email protected]>
    Co-authored-by: Serhiy Storchaka <[email protected]>
    serhiy-storchaka added a commit that referenced this issue May 10, 2025
    …arser (GH-9295) (GH-133833)
    
    (cherry picked from commit 76c0b01)
    
    Co-authored-by: Ezio Melotti <[email protected]>
    Co-authored-by: Serhiy Storchaka <[email protected]>
    nschloe pushed a commit to live-clones/beautifulsoup that referenced this issue May 25, 2025
    CPython fixed python/cpython#77057 and
    backported this to >= 3.13.4
    
    This recently showed up in Debian unstable, when we uploaded the latest
    3.13 snapshot.