htmlparser unclosed script tag causes data loss #86155

Closed
waylan mannequin opened this issue Oct 10, 2020 · 3 comments
Labels
  • 3.13: bugs and security fixes
  • 3.14: bugs and security fixes
  • 3.15: new features, bugs and security fixes
  • stdlib: Python modules in the Lib dir
  • type-bug: An unexpected behavior, bug, or error

Comments

@waylan
Mannequin

waylan mannequin commented Oct 10, 2020

BPO 41989
Nosy @terryjreedy, @ezio-melotti, @waylan
PRs
  • gh-86155: Fix htmlparser "unclosed script tag causes data loss" #22658
Files
  • test_html.py: A simple test

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = None
    created_at = <Date 2020-10-10.01:08:29.514>
    labels = ['type-bug', 'library', '3.10']
    title = 'htmlparser unclosed script tag causes data loss'
    updated_at = <Date 2020-10-16.20:42:44.634>
    user = 'https://github.com/waylan'

    bugs.python.org fields:

    activity = <Date 2020-10-16.20:42:44.634>
    actor = 'terry.reedy'
    assignee = 'ezio.melotti'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2020-10-10.01:08:29.514>
    creator = 'waylan'
    dependencies = []
    files = ['49505']
    hgrepos = []
    issue_num = 41989
    keywords = ['patch']
    message_count = 2.0
    messages = ['378359', '378748']
    nosy_count = 3.0
    nosy_names = ['terry.reedy', 'ezio.melotti', 'waylan']
    pr_nums = ['22658']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue41989'
    versions = ['Python 3.10']

    Linked PRs

    @waylan
    Mannequin Author

    waylan mannequin commented Oct 10, 2020

    When the close method of HTMLParser is called, any buffered text data is generally flushed and passed to a data event, except when the parser is in CDATA mode. Specifically, if an unclosed script or style tag has been encountered, a call to close does not flush the data.

    A simple test which demonstrates the issue is attached.

    I see that in Lib/html/parser.py#L244-L249 there are two nested if statements which both check for not self.cdata_elem. Once execution gets past the outer check, the inner check can never fail, so it is redundant. This block of code needs a branch for the case where self.cdata_elem is set.
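
The control flow described above can be paraphrased in a standalone form. This is only a sketch under assumptions: the two functions and their parameters are hypothetical names introduced for illustration, and neither is the actual stdlib source or the merged patch. The first mirrors the nested checks as described; the second shows one possible shape for a branch that also handles the case where `cdata_elem` is set.

```python
from html import unescape


def flush_as_described(rest, cdata_elem, convert_charrefs, handle_data):
    """Hypothetical paraphrase of the end-of-input flush described above."""
    # Outer guard: nothing is flushed while inside <script> or <style>.
    if rest and not cdata_elem:
        # The inner test repeats `not cdata_elem`, which is always true here.
        if convert_charrefs and not cdata_elem:
            handle_data(unescape(rest))
        else:
            handle_data(rest)


def flush_with_cdata_branch(rest, cdata_elem, convert_charrefs, handle_data):
    """Hypothetical variant that also flushes pending script/style text."""
    # Assumption: dropping the outer cdata_elem guard lets the inner test
    # pick the branch; script/style text is passed through without
    # unescaping character references.
    if rest:
        if convert_charrefs and not cdata_elem:
            handle_data(unescape(rest))
        else:
            handle_data(rest)
```

In the first function the `else` branch is reached only when `convert_charrefs` is false, so text pending inside a script or style element is never delivered.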

    I should note that the input is invalid HTML. However, the existing behavior results in data loss. Within any unclosed tag other than script or style, data is still flushed and passed to a data event, and I would expect the same behavior here. That said, the data escaping behavior should perhaps be applied as it is with data within properly closed tags.
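
The behavior reported in this comment can be checked with a short script. This is a minimal sketch and not the attached test_html.py; the `Collector` class and the `collect` helper are hypothetical names introduced for illustration. It compares an unclosed `<p>` with an unclosed `<script>`. On an interpreter without the fix, the script text should be absent from the collected events; on one that includes the fix from GH-22658, it should be delivered to `handle_data`.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Record start tags and text data in the order they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect(html):
    """Feed the whole input, close the parser, and return the events."""
    parser = Collector()
    parser.feed(html)
    parser.close()  # close() is expected to flush any pending text
    return parser.events


# Unclosed ordinary tag: the trailing text is reported as data.
print(collect("<p>hello"))

# Unclosed script tag: whether the trailing text appears depends on
# whether the running interpreter includes the fix.
print(collect("<script>alert(1)"))
```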

    waylan mannequin added the 3.7 (EOL), 3.8 (EOL), 3.9, 3.10, stdlib, and type-bug labels Oct 10, 2020
    @terryjreedy
    Member

    Waylan, 3.7 and before only get security fixes.

    To me, this might be considered an enhancement rather than bug fix, but I will leave that to Ezio.

    terryjreedy removed the 3.7 (EOL), 3.8 (EOL), and 3.9 labels Oct 16, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    serhiy-storchaka added the 3.13 and 3.14 labels and removed the 3.10 label May 10, 2025
    @serhiy-storchaka serhiy-storchaka added the 3.15 new features, bugs and security fixes label May 10, 2025
    serhiy-storchaka pushed a commit that referenced this issue May 10, 2025
    …ser (GH-22658)
    
    When calling .close() the HTMLParser should flush all remaining content,
    even when that content is in an unclosed script or style tag.
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 10, 2025
    …TMLParser (pythonGH-22658)
    
    When calling .close() the HTMLParser should flush all remaining content,
    even when that content is in an unclosed script or style tag.
    (cherry picked from commit 53383e9)
    
    Co-authored-by: Waylan Limberg <[email protected]>
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 10, 2025
    …TMLParser (pythonGH-22658)
    
    When calling .close() the HTMLParser should flush all remaining content,
    even when that content is in an unclosed script or style tag.
    (cherry picked from commit 53383e9)
    
    Co-authored-by: Waylan Limberg <[email protected]>
    @serhiy-storchaka
    Member

    Thank you for your contribution, @waylan. Sorry it took so long to review your PR.

    @github-project-automation github-project-automation bot moved this from Todo to Done in html.parser issues May 10, 2025
    serhiy-storchaka pushed a commit that referenced this issue May 10, 2025
    …HTMLParser (GH-22658) (GH-133845)
    
    When calling .close() the HTMLParser should flush all remaining content,
    even when that content is in an unclosed script or style tag.
    (cherry picked from commit 53383e9)
    
    Co-authored-by: Waylan Limberg <[email protected]>
    serhiy-storchaka pushed a commit that referenced this issue May 10, 2025
    …HTMLParser (GH-22658) (GH-133844)
    
    When calling .close() the HTMLParser should flush all remaining content,
    even when that content is in an unclosed script or style tag.
    (cherry picked from commit 53383e9)
    
    Co-authored-by: Waylan Limberg <[email protected]>
    Projects
    Status: Done