gh-118350: Add escapable-raw-text mode to html parser #121770

timonviola · 2024-07-14T14:03:12Z

Escapable raw text elements are not handled in the current HTMLParser implementation.

This PR extends the existing parser with an additional mode to handle this correctly.

Issue: HTMLParser stops parsing upon encountering <style> tag #118350

serhiy-storchaka

What is the difference between processing raw text elements and escapable raw text elements? I do not see any in this code.

serhiy-storchaka · 2025-05-07T10:47:49Z

Lib/test/test_htmlparser.py

+                                    ("data", content),
+                                    ("endtag", element_lower)])
+
+    def test_escapable_raw_text_with_closing_tags(self):


Is it right? The test name is test_escapable_raw_text_with_closing_tags, but it tests the script element. It looks very similar to test_cdata_with_closing_tags.

serhiy-storchaka · 2025-05-07T10:49:02Z

Lib/test/test_htmlparser.py

+            '<!-- not a comment --> &not-an-entity-ref;',
+            "<not a='start tag'>",
+            '<a href="" /> <p> <span></span>',
+            'foo = "</scr" + "ipt>";',


Why test this in the title and textarea elements?

Also add examples of valid character references and an ambiguous ampersand.

timonviola · 2025-05-13T20:13:32Z

@ezio-melotti @serhiy-storchaka can you help with the review?

serhiy-storchaka · 2025-05-14T08:44:36Z

Lib/test/test_htmlparser.py

+                        ('starttag', 'title', []), ('data', text),
+                        ('endtag', 'title'), ('data', '"'),
+                        ('starttag', 'textarea', []), ('data', text),
+                        ('endtag', 'textarea'), ('data', '"')]


This is not correct. Charrefs should be resolved in escapable raw text elements, so the data should be '"X"X"' instead of text. The exception is an ambiguous ampersand.
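
As a rough illustration of the expected data: the actual value of `text` is not shown in this hunk, so the input below is a hypothetical one that spells a double quote three different ways. The standard library's `html.unescape` resolves all three, and leaves an ampersand that does not start a character reference untouched.

```python
import html

# Hypothetical input (not taken from the PR): named, decimal and hex
# references for the double-quote character.
source = '&quot;X&#34;X&#x22;'
resolved = html.unescape(source)
print(resolved)

# An ampersand followed by a space does not start a character reference,
# so it passes through unchanged.
bare = 'Timon & Pumba'
print(html.unescape(bare))
```

This only shows what resolved character data looks like; it says nothing about how the parser itself should reach that result.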

serhiy-storchaka · 2025-05-14T08:50:16Z

Lib/test/test_htmlparser.py

@@ -317,6 +319,34 @@ def get_events(self):
                                ("endtag", element_lower)],
                            collector=Collector(convert_charrefs=False))

+    def test_escapable_raw_text_content(self):


How does this test differ from test_cdata_content? BTW, most examples use JavaScript syntax and are only relevant for <script>.

serhiy-storchaka · 2025-05-14T09:22:49Z

Lib/html/parser.py

@@ -28,6 +28,7 @@

 starttagopen = re.compile('<[a-zA-Z]')
 piclose = re.compile('>')
+escapable_raw_text_close = re.compile('</(title|textarea)>', re.I)


Is it even used?
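
For reference, a quick check of what the pattern in this hunk would match if it were used. The sample strings are made up for illustration; the snippet does not answer whether the parser references the pattern anywhere.

```python
import re

# Pattern copied from the hunk above.
escapable_raw_text_close = re.compile('</(title|textarea)>', re.I)

# Hypothetical sample closing tags.
samples = ['</title>', '</TEXTAREA>', '</title >', '</script>']
matches = {s: bool(escapable_raw_text_close.search(s)) for s in samples}
print(matches)
```

The pattern is case-insensitive but requires `>` directly after the tag name, so a closing tag with trailing whitespace is not matched.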

serhiy-storchaka · 2025-05-14T09:26:51Z

Lib/html/parser.py

                    if self.cdata_elem:
                        break
                    j = n
            if i < j:
-                if self.convert_charrefs and not self.cdata_elem:
+                if self.convert_charrefs and not self.cdata_elem and not self.escapable_raw_text_elem:


This is incorrect. Charrefs should be resolved in an escapable raw text element, except for an ambiguous ampersand.

We also need tests for convert_charrefs=False in an escapable raw text element.
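
A minimal sketch of how such events can be collected with convert_charrefs=False. `EventCollector` and `collect` are hypothetical helpers written for this illustration, not the `Collector` class from the test suite. The ordinary `<p>` element shows the callback pattern; what the same content yields inside `<textarea>` depends on the Python version in use, so no expected output is given for it.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper that records parser callbacks as tuples."""

    def __init__(self, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('endtag', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_charref(self, name):
        self.events.append(('charref', name))


def collect(source, convert_charrefs):
    """Feed the whole source at once and return the recorded events."""
    parser = EventCollector(convert_charrefs=convert_charrefs)
    parser.feed(source)
    parser.close()
    return parser.events


# Ordinary element: the entity reference is reported through its own callback.
plain = collect('<p>Timon &amp; Pumba</p>', convert_charrefs=False)
print(plain)

# Escapable raw text element: result varies by Python version.
print(collect('<textarea>Timon &amp; Pumba</textarea>', convert_charrefs=False))
```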

serhiy-storchaka · 2025-05-14T09:35:38Z

Lib/html/parser.py

@@ -138,6 +141,14 @@ def get_starttag_text(self):
        """Return full source of start tag: '<...>'."""
        return self.__starttag_text

+    def set_escapable_raw_text_mode(self, elem):


Since the behavior for raw text elements and escapable raw text elements is so similar, and they cannot be nested, why not use set_cdata_mode() and cdata_elem for both? Just add an optional boolean parameter to specify whether it is escapable (charrefs should be unescaped) or not.

@serhiy-storchaka I can do that.
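
The suggestion above can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the code in Lib/html/parser.py: `RawTextState`, `should_unescape`, the `closing` attribute and the closing-tag pattern are all hypothetical; only `set_cdata_mode`, `cdata_elem` and the idea of an optional boolean flag come from the review comment.

```python
import re


class RawTextState:
    """Hypothetical stand-in for the parser's raw-text bookkeeping."""

    def __init__(self, convert_charrefs=True):
        self.convert_charrefs = convert_charrefs
        self.cdata_elem = None   # name of the open raw text element, if any
        self.escapable = False   # True for elements such as <title>, <textarea>
        self.closing = None      # pattern that ends the current element

    def set_cdata_mode(self, elem, escapable=False):
        # One entry point for both kinds of element; only the flag differs.
        self.cdata_elem = elem.lower()
        self.escapable = escapable
        self.closing = re.compile(
            r'</%s\s*>' % re.escape(self.cdata_elem), re.IGNORECASE)

    def clear_cdata_mode(self):
        self.cdata_elem = None
        self.escapable = False
        self.closing = None

    def should_unescape(self):
        # Charrefs are resolved in normal content and in escapable raw text,
        # but not in plain raw text such as <script> or <style>.
        if not self.convert_charrefs:
            return False
        return self.cdata_elem is None or self.escapable


state = RawTextState()
state.set_cdata_mode('script')
in_script = state.should_unescape()
state.set_cdata_mode('TEXTAREA', escapable=True)
in_textarea = state.should_unescape()
print(in_script, in_textarea)
```

Keeping a single mode with a flag avoids a second set of attributes and a parallel setter, which matches the observation that the two kinds of element cannot be nested.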

timonviola · 2025-05-14T19:01:30Z

Lib/test/test_htmlparser.py

+                  ('entityref', 'amp'),
+                  ('data', ' Pumba')
+                ],
+                collector=Collector(convert_charrefs=False),


Did you mean this test? @serhiy-storchaka

Yes. Thanks.

fix: add escapable raw text mode to html parser

420af54

timonviola requested a review from ezio-melotti as a code owner July 14, 2024 14:03

bedevere-app bot mentioned this pull request Jul 14, 2024

HTMLParser stops parsing upon encountering <style> tag #118350

Open

bedevere-app bot added the awaiting review label Jul 14, 2024

ezio-melotti self-assigned this Jul 14, 2024

serhiy-storchaka reviewed May 7, 2025

timonviola added 4 commits May 13, 2025 21:21

test: add character reference and ambiguous ampersand test cases

1241a65

test: remove irrelevant test

e7f11a0

test: add character reference tests for raw escapable elements

bd63490

test: include raw text and escapable raw text elements in cdata content test

da868db

Merge branch 'main' into fix-issue-118350

d17b409

serhiy-storchaka reviewed May 14, 2025

timonviola added 4 commits May 14, 2025 19:58

Merge branch 'python:main' into fix-issue-118350

d8cc255

update to latest main

a36070a

test: add failing test

43804bb

test: add charref test

70b8e5d

timonviola commented May 14, 2025

serhiy-storchaka marked this pull request as draft May 15, 2025 07:11

bedevere-app bot removed the awaiting review label May 15, 2025
