Skip to content

gh-118350: Add escapable-raw-text mode to html parser #121770

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

timonviola
Copy link
Contributor

@timonviola timonviola commented Jul 14, 2024

escapable raw text elements are not handled in the current HTMLParser implementation.

This PR extends the existing parser with an additional mode to handle this correctly.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between processing raw text elements and escapable raw text elements? I do not see any this code.

("data", content),
("endtag", element_lower)])

def test_escapable_raw_text_with_closing_tags(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it right? The test name is test_escapable_raw_text_with_closing_tags, but it tests the script element. It looks very similar to test_cdata_with_closing_tags.

'<!-- not a comment --> &not-an-entity-ref;',
"<not a='start tag'>",
'<a href="" /> <p> <span></span>',
'foo = "</scr" + "ipt>";',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why test this in the title and textarea elements?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add also examples of valid character references and an ambiguous ampersand.

@timonviola
Copy link
Contributor Author

@ezio-melotti @serhiy-storchaka can you help with the review?

Comment on lines 411 to 414
('starttag', 'title', []), ('data', text),
('endtag', 'title'), ('data', '"'),
('starttag', 'textarea', []), ('data', text),
('endtag', 'textarea'), ('data', '"')]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct. Charrefs should be resolved in escapable raw text elements. Data should be '"X"X"' instead of text. Except for an ambiguous ampersand.

@@ -317,6 +319,34 @@ def get_events(self):
("endtag", element_lower)],
collector=Collector(convert_charrefs=False))

def test_escapable_raw_text_content(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this test differ from test_cdata_content? BTW, most examples use JavaScript syntax, and only relevant for <script>.

@@ -28,6 +28,7 @@

starttagopen = re.compile('<[a-zA-Z]')
piclose = re.compile('>')
escapable_raw_text_close = re.compile('</(title|textarea)>', re.I)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it even used?

if self.cdata_elem:
break
j = n
if i < j:
if self.convert_charrefs and not self.cdata_elem:
if self.convert_charrefs and not self.cdata_elem and not self.escapable_raw_text_elem:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. Charrefs should be resolved in an escapable raw text element. Except an ambiguous ampersand.

We need also tests for convert_charrefs=False in an escapable raw text element.

@@ -138,6 +141,14 @@ def get_starttag_text(self):
"""Return full source of start tag: '<...>'."""
return self.__starttag_text

def set_escapable_raw_text_mode(self, elem):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the behavior for raw text elements and escapable raw text elements is so similar, and they cannot be nested, why not use set_cdata_mode() and cdata_elem for both? Just add an optional boolean parameter to specify whether it is escapable (charrefs should be unescaped) or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@serhiy-storchaka I can do that.

('entityref', 'amp'),
('data', ' Pumba')
],
collector=Collector(convert_charrefs=False),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean this test? @serhiy-storchaka

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Thanks.

@serhiy-storchaka serhiy-storchaka marked this pull request as draft May 15, 2025 07:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants