-
-
Notifications
You must be signed in to change notification settings - Fork 32k
gh-118350: Add escapable-raw-text mode to html parser #121770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between processing raw text elements and escapable raw text elements? I do not see any this code.
Lib/test/test_htmlparser.py
Outdated
("data", content), | ||
("endtag", element_lower)]) | ||
|
||
def test_escapable_raw_text_with_closing_tags(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it right? The test name is test_escapable_raw_text_with_closing_tags, but it tests the script element. It looks very similar to test_cdata_with_closing_tags.
Lib/test/test_htmlparser.py
Outdated
'<!-- not a comment --> ¬-an-entity-ref;', | ||
"<not a='start tag'>", | ||
'<a href="" /> <p> <span></span>', | ||
'foo = "</scr" + "ipt>";', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why test this in the title and textarea elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add also examples of valid character references and an ambiguous ampersand.
@ezio-melotti @serhiy-storchaka can you help with the review? |
Lib/test/test_htmlparser.py
Outdated
('starttag', 'title', []), ('data', text), | ||
('endtag', 'title'), ('data', '"'), | ||
('starttag', 'textarea', []), ('data', text), | ||
('endtag', 'textarea'), ('data', '"')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct. Charrefs should be resolved in escapable raw text elements. Data should be '"X"X"'
instead of text
. Except for an ambiguous ampersand.
Lib/test/test_htmlparser.py
Outdated
@@ -317,6 +319,34 @@ def get_events(self): | |||
("endtag", element_lower)], | |||
collector=Collector(convert_charrefs=False)) | |||
|
|||
def test_escapable_raw_text_content(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this test differ from test_cdata_content? BTW, most examples use JavaScript syntax, and only relevant for <script>
.
Lib/html/parser.py
Outdated
@@ -28,6 +28,7 @@ | |||
|
|||
starttagopen = re.compile('<[a-zA-Z]') | |||
piclose = re.compile('>') | |||
escapable_raw_text_close = re.compile('</(title|textarea)>', re.I) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it even used?
Lib/html/parser.py
Outdated
if self.cdata_elem: | ||
break | ||
j = n | ||
if i < j: | ||
if self.convert_charrefs and not self.cdata_elem: | ||
if self.convert_charrefs and not self.cdata_elem and not self.escapable_raw_text_elem: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect. Charrefs should be resolved in an escapable raw text element. Except an ambiguous ampersand.
We need also tests for convert_charrefs=False
in an escapable raw text element.
Lib/html/parser.py
Outdated
@@ -138,6 +141,14 @@ def get_starttag_text(self): | |||
"""Return full source of start tag: '<...>'.""" | |||
return self.__starttag_text | |||
|
|||
def set_escapable_raw_text_mode(self, elem): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the behavior for raw text elements and escapable raw text elements is so similar, and they cannot be nested, why not use set_cdata_mode()
and cdata_elem
for both? Just add an optional boolean parameter to specify whether it is escapable (charrefs should be unescaped) or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@serhiy-storchaka I can do that.
('entityref', 'amp'), | ||
('data', ' Pumba') | ||
], | ||
collector=Collector(convert_charrefs=False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean this test? @serhiy-storchaka
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Thanks.
escapable raw text elements are not handled in the current HTMLParser implementation.
This PR extends the existing parser with an additional mode to handle this correctly.
<style>
tag #118350