re: missing whitespace in root node #34

milahu · 2022-11-20T07:34:19Z

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

#33

Link to runnable example

No response

Steps to reproduce

#33

> This is required by the HTML standard, which we follow.

so my input html is invalid?

<!doctype html>
<html lang="en">
  <head>
    <title>test</title>
  </head>
  <body>
    <div>test</div>
  </body>
</html>

is valid html per https://validator.w3.org/nu/#textarea

the unified toolchain mangles this to

<!doctype html><html lang="en"><head>
    <title>test</title>
  </head>
  <body>
    <div>test</div>
  

</body></html>

the unified toolchain

i dont know where exactly the bug is.
maybe in hast-util-to-html, maybe somewhere else

> That behavior is not always useful.

its certainly useful for writing lossless transformers, which produce minimal diffs

maybe this could cause trouble in some tree consumers like hast-util-select
which expect "no whitespace in the root node"?

> If you want pretty HTML, use rehype-format!

no, i want to preserve the original whitespace

Affected package manager and version

No response

Affected OS and version

No response

Build and bundle tools

No response

wooorm · 2022-11-20T08:14:51Z

You can still comment on closed issues.

This isn’t about valid or invalid.
This is about the HTML standard prescribing how to parse HTML.
We follow how the HTML standard prescribes parsing HTML.

We use an AST (abstract). Not a CST (concrete). We don’t generate the exact input document. We generate an equivalent document.
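Editorial sketch: the whitespace changes in the output quoted in the first post can be traced to the insertion modes of the HTML parsing algorithm. The snippet below is a hand-labeled model of that one input, not output from parse5 or any unified package. The `Piece` type, the mode labels attached to each whitespace run, and `reserialize` are all assumptions made for illustration.

```typescript
// Insertion-mode names as used by the HTML parsing algorithm.
type Mode =
  | "before html" | "before head" | "in head" | "after head"
  | "in body" | "after body" | "after after body";

// Hypothetical model of the input: markup, end tags that the serializer
// emits only at the very end, and whitespace runs labeled by hand with
// the mode the parser is in when it sees them.
type Piece =
  | { kind: "markup"; text: string }
  | { kind: "deferredEndTag"; text: string }
  | { kind: "whitespace"; text: string; mode: Mode };

const pieces: Piece[] = [
  { kind: "markup", text: "<!doctype html>" },
  { kind: "whitespace", text: "\n", mode: "before html" },
  { kind: "markup", text: '<html lang="en">' },
  { kind: "whitespace", text: "\n  ", mode: "before head" },
  { kind: "markup", text: "<head>" },
  { kind: "whitespace", text: "\n    ", mode: "in head" },
  { kind: "markup", text: "<title>test</title>" },
  { kind: "whitespace", text: "\n  ", mode: "in head" },
  { kind: "markup", text: "</head>" },
  { kind: "whitespace", text: "\n  ", mode: "after head" },
  { kind: "markup", text: "<body>" },
  { kind: "whitespace", text: "\n    ", mode: "in body" },
  { kind: "markup", text: "<div>test</div>" },
  { kind: "whitespace", text: "\n  ", mode: "in body" },
  { kind: "deferredEndTag", text: "</body>" },
  { kind: "whitespace", text: "\n", mode: "after body" },
  { kind: "deferredEndTag", text: "</html>" },
  { kind: "whitespace", text: "\n", mode: "after after body" },
];

function reserialize(input: Piece[]): string {
  let out = "";
  let endTags = "";
  for (const piece of input) {
    if (piece.kind === "markup") {
      out += piece.text;
    } else if (piece.kind === "deferredEndTag") {
      // </body> and </html> do not remove body from the open elements,
      // so whitespace after them is still inserted into body.
      endTags += piece.text;
    } else if (piece.mode !== "before html" && piece.mode !== "before head") {
      // Whitespace is dropped in the two modes above; in "after body" and
      // "after after body" it is handled with the "in body" rules.
      out += piece.text;
    }
  }
  return out + endTags;
}

const reserialized = reserialize(pieces);
```

Under these assumptions `reserialized` is character for character the output quoted in the first post: the whitespace before `<html>` and before `<head>` disappears, and the whitespace after `</body>` and `</html>` ends up inside the body.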

milahu · 2022-11-20T12:40:06Z

sorry .__.
please feel free to merge the issues

> We follow how the HTML standard prescribes parsing HTML.

rendering html != transforming html

> We use an AST (abstract). Not a CST (concrete). We don’t generate the exact input document. We generate an equivalent document.

the parse tree is 99% concrete already.
it contains all the whitespace, except around the root node.
making it 100% concrete is a small fix with a large gain (at least for my use case).
at least this could be made optional

the resulting parser may be less "strict" and more "loose" than a perfectly spec-compliant parser ...
but so what? whats the punishment of breaking the spec?
except that html rendering is 0.1% slower, to skip some whitespace nodes

wooorm · 2022-11-20T12:54:19Z

> please feel free to merge the issues

That’s not possible through the GH UI, unfortunately!

> making it 100% concrete is a small fix

It’s extremely complex. A concrete tree would house all original sources of character references. It would house whitespace inside tags. It would house the casing of attributes. It would include information on whether double or single quotes are used. It would include information in doctypes and processing instructions. It would contain casing of tag names. It would be months of work to create.
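Editorial sketch: to make the list above concrete for a single attribute, the snippet below contrasts a lossless node with an abstract one. `ConcreteAttribute`, `AbstractAttribute`, `toAbstract`, and `toSource` are hypothetical names invented for this illustration; they are not hast or parse5 types, and a real concrete tree would need the same treatment for tag names, doctypes, and everything else listed.

```typescript
// Hypothetical shapes for illustration only.
interface ConcreteAttribute {
  leading: string;        // whitespace before the attribute, as written
  rawName: string;        // name with its original casing, e.g. "HREF"
  quote: '"' | "'" | "";  // quote style used in the source
  rawValue: string;       // value as written, e.g. "a&amp;b"
  value: string;          // value after decoding character references
}

interface AbstractAttribute {
  name: string;
  value: string;
}

// Keep only what affects the meaning of the attribute.
// toLowerCase() stands in for the parser's ASCII lowercasing of names.
function toAbstract(attr: ConcreteAttribute): AbstractAttribute {
  return { name: attr.rawName.toLowerCase(), value: attr.value };
}

// Reproduce the attribute exactly as it appeared in the source.
function toSource(attr: ConcreteAttribute): string {
  return attr.leading + attr.rawName + "=" + attr.quote + attr.rawValue + attr.quote;
}

const spellingA: ConcreteAttribute = {
  leading: " ", rawName: "HREF", quote: "'",
  rawValue: "?a=1&amp;b=2", value: "?a=1&b=2",
};
const spellingB: ConcreteAttribute = {
  leading: "\n  ", rawName: "href", quote: '"',
  rawValue: "?a=1&#38;b=2", value: "?a=1&b=2",
};

// Both spellings collapse to the same abstract attribute...
const sameAbstract =
  JSON.stringify(toAbstract(spellingA)) === JSON.stringify(toAbstract(spellingB));
// ...but only the concrete form can reproduce each original spelling.
const roundTripA = toSource(spellingA);
const roundTripB = toSource(spellingB);
```

Every field beyond `name` and `value` is information an abstract tree discards, which is why two different source spellings can serialize to the same equivalent document.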

> at least this could be made optional

It’s a ton of work, probably a full year to implement a new parser and design the data structure. If you are interested in paying me to make it reality I would likely quote you $200k.

> the resulting parser may be less "strict" and more "loose" than a perfectly spec-compliant parser ...

This project adheres to the HTML spec.

> whats the punishment of breaking the spec?

In this particular case of only the whitespace? It would indeed probably not be terribly complex to implement, but parse5 doesn’t support it either. Feel free to investigate.

But we have had similar conversations about this: hast is for the HTML spec. If you want XML, use xast. If you want to use a different project that does not adhere to the spec, feel free to make it yourself.

milahu · 2022-11-20T13:12:01Z

thanks.

> feel free to make it yourself.

yes. i would use a concrete parser like

tree-sitter-html with treesitter-to-unist
lezer-parser-html to avoid WASM

with a different transformer (different node names, different data structure)
which is cheap, because my transformer is simple

edit: for my use case "html to html transformer"
i would use a different framework, probably eslint
where i have concrete parsers by default
and where i can modify the source string (not modify the parse tree)
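Editorial sketch: the "modify the source string, not the parse tree" approach can be illustrated with offset-based edits, which leave every untouched character, including all whitespace, exactly as written. `TextEdit` and `applyEdits` are hypothetical names for this illustration and are not the API of eslint or any parser mentioned in this thread; a concrete parser would normally supply the offsets that are found with `indexOf` here.

```typescript
// Hypothetical edit record: replace source[start, end) with text.
interface TextEdit {
  start: number; // inclusive offset into the source string
  end: number;   // exclusive offset into the source string
  text: string;  // replacement text
}

function applyEdits(source: string, edits: TextEdit[]): string {
  // Apply from the end of the source backwards so that the offsets of
  // edits earlier in the string remain valid.
  const sorted = [...edits].sort((a, b) => b.start - a.start);
  let result = source;
  let limit = source.length;
  for (const edit of sorted) {
    if (edit.start < 0 || edit.start > edit.end || edit.end > limit) {
      throw new Error("overlapping or out-of-range edit");
    }
    result = result.slice(0, edit.start) + edit.text + result.slice(edit.end);
    limit = edit.start;
  }
  return result;
}

const sourceHtml = "  <body>\n    <div>test</div>\n  </body>\n</html>\n";
const editStart = sourceHtml.indexOf("test");
const editedHtml = applyEdits(sourceHtml, [
  { start: editStart, end: editStart + "test".length, text: "changed" },
]);
// editedHtml differs from sourceHtml only inside the <div>; the newlines
// around </body> and </html> are preserved.
```

Because nothing is re-serialized, the resulting diff is limited to the replaced ranges, which is the "minimal diffs" property asked for in the first post.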

github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Nov 20, 2022

wooorm closed this as not planned Nov 20, 2022

wooorm added the 🙋 no/question This does not need any changes label Nov 20, 2022

github-actions bot added 👎 phase/no Post cannot or will not be acted on and removed 🤞 phase/open Post is being triaged manually labels Nov 20, 2022
