re: missing whitespace in root node #34

Closed
4 tasks done
milahu opened this issue Nov 20, 2022 · 5 comments
Labels
🙋 no/question This does not need any changes 👎 phase/no Post cannot or will not be acted on

Comments

@milahu

milahu commented Nov 20, 2022

Initial checklist

Affected packages and versions

#33

Link to runnable example

No response

Steps to reproduce

#33

This is required by the HTML standard, which we follow.

so my input html is invalid?

<!doctype html>
<html lang="en">
  <head>
    <title>test</title>
  </head>
  <body>
    <div>test</div>
  </body>
</html>

is valid html per https://validator.w3.org/nu/#textarea

the unified toolchain mangles this to

<!doctype html><html lang="en"><head>
    <title>test</title>
  </head>
  <body>
    <div>test</div>
  

</body></html>

the unified toolchain

i dont know where exactly the bug is.
maybe in hast-util-to-html, maybe somewhere else

That behavior is not always useful.

its certainly useful for writing lossless transformers, which produce minimal diffs

maybe this could cause trouble in some tree consumers like hast-util-select
which expect "no whitespace in the root node"?

If you want pretty HTML, use rehype-format!

no, i want to preserve the original whitespace

Expected behavior

#33

Actual behavior

#33

Affected runtime and version

#33

Affected package manager and version

No response

Affected OS and version

No response

Build and bundle tools

No response

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Nov 20, 2022
@wooorm
Member

wooorm commented Nov 20, 2022

You can still comment on closed issues.

This isn’t about valid or invalid.
This is about the HTML standard prescribing how to parse HTML.
We follow what the HTML standard prescribes for parsing HTML.

We use an AST (abstract). Not a CST (concrete). We don’t generate the exact input document. We generate an equivalent document.
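
As a rough illustration of why the output in this issue differs from the input: in the HTML standard's tree-construction stage, what happens to a whitespace-only text token depends on the current insertion mode. The sketch below, in Python, tallies this for the document posted above. The mode names come from the standard; the hand-written `TRACE`, the rule strings, and the function `place_whitespace` are hypothetical simplifications for this thread and not how parse5 or rehype-parse are implemented. The trace also assumes the input file ends with a newline.

```python
# What the tree builder does with a whitespace-only text token,
# keyed by insertion mode (only the modes this document reaches).
WHITESPACE_RULE = {
    "initial": "ignore",
    "before html": "ignore",
    "before head": "ignore",
    "in head": "insert into head",
    "after head": "insert into html",
    "in body": "insert into body",
    "after body": "insert into body",        # processed with the "in body" rules
    "after after body": "insert into body",  # processed with the "in body" rules
}

# Hand-written trace of the issue's document: the insertion mode in effect
# when each run of whitespace arrives. A real parser derives the modes
# from the tag tokens; here they are written out by hand (assumption).
TRACE = [
    ("before html", "\n"),       # after <!doctype html>
    ("before head", "\n  "),     # after <html lang="en">
    ("in head", "\n    "),       # after <head>
    ("in head", "\n  "),         # after </title>
    ("after head", "\n  "),      # after </head>
    ("in body", "\n    "),       # after <body>
    ("in body", "\n  "),         # after </div>
    ("after body", "\n"),        # after </body>
    ("after after body", "\n"),  # after </html> (assumed trailing newline)
]


def place_whitespace(trace):
    """Return (whitespace kept per element, list of dropped whitespace)."""
    kept = {"html": "", "head": "", "body": ""}
    dropped = []
    for mode, whitespace in trace:
        rule = WHITESPACE_RULE[mode]
        if rule == "ignore":
            dropped.append(whitespace)
        else:
            target = rule.split()[-1]  # "insert into body" -> "body"
            kept[target] += whitespace
    return kept, dropped


kept, dropped = place_whitespace(TRACE)
print("dropped:", dropped)
print("ends up in body:", repr(kept["body"]))
```

The two dropped runs are the ones missing between `<!doctype html>`, `<html lang="en">` and `<head>` in the output above, and the whitespace that followed `</body>` and `</html>` in the input ends up inside `body`.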

@wooorm wooorm closed this as not planned Nov 20, 2022
@wooorm wooorm added the 🙋 no/question This does not need any changes label Nov 20, 2022

@github-actions github-actions bot added 👎 phase/no Post cannot or will not be acted on and removed 🤞 phase/open Post is being triaged manually labels Nov 20, 2022
@milahu
Author

milahu commented Nov 20, 2022

sorry .__.
please feel free to merge the issues

We follow what the HTML standard prescribes for parsing HTML.

rendering html != transforming html

We use an AST (abstract). Not a CST (concrete). We don’t generate the exact input document. We generate an equivalent document.

the parse tree is 99% concrete already.
it contains all the whitespace, except around the root node.
making it 100% concrete is a small fix with a large gain (at least for my use case).
at least this could be made optional

the resulting parser may be less "strict" and more "loose" than a perfectly spec-compliant parser ...
but so what? whats the punishment of breaking the spec?
except that html rendering is 0.1% slower, to skip some whitespace nodes

@wooorm
Member

wooorm commented Nov 20, 2022

please feel free to merge the issues

That’s not possible through the GH UI, unfortunately!

making it 100% concrete is a small fix

It’s extremely complex. A concrete tree would house all original sources of character references. It would house whitespace inside tags. It would house the casing of attributes. It would include information on whether double or single quotes are used. It would include information in doctypes and processing instructions. It would contain casing of tag names. It would be months of work to create.
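
To make the list above concrete, here is a small sketch using Python's standard-library `html.parser` as a stand-in. It is not the parser used by hast or parse5 and it is not a spec-compliant HTML5 tree builder; it is used only because it exposes both a normalized view of a start tag and its raw source text. The class name `Recorder` and the sample input are made up for illustration.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Record each start tag twice: as abstract data and as raw source."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.abstract = []  # what a typical AST would keep
        self.concrete = []  # what a CST would additionally need

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased, quotes are stripped,
        # and character references in attribute values are decoded.
        self.abstract.append((tag, attrs))
        # The raw text still has casing, spacing, quote style, and "&amp;".
        self.concrete.append(self.get_starttag_text())


source = "<DIV  Title='a &amp; b' >"
recorder = Recorder()
recorder.feed(source)
recorder.close()

print(recorder.abstract)
print(recorder.concrete)
```

The abstract record keeps only `div`, `title`, and the decoded value. The tag-name casing, the double space, the single quotes, the `&amp;` spelling, and the space before `>` survive only in the raw text, which is the kind of detail a concrete tree would have to model for every node type.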

at least this could be made optional

It’s a ton of work, probably a full year to implement a new parser and design the data structure. If you are interested in paying me to make it reality I would likely quote you $200k.

the resulting parser may be less "strict" and more "loose" than a perfectly spec-compliant parser ...

This project adheres to the HTML spec.

whats the punishment of breaking the spec?

In this particular case of only the whitespace? It would indeed probably not be terribly complex to implement, but parse5 doesn’t support it either. Feel free to investigate.

But we have had similar conversations about this: hast is for the HTML spec. If you want XML, use xast. If you want to use a different project that does not adhere to the spec, feel free to make it yourself.

@milahu
Author

milahu commented Nov 20, 2022

thanks.

feel free to make it yourself.

yes. i would use a concrete parser like

with a different transformer (different node names, different data structure)
which is cheap, because my transformer is simple

edit: for my use case "html to html transformer"
i would use a different framework, probably eslint
where i have concrete parsers by default
and where i can modify the source string (not modify the parse tree)
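
A minimal sketch of that "modify the source string, not the parse tree" approach, written in Python with the standard-library `html.parser` rather than eslint or any unified package. The parser is used only to find source offsets; the edits are applied to the original string, so every untouched character, including whitespace around the root element, is preserved. `TagSpans` and `rename_tag` are hypothetical names, and the sketch assumes tags are written without a space after `<` or `</`.

```python
from html.parser import HTMLParser


class TagSpans(HTMLParser):
    """Collect (start, end) source offsets of the name in each matching tag."""

    def __init__(self, source, name):
        super().__init__(convert_charrefs=False)
        self.name = name
        self.spans = []
        # Offset of the first character of every line, to turn the
        # (line, column) pairs from getpos() into absolute offsets.
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)

    def _offset(self):
        line, column = self.getpos()  # position of the "<" of the current tag
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag == self.name:
            start = self._offset() + 1  # skip "<"
            self.spans.append((start, start + len(tag)))

    def handle_endtag(self, tag):
        if tag == self.name:
            start = self._offset() + 2  # skip "</"
            self.spans.append((start, start + len(tag)))

    def handle_startendtag(self, tag, attrs):
        # "<div/>" has no separate end tag, so record only the start.
        self.handle_starttag(tag, attrs)


def rename_tag(source, old, new):
    """Rename every <old> element to <new> by patching the source string."""
    finder = TagSpans(source, old)
    finder.feed(source)
    finder.close()
    result = source
    # Apply edits from the end so earlier offsets stay valid.
    for start, end in sorted(finder.spans, reverse=True):
        result = result[:start] + new + result[end:]
    return result


document = (
    '<!doctype html>\n<html lang="en">\n  <head>\n    <title>test</title>\n'
    "  </head>\n  <body>\n    <div>test</div>\n  </body>\n</html>\n"
)
print(rename_tag(document, "div", "section"))
```

Because only the two tag-name spans are rewritten, the diff against the input is limited to the `div` line, which is the minimal-diff behavior asked for in this issue.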
