HTML in text handling error? #373

designosis · 2022-05-28T04:57:28Z

In the demo, with the text <p>Guess what?</p> in the first field, and a space after the ? in the second ...

It seems to think the </ has changed as well? Perhaps it groups all non-alphanumerics? I'd suggest adding a parameter that gives tags special treatment.

Edit: I guess this isn't designed to handle HTML tags at all :) Is there a way to render existing tags (such as <p></p> or <h1></h1>)?

ExplodingCabbage · 2024-01-10T12:37:25Z

> Perhaps it groups all non-alphanumerics?

Yeah, essentially this. diffWords basically splits the text into an array of tokens where each token can be:

a word, or
a newline, or
a run of non-newline whitespace characters, or
a run of punctuation characters

and then diffs the sequence of tokens. So without the space, '?</' is one token, and with the space, '?', ' ', '</' are three distinct tokens.
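To make that concrete, here is a rough sketch of the four token classes listed above. It is only an approximation written for illustration, not jsdiff's actual tokenizer: the function name `tokenizeLikeDiffWords` is made up, and it assumes a "word" is a run of ASCII letters, digits or underscores.

```javascript
// Approximation of the token classes described above; NOT jsdiff's real tokenizer.
// Assumption: a "word" is a run of ASCII letters, digits or underscores (\w).
function tokenizeLikeDiffWords(text) {
  // Alternatives, in order: newline | non-newline whitespace | word | punctuation run
  return text.match(/\n|[^\S\n]+|\w+|[^\w\s]+/g) || [];
}

const withoutSpace = tokenizeLikeDiffWords("<p>Guess what?</p>");
const withSpace = tokenizeLikeDiffWords("<p>Guess what? </p>");

console.log(JSON.stringify(withoutSpace));
// ["<","p",">","Guess"," ","what","?</","p",">"]
console.log(JSON.stringify(withSpace));
// ["<","p",">","Guess"," ","what","?"," ","</","p",">"]
```

Because the diff works on whole tokens, the single token '?</' on one side has no equal on the other side, so it shows up as changed even though only a space was added.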

ExplodingCabbage · 2024-01-10T12:55:43Z

More thoughts:

I'm hoping to significantly rework the diffWords behaviour soon in the next breaking-changes release of jsdiff, but tbh I don't think I'll fix this. diffWords is intended for natural language text, not HTML or other code. To do some kind of semantics-aware diff of HTML, you'll want to tokenize the text yourself. For example, with the help of an HTML parser, you could construct an array where each token is one of:

a doctype declaration
a tag
a comment
an attribute name+value
the text content of an element

Then you could diff this with diffArrays, perhaps using a comparator that ignores semantically-irrelevant whitespace changes. (I've recently added a brief section in the README discussing this approach.)
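A minimal sketch of that idea, under some loud assumptions: `tokenizeHtml` and `sameToken` are hypothetical helpers invented here, the regex is a crude stand-in for a real HTML parser, and each tag is kept as a single token rather than splitting out attributes. The comparator also assumes that whitespace differences never matter, which is not true inside `<pre>`, for instance.

```javascript
// Hypothetical helper: a crude regex tokenizer standing in for a real HTML parser.
// Each token is a comment, a doctype/declaration, a whole tag, or a run of text.
function tokenizeHtml(html) {
  return html.match(/<!--[\s\S]*?-->|<![^>]*>|<\/?[^>]+>|[^<]+/g) || [];
}

// Hypothetical comparator: tokens count as equal if they differ only in
// leading/trailing whitespace or in the length of internal whitespace runs.
// Assumption: such whitespace is irrelevant (false inside <pre>, for example).
function sameToken(left, right) {
  const normalize = (s) => s.replace(/\s+/g, " ").trim();
  return normalize(left) === normalize(right);
}

const oldTokens = tokenizeHtml("<p>Guess what?</p>");
const newTokens = tokenizeHtml("<p>Guess what? </p>");

console.log(JSON.stringify(oldTokens)); // ["<p>","Guess what?","</p>"]
console.log(JSON.stringify(newTokens)); // ["<p>","Guess what? ","</p>"]

// With the comparator, every token pair matches, so a token-level diff
// of these two arrays would have nothing to report.
const unchanged =
  oldTokens.length === newTokens.length &&
  oldTokens.every((token, i) => sameToken(token, newTokens[i]));
console.log(unchanged); // true

// With jsdiff available, the arrays could then be handed to diffArrays, e.g.:
// const changes = diffArrays(oldTokens, newTokens, { comparator: sameToken });
```

Here the tags stay intact as tokens, so adding a space after the ? can only ever affect the text token, and the comparator decides whether that difference counts.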

The crux from my perspective, though, is that a tokenization approach optimised for meaningfully diffing HTML is likely to look very different from one meant for diffing natural language text, because the syntax of HTML is meaningfully different from that of ordinary text. So I wouldn't even really want to add options to diffWords to try to optimise it for diffing code; I'd expect it to still do a poor job and end up confusingly complicated in the process.

It's fine, of course, to ignore this and use diffWords as a quick-and-dirty way of diffing code if warts like the one shown in this issue aren't important to you. I just think this approach should be seen as a quick-and-dirty hack, and that users shouldn't expect diffWords to magically work nicely on both prose and code. Right now it kind of sucks for both (see #311, #214, #436 for some of the ways it behaves horribly even on prose), and I'd like to write off using it for diffing code as something we don't support so I have more freedom to try to fix those issues and optimise it for diffing prose.

For that reason, although this issue is not unreasonable, I'm gonna close it as "Won't fix".

ExplodingCabbage added the diffWords behaviour label Jan 10, 2024

ExplodingCabbage closed this as not planned Jan 10, 2024
