Commit 59161e0

Fix diffWords handling of whitespace (#497)
1 parent 490f5ab commit 59161e0

9 files changed: +611 -180 lines changed

Diff for: README.md

+3 -4
@@ -36,16 +36,14 @@ Broadly, jsdiff's diff functions all take an old text and a new text and perform
 Options
 * `ignoreCase`: If `true`, the uppercase and lowercase forms of a character are considered equal. Defaults to `false`.

-* `Diff.diffWords(oldStr, newStr[, options])` - diffs two blocks of text, treating each word and each word separator (punctuation, newline, or run of whitespace) as a token.
-
-(Whitespace-only tokens are automatically treated as equal to each other, so changes like changing a space to a newline or a run of multiple spaces will be ignored.)
+* `Diff.diffWords(oldStr, newStr[, options])` - diffs two blocks of text, treating each word and each punctuation mark as a token. Whitespace is ignored when computing the diff (but preserved as far as possible in the final change objects).

 Returns a list of [change objects](#change-objects).

 Options
 * `ignoreCase`: Same as in `diffChars`. Defaults to false.

-* `Diff.diffWordsWithSpace(oldStr, newStr[, options])` - same as `diffWords`, except whitespace-only tokens are not automatically considered equal, so e.g. changing a space to a tab is considered a change.
+* `Diff.diffWordsWithSpace(oldStr, newStr[, options])` - diffs two blocks of text, treating each word, punctuation mark, newline, or run of (non-newline) whitespace as a token.

 * `Diff.diffLines(oldStr, newStr[, options])` - diffs two blocks of text, treating each line as a token.
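
The two tokenization rules described in the README text above can be illustrated with a rough sketch. The function names and regular expressions below are assumptions made for illustration; they are not jsdiff's actual tokenizers, they only handle ASCII word characters, and they do not model how `diffWords` preserves whitespace in the final change objects.

```javascript
// Illustrative only: these regexes are assumptions, not jsdiff's real tokenizers.

// Rule described for diffWords: each word and each punctuation mark is a
// token; whitespace is skipped entirely.
function tokenizeWordsSketch(text) {
  return text.match(/\w+|[^\w\s]/g) || [];
}

// Rule described for diffWordsWithSpace: each word, punctuation mark,
// newline, or run of non-newline whitespace is a token.
function tokenizeWordsWithSpaceSketch(text) {
  return text.match(/\w+|\r?\n|[^\S\r\n]+|[^\w\s]/g) || [];
}

const sample = 'foo  bar,\nbaz';

console.log(tokenizeWordsSketch(sample));
// [ 'foo', 'bar', ',', 'baz' ]

console.log(tokenizeWordsWithSpaceSketch(sample));
// [ 'foo', '  ', 'bar', ',', '\n', 'baz' ]
```

Under the first rule, changing the double space to a single space produces the same token list, so no change would be reported; under the second rule the whitespace token itself differs.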

@@ -184,6 +182,7 @@ For even more customisation of the diffing behavior, you can create a `new Diff.
 * `removeEmpty(array)`: called on the arrays of tokens returned by `tokenize` and can be used to modify them. Defaults to stripping out falsey tokens, such as empty strings. `diffArrays` overrides this to simply return the `array`, which means that falsey values like empty strings can be handled like any other token by `diffArrays`.
 * `equals(left, right, options)`: called to determine if two tokens (one from the old string, one from the new string) should be considered equal. Defaults to comparing them with `===`.
 * `join(tokens)`: gets called with an array of consecutive tokens that have either all been added, all been removed, or are all common. Needs to join them into a single value that can be used as the `value` property of the [change object](#change-objects) for these tokens. Defaults to simply returning `tokens.join('')`.
+* `postProcess(changeObjects)`: gets called at the end of the algorithm with the [change objects](#change-objects) produced, and can do final cleanups on them. Defaults to simply returning `changeObjects` unchanged.

 ### Change Objects
 Many of the methods above return change objects. These objects consist of the following fields:

Diff for: release-notes.md

+17 -9
@@ -4,17 +4,25 @@

 [Commits](https://github.com/kpdecker/jsdiff/compare/master...v6.0.0-staging)

-- [#435](https://github.com/kpdecker/jsdiff/pull/435) Fix `parsePatch` handling of control characters. `parsePatch` used to interpret various unusual control characters - namely vertical tabs, form feeds, lone carriage returns without a line feed, and EBCDIC NELs - as line breaks when parsing a patch file. This was inconsistent with the behavior of both JsDiff's own `diffLines` method and also the Unix `diff` and `patch` utils, which all simply treat those control characters as ordinary characters. The result of this discrepancy was that some well-formed patches - produced either by `diff` or by JsDiff itself and handled properly by the `patch` util - would be wrongly parsed by `parsePatch`, with the effect that it would disregard the remainder of a hunk after encountering one of these control characters.
+- [#497](https://github.com/kpdecker/jsdiff/pull/497) **`diffWords` behavior has been radically changed.** Previously, even with `ignoreWhitespace: true`, runs of whitespace were tokens, which led to unhelpful and unintuitive diffing behavior in typical texts. Specifically, even when two texts contained overlapping passages, `diffWords` would sometimes choose to delete all the words from the old text and insert them anew in their new positions in order to avoid having to delete or insert whitespace tokens. Whitespace sequences are no longer tokens as of this release, which affects both the generated diffs and the `count`s.
+
+Runs of whitespace are still tokens in `diffWordsWithSpace`.
+
+As part of the changes to `diffWords`, **a new `.postProcess` method has been added on the base `Diff` type**, which can be overridden in custom `Diff` implementations.
+
+**`diffLines` with `ignoreWhitespace: true` will no longer ignore the insertion or deletion of entire extra lines of whitespace at the end of the text**. Previously, these would not show up as insertions or deletions, as a side effect of a hack in the base diffing algorithm meant to help ignore whitespace in `diffWords`. More generally, **the undocumented special handling in the core algorithm for ignored terminals has been removed entirely.** (This special case behavior used to rewrite the final two change objects in a scenario where the final change object was an addition or deletion and its `value` was treated as equal to the empty string when compared using the diff object's `.equals` method.)
+
 - [#500](https://github.com/kpdecker/jsdiff/pull/500) **`diffChars` now diffs Unicode code points** instead of UTF-16 code units.
-- [#439](https://github.com/kpdecker/jsdiff/pull/439) Prefer diffs that order deletions before insertions. When faced with a choice between two diffs with an equal total edit distance, the Myers diff algorithm generally prefers one that does deletions before insertions rather than insertions before deletions. For instance, when diffing `abcd` against `acbd`, it will prefer a diff that says to delete the `b` and then insert a new `b` after the `c`, over a diff that says to insert a `c` before the `b` and then delete the existing `c`. JsDiff deviated from the published Myers algorithm in a way that led to it having the opposite preference in many cases, including that example. This is now fixed, meaning diffs output by JsDiff will more accurately reflect what the published Myers diff algorithm would output.
-- [#455](https://github.com/kpdecker/jsdiff/pull/455) The `added` and `removed` properties of change objects are now guaranteed to be set to a boolean value. (Previously, they would be set to `undefined` or omitted entirely instead of setting them to false.)
+- [#435](https://github.com/kpdecker/jsdiff/pull/435) **Fix `parsePatch` handling of control characters.** `parsePatch` used to interpret various unusual control characters - namely vertical tabs, form feeds, lone carriage returns without a line feed, and EBCDIC NELs - as line breaks when parsing a patch file. This was inconsistent with the behavior of both JsDiff's own `diffLines` method and also the Unix `diff` and `patch` utils, which all simply treat those control characters as ordinary characters. The result of this discrepancy was that some well-formed patches - produced either by `diff` or by JsDiff itself and handled properly by the `patch` util - would be wrongly parsed by `parsePatch`, with the effect that it would disregard the remainder of a hunk after encountering one of these control characters.
+- [#439](https://github.com/kpdecker/jsdiff/pull/439) **Prefer diffs that order deletions before insertions.** When faced with a choice between two diffs with an equal total edit distance, the Myers diff algorithm generally prefers one that does deletions before insertions rather than insertions before deletions. For instance, when diffing `abcd` against `acbd`, it will prefer a diff that says to delete the `b` and then insert a new `b` after the `c`, over a diff that says to insert a `c` before the `b` and then delete the existing `c`. JsDiff deviated from the published Myers algorithm in a way that led to it having the opposite preference in many cases, including that example. This is now fixed, meaning diffs output by JsDiff will more accurately reflect what the published Myers diff algorithm would output.
+- [#455](https://github.com/kpdecker/jsdiff/pull/455) **The `added` and `removed` properties of change objects are now guaranteed to be set to a boolean value.** (Previously, they would be set to `undefined` or omitted entirely instead of setting them to false.)
 - [#464](https://github.com/kpdecker/jsdiff/pull/464) Specifying `{maxEditLength: 0}` now sets a max edit length of 0 instead of no maximum.
-- [#460](https://github.com/kpdecker/jsdiff/pull/460) Added `oneChangePerToken` option.
-- [#467](https://github.com/kpdecker/jsdiff/pull/467) When passing a `comparator(left, right)` to `diffArrays`, values from the old array will now consistently be passed as the first argument (`left`) and values from the new array as the second argument (`right`). Previously this was almost (but not quite) always the other way round.
-- [#480](https://github.com/kpdecker/jsdiff/pull/480) Passing `maxEditLength` to `createPatch` & `createTwoFilesPatch` now works properly (i.e. returns undefined if the max edit distance is exceeded; previous behavior was to crash with a `TypeError` if the edit distance was exceeded).
-- [#486](https://github.com/kpdecker/jsdiff/pull/486) The `ignoreWhitespace` option of `diffLines` behaves more sensibly now. `value`s in returned change objects now include leading/trailing whitespace even when `ignoreWhitespace` is used, just like how with `ignoreCase` the `value`s still reflect the case of one of the original texts instead of being all-lowercase. `ignoreWhitespace` is also now compatible with `newlineIsToken`. Finally, `diffTrimmedLines` is deprecated (and removed from the docs) in favour of using `diffLines` with `ignoreWhitespace: true`; the two are, and always have been, equivalent.
-- [#490](https://github.com/kpdecker/jsdiff/pull/490) When calling diffing functions in async mode by passing a `callback` option, the diff result will now be passed as the *first* argument to the callback instead of the second. (Previously, the first argument was never used at all and would always have value `undefined`.)
-- [#489](github.com/kpdecker/jsdiff/pull/489) `this.options` no longer exists on `Diff` objects. Instead, `options` is now passed as an argument to methods that rely on options, like `equals(left, right, options)`. This fixes a race condition in async mode, where diffing behaviour could be changed mid-execution if a concurrent usage of the same `Diff` instances overwrote its `options`.
+- [#460](https://github.com/kpdecker/jsdiff/pull/460) **Added `oneChangePerToken` option.**
+- [#467](https://github.com/kpdecker/jsdiff/pull/467) **Consistent ordering of arguments to `comparator(left, right)`.** Values from the old array will now consistently be passed as the first argument (`left`) and values from the new array as the second argument (`right`). Previously this was almost (but not quite) always the other way round.
+- [#480](https://github.com/kpdecker/jsdiff/pull/480) **Passing `maxEditLength` to `createPatch` & `createTwoFilesPatch` now works properly** (i.e. returns undefined if the max edit distance is exceeded; previous behavior was to crash with a `TypeError` if the edit distance was exceeded).
+- [#486](https://github.com/kpdecker/jsdiff/pull/486) **The `ignoreWhitespace` option of `diffLines` behaves more sensibly now.** `value`s in returned change objects now include leading/trailing whitespace even when `ignoreWhitespace` is used, just like how with `ignoreCase` the `value`s still reflect the case of one of the original texts instead of being all-lowercase. `ignoreWhitespace` is also now compatible with `newlineIsToken`. Finally, **`diffTrimmedLines` is deprecated** (and removed from the docs) in favour of using `diffLines` with `ignoreWhitespace: true`; the two are, and always have been, equivalent.
+- [#490](https://github.com/kpdecker/jsdiff/pull/490) **When calling diffing functions in async mode by passing a `callback` option, the diff result will now be passed as the *first* argument to the callback instead of the second.** (Previously, the first argument was never used at all and would always have value `undefined`.)
+- [#489](github.com/kpdecker/jsdiff/pull/489) **`this.options` no longer exists on `Diff` objects.** Instead, `options` is now passed as an argument to methods that rely on options, like `equals(left, right, options)`. This fixes a race condition in async mode, where diffing behaviour could be changed mid-execution if a concurrent usage of the same `Diff` instances overwrote its `options`.

 ## v5.2.0
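
To make the removed "ignored terminals" special case in the release notes more concrete, here is a standalone sketch modeled on the block this commit deletes from `buildValues` in `src/diff/base.js`. The function name `mergeIgnoredTerminal` and the whitespace-insensitive comparator are hypothetical, introduced only for illustration; the comparator stands in for a diff object's `.equals` method.

```javascript
// Sketch of the OLD (now removed) behavior, not of anything in the new code.
// `equals` stands in for the diff object's `.equals` method.
function mergeIgnoredTerminal(components, equals) {
  const last = components[components.length - 1];
  if (
    components.length > 1
    && typeof last.value === 'string'
    && (
      (last.added && equals('', last.value))
      || (last.removed && equals(last.value, ''))
    )
  ) {
    // Fold the "ignored" final change into the previous change object
    // and drop it, so it never shows up as an insertion or deletion.
    components[components.length - 2].value += last.value;
    components.pop();
  }
  return components;
}

// Hypothetical comparator that treats whitespace-only strings as empty.
const equalsIgnoringWhitespace = (left, right) => left.trim() === right.trim();

const result = mergeIgnoredTerminal(
  [
    { value: 'foo\n', added: false, removed: false },
    { value: '  \n', added: true, removed: false }
  ],
  equalsIgnoringWhitespace
);

console.log(result);
// [ { value: 'foo\n  \n', added: false, removed: false } ]
```

In this sketch the trailing whitespace-only line is absorbed into the preceding common change, which matches the release notes' description of such lines previously not showing up as insertions or deletions.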

Diff for: src/diff/base.js

+7 -20
@@ -11,6 +11,7 @@ Diff.prototype = {
     let self = this;

     function done(value) {
+      value = self.postProcess(value, options);
       if (callback) {
         setTimeout(function() { callback(value); }, 0);
         return true;
@@ -41,7 +42,7 @@ Diff.prototype = {
     let newPos = this.extractCommon(bestPath[0], newString, oldString, 0, options);
     if (bestPath[0].oldPos + 1 >= oldLen && newPos + 1 >= newLen) {
       // Identity per the equality and tokenizer
-      return done(buildValues(self, bestPath[0].lastComponent, newString, oldString, self.useLongestToken, options));
+      return done(buildValues(self, bestPath[0].lastComponent, newString, oldString, self.useLongestToken));
     }

     // Once we hit the right edge of the edit graph on some diagonal k, we can
@@ -105,7 +106,7 @@ Diff.prototype = {

         if (basePath.oldPos + 1 >= oldLen && newPos + 1 >= newLen) {
           // If we have hit the end of both strings, then we are done
-          return done(buildValues(self, basePath.lastComponent, newString, oldString, self.useLongestToken, options));
+          return done(buildValues(self, basePath.lastComponent, newString, oldString, self.useLongestToken));
         } else {
           bestPath[diagonalPath] = basePath;
           if (basePath.oldPos + 1 >= oldLen) {
@@ -209,10 +210,13 @@ Diff.prototype = {
   },
   join(chars) {
     return chars.join('');
+  },
+  postProcess(changeObjects) {
+    return changeObjects;
   }
 };

-function buildValues(diff, lastComponent, newString, oldString, useLongestToken, options) {
+function buildValues(diff, lastComponent, newString, oldString, useLongestToken) {
   // First we convert our linked list of components in reverse order to an
   // array in the right order:
   const components = [];
@@ -256,22 +260,5 @@ function buildValues(diff, lastComponent, newString, oldString, useLongestToken,
     }
   }

-  // Special case handle for when one terminal is ignored (i.e. whitespace).
-  // For this case we merge the terminal into the prior string and drop the change.
-  // This is only available for string mode.
-  let finalComponent = components[componentLen - 1];
-  if (
-    componentLen > 1
-    && typeof finalComponent.value === 'string'
-    && (
-      (finalComponent.added && diff.equals('', finalComponent.value, options))
-      ||
-      (finalComponent.removed && diff.equals(finalComponent.value, '', options))
-    )
-  ) {
-    components[componentLen - 2].value += finalComponent.value;
-    components.pop();
-  }
-
   return components;
 }
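
The hook wiring added in this file (`done` now passes its result through `postProcess` before returning it or handing it to the callback) can be sketched in isolation. The objects `baseSketch` and `customSketch`, the `finish` method, and the empty-value cleanup are all hypothetical stand-ins for illustration; only the default `postProcess` body and the post-process-then-return shape are taken from the diff above.

```javascript
// Hypothetical stand-in for the base Diff type; only the hook wiring is modeled.
const baseSketch = {
  // Default hook, as in the diff above: return the change objects unchanged.
  postProcess(changeObjects) {
    return changeObjects;
  },

  // Mirrors the shape of `done`: post-process first, then either return
  // the value (sync mode) or deliver it to the callback (async mode).
  finish(changeObjects, options = {}) {
    const value = this.postProcess(changeObjects, options);
    if (options.callback) {
      setTimeout(() => options.callback(value), 0);
      return true;
    }
    return value;
  }
};

// A custom implementation overriding the hook. Dropping empty-valued
// change objects is an invented cleanup, chosen only to show an effect.
const customSketch = Object.create(baseSketch);
customSketch.postProcess = function(changeObjects) {
  return changeObjects.filter(change => change.value !== '');
};

const raw = [
  { value: 'foo', added: false, removed: false },
  { value: '', added: true, removed: false }
];

console.log(baseSketch.finish(raw).length);   // 2
console.log(customSketch.finish(raw).length); // 1
```

Because the hook runs inside the shared completion path, an override applies to both the synchronous and the callback-based mode without further changes.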
