Skip to content

Improve time complexity of tokenization regex used in diffSentences #580

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ExplodingCabbage
Copy link
Collaborator

@ExplodingCabbage ExplodingCabbage commented Feb 8, 2025

The performance bug here was essentially the same as what I describe at https://markamery.com/blog/quadratic-time-regexes/; essentially, doing a regex search using a regex that starts with a pattern repeated with + or * is by default a quadratic time operation, unless you put a lookbehind before it, which we weren't doing.

While adding tests to check I wasn't breaking anything with this fix, I realised some existing behaviour was perverse, so settled on a fix that changed some of the perverse behaviour while also fixing the "ReDOS" issue (Snyk reported it via email as such).

(As with many "ReDOS" "vulns", I have conflicting & complicated thoughts on whether it's reasonable to consider this a security vulnerability in the first place, as Snyk do. That question of labeling doesn't change that it's bad and should be fixed, though.)

@ExplodingCabbage ExplodingCabbage self-assigned this Feb 8, 2025
@ExplodingCabbage ExplodingCabbage marked this pull request as ready for review February 14, 2025 16:26
@ExplodingCabbage ExplodingCabbage merged commit 4c8f444 into master Feb 14, 2025
@ExplodingCabbage ExplodingCabbage deleted the fix-bad-time-complexity-of-diffSentences-regex branch February 14, 2025 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant