Skip to content

Improve speed of stdlib functions by replacing re uses #130167

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
donBarbos opened this issue Feb 16, 2025 · 7 comments
Open

Improve speed of stdlib functions by replacing re uses #130167

donBarbos opened this issue Feb 16, 2025 · 7 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@donBarbos
Copy link
Contributor

donBarbos commented Feb 16, 2025

We can often find the module re in the standard library modules but it can be replaced (if it is possible). I don't suggest removing it everywhere, there are places where its use is appropriate, but there are also places where it is an unnecessary solution and leads to unpleasant consequences (they can be found below)

Cons of regular expressions and reasons to replace regular expressions with functions and methods:

  1. We spend time to compile re pattern (one time, but anyway we spend it)
  2. In most cases simple string methods are faster (according to my benchmarks about 2x)
  3. We can remove import re which will affect import time
  4. Additionally: I think for those who don't know regular expressions, the code is more difficult to read and therefore difficult to maintain.

Important

For those who want to work on the issue, please:

  • Read https://devguide.python.org/getting-started/pull-request-lifecycle/ before anything else.
  • Select one function to improve. It's easier to review and possibly backport.
  • Always report benchmarks using pyperf, hyperfine, and tuna together with -X importtime to compare import times and execution time.
  • Open a pull request with the following title: gh-130167: Improve speed of `module.function` by replacing `re`

Linked PRs

@terryjreedy terryjreedy changed the title Improve speed of functions that use re of various stdlib modules Improve speed of functions by replacing re uses Feb 16, 2025
@terryjreedy terryjreedy added performance Performance or resource usage stdlib Python modules in the Lib dir labels Feb 16, 2025
@terryjreedy terryjreedy changed the title Improve speed of functions by replacing re uses Improve speed of stdlib functions by replacing re uses Feb 16, 2025
@terryjreedy terryjreedy added the pending The issue will be closed if no feedback is provided label Feb 16, 2025
@terryjreedy
Copy link
Member

  1. RE patterns are compiled once and then cached. So less relevant the more times a pattern is used.
  2. Please list RE patterns, replacements, and RE timings that do and do not include compile time. This will indicate max and min time saved, depending on reuses. Should a table be added to Regular Expression HOWTO, Use String Methods. Are there string methods other than replace and translate to consider?
  3. This is known, and there have been cases where we delay re imports until needed, if ever, to reduce startup time. But re import cannot be removed entirely unless all uses are removed.
  4. Core developers should know regular expressions, though the point applies to some other contributors.

Unless you find and list examples of useful replacements in an stdlib module, this is not properly an issue. If you have or do, make a PR. Or suggest a revision of the HOWTO section.

@donBarbos
Copy link
Contributor Author

@terryjreedy thank you for your comment and I know about the counterarguments you brought up but it doesn't change the increase in speed and the fact that we can get rid of unnecessary imports (sometimes or often, I don't know the statistics yet :)

I've only submitted one PR so far but I'm open to finding more examples since we use re in the stdlib all the time

@picnixz
Copy link
Member

picnixz commented Feb 16, 2025

I removed the paragraph for new contributions as this issue is not necessary something we want to address immediately. Before opening a PR, benchmarks must be given and be convincing. If we lose the readability at the cost of an improved import time, I'm not sure it's worth unless we're like 5 or 6 times faster and if it's an important module that is usually imported and that re is not usually imported.

@donBarbos
Copy link
Contributor Author

donBarbos commented Feb 16, 2025

If we lose the readability at the cost of an improved import time, I'm not sure it's worth unless we're like 5 or 6 times faster and if it's an important module that is usually imported and that re is not usually imported.

Usually, i think, the readability of code without regular expressions is better. And from a performance standpoint, I think the speed of the function is the main improvement, although import time is also important.

you can find import re very often and only one use

@picnixz
Copy link
Member

picnixz commented Feb 16, 2025

If you can remove re with simpler alternatives, then I'm fine. What I don't want to see is essentially PRs that would move lots of re imports to local ones if they hurt readability and don't improve start up time by much.

Note that it's also important to possibly plan for future extensions of the regex usage. Like getting the matched substring could be useful later. Now, I'm not against improving the various re usages whenever possible but benchmarks must always be given (with sufficient coverage).

@picnixz
Copy link
Member

picnixz commented Feb 16, 2025

I've restored your guidelines but added the requirements of having benchmarks

@picnixz picnixz removed the pending The issue will be closed if no feedback is provided label Feb 16, 2025
@picnixz
Copy link
Member

picnixz commented Feb 16, 2025

Considering I wrote #128983, I think we can treat it as a legit improvement. But let's only focus on changing re usages, not changing the imports or something else.

@picnixz picnixz added the type-feature A feature request or enhancement label Feb 16, 2025
AA-Turner added a commit that referenced this issue Mar 31, 2025
Co-authored-by: Marius Juston <[email protected]>
Co-authored-by: Pieter Eendebak <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
Co-authored-by: Marius Juston <[email protected]>
Co-authored-by: Pieter Eendebak <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants