Bytecode compilation is slow. It's often one of the biggest contributors
to the install step's sluggishness. For better or worse, we can't really
enable --no-compile by default as it has the potential to render certain
workflows permanently slower in a subtle way.[^1]
To improve the situation, bytecode compilation can be parallelized
across a pool of processes (or sub-interpreters on Python 3.14). I've
observed a 1.1x to 3x improvement in install step times locally.[^2]
This patch has been written to be relatively comprehensible, but for
posterity, these are the high-level implementation notes:
- We can't use compileall.compile_dir() because it spins up a new worker
pool on every invocation. If it's used as a "drop-in" replacement for
compileall.compile_file(), then the pool creation overhead will be
paid for every package installed. This is bad and kills most of the
gains. Redesigning the installation logic to compile everything at the
end was rejected for being too invasive (a key goal was to avoid
affecting the package installation order).
- A bytecode compiler is created right before package installation
starts and reused for all packages. Depending on platform and
workload, either a serial (in-process) compiler or a parallel
compiler will be used. Both share the same interface, accepting a
batch of Python filepaths to compile (see the sketch after this
list).
- This patch was designed to be as low-risk as reasonably possible.
pip does not contain any parallelized code, so introducing any sort
of parallelism poses a nontrivial risk. To minimize this risk, the
only code parallelized is the bytecode compilation code itself (~10
LOC). In addition, the package install order is unaffected, and pip
will fall back to serial compilation if parallelization is unsupported.
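For illustration, here is a minimal sketch of what that shared
compiler interface could look like. The class names and structure are
hypothetical stand-ins, not pip's actual internals:

```python
import compileall
import functools
from collections.abc import Iterable
from concurrent.futures import ProcessPoolExecutor

# quiet=1 suppresses the per-file "Compiling..." listing but still
# reports errors.
_compile_one = functools.partial(compileall.compile_file, quiet=1)


class SerialCompiler:
    """Compile a batch of Python files in-process."""

    def compile(self, paths: Iterable[str]) -> None:
        for path in paths:
            _compile_one(path)


class ParallelCompiler:
    """Compile a batch of Python files across a reusable worker pool."""

    def __init__(self, workers: int) -> None:
        # The pool is created once, before installation begins, and is
        # reused for every package, so its startup cost is paid once.
        self._pool = ProcessPoolExecutor(max_workers=workers)

    def compile(self, paths: Iterable[str]) -> None:
        # A partial over a module-level function pickles cleanly, so
        # the work can be shipped directly to the worker processes.
        list(self._pool.map(_compile_one, paths))

    def shutdown(self) -> None:
        self._pool.shutdown()
```

The key property is that the parallel variant keeps one pool alive
across the entire install, which is exactly what repeated
compileall.compile_dir() calls can't offer.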
The criteria for parallelization are:
1. There are at least 2 CPUs available. The process CPU count is used
if available, otherwise the system CPU count. If there is only one
CPU, serial compilation will always be used because even a parallel
compiler with one worker will add extra overhead.
2. The maximum number of workers is at least 2. This is controlled by
the --install-jobs option.[^3] It defaults to "auto", which uses the
process/system CPU count.[^4]
3. There is "enough" code for parallelization to be "worth it". This
criterion exists so pip won't waste (say) 100ms on spinning up a
parallel compiler when compiling serially would only take 20ms.[^5]
The limit is set to 1 MB of Python code. This is admittedly rather
crude, but it seems to work well enough having tested on a variety of
systems.
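Condensed into a hypothetical sketch, the decision looks roughly like
this (the helper names and the exact CPU-count fallback chain are
illustrative assumptions, not pip's exact code):

```python
import os

_MAX_WORKERS = 8  # hard cap to avoid resource exhaustion (footnote 4)
_CODE_SIZE_THRESHOLD = 1024 * 1024  # 1 MB of Python source (criterion 3)


def _available_cpus() -> int:
    # Prefer the CPU count available to *this process* (respects CPU
    # affinity), falling back to the system-wide count.
    try:
        return len(os.sched_getaffinity(0))  # unavailable on some OSes
    except AttributeError:
        return os.cpu_count() or 1


def should_parallelize(install_jobs, code_size: int) -> bool:
    cpus = _available_cpus()
    if install_jobs == "auto":
        workers = min(cpus, _MAX_WORKERS)  # footnote 4's cap
    else:
        workers = int(install_jobs)
    return (
        cpus >= 2  # criterion 1: more than one CPU available
        and workers >= 2  # criterion 2: at least two workers allowed
        and code_size >= _CODE_SIZE_THRESHOLD  # criterion 3: enough code
    )
```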
[^1]: Basically, if the Python files are installed to a read-only
directory, then importing those files will be permanently slower,
as the .pyc files can never be written and cached. This is subtle
enough that we can't really expect newbies to recognise it or
know how to address it (there is the PYTHONPYCACHEPREFIX envvar,
but if you're advanced enough to use it, then you're also
advanced enough to know when to use uv or pip's --no-compile).
[^2]: The 1.1x was on a painfully slow dual-core, HDD-equipped Windows
machine installing just setuptools. The 3x was observed on my
main 8-core Ryzen 5800HS Windows machine while installing pip's
own test dependencies.
[^3]: Yes, this is probably not the best name, but adding an option for
just bytecode compilation seems silly. Anyway, this will give us
room if we ever parallelize more parts of the install step.
[^4]: Up to a hard-coded limit of 8 to avoid resource exhaustion. This
number was chosen arbitrarily, but is definitely high enough to
net a major improvement.
[^5]: This is important because I don't want to slow down tiny installs
(e.g., pip install six ... or our own test suite). Creating a new
process is prohibitively expensive on Windows (and to a lesser
degree on macOS) for various reasons, so parallelization can't
simply be used all of the time.