gh-129205: Experiment BytesIO._readfrom() #130098
Conversation
if estimate is not None:
    target_read = int(estimate) + 1
else:
    target_read = DEFAULT_BUFFER_SIZE
nit: `max(DEFAULT_BUFFER_SIZE, len(self._buffer) - self._pos)`, so that if there's already lots of space, it gets used.
I don't understand the purpose of this change well. If it's an optimization, do you have benchmark results showing that it makes a difference?
A couple of motivators for me:
I'll work on measuring and will open a more directed issue if the performance delta between the common code patterns today is significant.
Draft PR / Experiment
Rather than directly moving loops, I have been experimenting with a `BytesIO._readfrom(file, /, *, estimate=None, limit=None)` that efficiently encapsulates both the buffer resizing and the read loop, adding common code for the "estimated size" and "limit size" features. It can be used to implement `FileIO.readall` with minimal perf change (that's in the PR). In general I think "if there is a read loop, it should be faster and simpler". The `_pyio` implementation supports three kinds of IO objects: raw FD ints, those with a `.readinto` member, and those with a `.read` member. If that looks like a reasonable approach, I'd likely introduce it as an internal method `BytesIO._readfrom()` and move cases over (with perf tests to make sure things don't regress).

In the C implementation I included an optimization that avoids a heap allocation in the `estimate=0` case by using 1KB of stack space instead. I'm not sure that's worth the complexity and cost (if the stack buffer gets used, an extra copy is needed compared to just using a bytes; and in a warmed-up interpreter a 1KB PyBytes is likely to be quickly allocated anyway).

The CPython codebase has a common pattern of building a list of I/O chunks, then "join"ing them together at the end of the loop. I think readfrom makes a tradeoff in that case: as long as resizing infrequently copies (which I expect when not many other memory buffers are being allocated), it should be faster than that single extra-large join and copy at the end. I haven't run full performance numbers, though. In my mental model, using non-linear buffer resizing for a large readall is likely a much bigger performance gain, by reducing the number of allocs + deallocs, than the cost of any potential `realloc` copies; it definitely uses less memory overall.
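To make the idea concrete, here is a rough sketch of what a `_readfrom`-style helper could look like in pure Python. This is my own illustration, not the PR's actual code: the standalone name `readfrom`, the geometric growth policy, and the EOF handling are all assumptions; the real implementation would live as a method on `BytesIO`.

```python
import os
from io import DEFAULT_BUFFER_SIZE


def readfrom(file, /, *, estimate=None, limit=None):
    """Read until EOF (or up to limit) into one bytes object.

    Illustration only: handles raw FD ints, objects with .readinto,
    and objects with .read, growing a single bytearray rather than
    collecting chunks in a list.
    """
    # Size the initial buffer from the caller's estimate, if any.
    if estimate is not None:
        target = int(estimate) + 1
    else:
        target = DEFAULT_BUFFER_SIZE
    if limit is not None:
        target = min(target, limit)
    buf = bytearray(max(target, 1))
    pos = 0
    while True:
        if pos >= len(buf):
            if limit is not None and pos >= limit:
                break
            # Grow non-linearly to amortize resize copies.
            grow = max(len(buf) // 2, DEFAULT_BUFFER_SIZE)
            if limit is not None:
                grow = min(grow, limit - len(buf))
            buf.extend(bytes(grow))
        if isinstance(file, int):
            chunk = os.read(file, len(buf) - pos)
            n = len(chunk)
            buf[pos:pos + n] = chunk
        elif hasattr(file, "readinto"):
            n = file.readinto(memoryview(buf)[pos:])
        else:
            chunk = file.read(len(buf) - pos)
            n = len(chunk)
            buf[pos:pos + n] = chunk
        if not n:
            break  # EOF (would-block treated as EOF, for brevity)
        pos += n
    del buf[pos:]
    return bytes(buf)
```

A correct `estimate` means exactly one allocation and one short follow-up read to confirm EOF; `limit` caps both the buffer size and the number of bytes consumed.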
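The tradeoff against the chunk-list pattern can be sketched side by side (again my own illustration; the function names and the growth factor are assumptions, not code from the PR):

```python
import io


def read_all_chunks(file, chunk_size=8192):
    # Common CPython pattern: collect chunks, join once at the end.
    # The join allocates one large result while all chunks are still
    # alive, so peak memory is roughly 2x the data size.
    chunks = []
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def read_all_resize(file, chunk_size=8192):
    # readfrom-style pattern: grow one bytearray non-linearly and
    # read directly into it; no final join, fewer allocations.
    buf = bytearray()
    pos = 0
    while True:
        if pos >= len(buf):
            buf.extend(bytes(max(len(buf) // 2, chunk_size)))
        n = file.readinto(memoryview(buf)[pos:])
        if not n:
            break
        pos += n
    del buf[pos:]
    return bytes(buf)
```

The resize version wins when the underlying allocator can extend the buffer in place; when every grow copies, it degrades toward the join pattern's cost but still avoids holding two full copies at once.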