Skip to content

gh-88339: enable fast seeking of uncompressed unencrypted zipfile.ZipExtFile #27737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Aug 6, 2022
Merged

gh-88339: enable fast seeking of uncompressed unencrypted zipfile.ZipExtFile #27737

merged 7 commits into from
Aug 6, 2022

Conversation

JuniorJPDJ
Copy link
Contributor

@JuniorJPDJ JuniorJPDJ commented Aug 12, 2021

At the moment stored ZipExtFile is being read to the place of seek like all other compressed variants.
It's not needed as it's possible to freely seek uncompressed file inside zip without this penalty.

Lots of apps depend on ZipExtFile seeking ability and this patch would increase performance and lower IO penalty significantly.

It disables CRC checking after first seek as it's impossible to check CRC if we are not reading whole file.

I've been using patched zipfile for almost a year now and I can say it works good ;)

https://bugs.python.org/issue44173 / #88339

redo of #26227 after a little fail

Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the test_zipfile.py unittest explicitly cover this case of seeking within a ZIP_STORED entry? If so, mention that and drop a link to where in a comment here.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding an explicit seek + read test case to Lib/test/test_zipfile.py's StoredTestsWithSourceFile class that checks that the data read matches what is expected at the seek'ed to offset?

@JuniorJPDJ
Copy link
Contributor Author

JuniorJPDJ commented Aug 19, 2021

I don't get why this test fails. Do you maybe have any idea?
EDIT: fixed

@JuniorJPDJ
Copy link
Contributor Author

How about adding an explicit seek + read test case to Lib/test/test_zipfile.py's StoredTestsWithSourceFile class that checks that the data read matches what is expected at the seek'ed to offset?

Will try to do, thanks!

@serhiy-storchaka
Copy link
Member

There is a unittest explicitly covering this case. The one which fails.

@JuniorJPDJ
Copy link
Contributor Author

JuniorJPDJ commented Aug 26, 2021

There is a unittest explicitly covering this case. The one which fails.

You are right! It's that unit test. I added one assert to it.

There were problems with read buffer which I fixed now. I rebased and added fixup! commit without amending previous one to make review easier. Just please, squash them when merging.

Please run CI and tell me if I'm still missing something. ;)

BTW. Thanks for your work guys!

@JuniorJPDJ JuniorJPDJ requested a review from gpshead August 26, 2021 01:44
@JuniorJPDJ JuniorJPDJ changed the title bpo-44173: better approach for seeking in non-compressed ZipExtFile bpo-44173: [zipfile] better approach for seeking in non-compressed ZipExtFile Aug 26, 2021
@JuniorJPDJ JuniorJPDJ changed the title bpo-44173: [zipfile] better approach for seeking in non-compressed ZipExtFile bpo-44173: enable fast seeking of uncompressed unencrypted zipfile.ZipExtFile Aug 27, 2021
@JuniorJPDJ
Copy link
Contributor Author

JuniorJPDJ commented Aug 27, 2021

Previous tests passed (some asyncio on windows x64 failed, but it's probably not my fault).
I just rebased and added news entry - from my side it's ready to merge. Am I missing something?

@JuniorJPDJ
Copy link
Contributor Author

JuniorJPDJ commented Sep 19, 2021

Can I ask for review please?

EDIT: Just rebased my branch.

@JuniorJPDJ JuniorJPDJ changed the title bpo-44173: enable fast seeking of uncompressed unencrypted zipfile.ZipExtFile gh-88339: enable fast seeking of uncompressed unencrypted zipfile.ZipExtFile Apr 17, 2022
@JuniorJPDJ
Copy link
Contributor Author

I have made the requested changes; please review again

(test is already here, I added some tell asserts)

@bedevere-bot
Copy link

Thanks for making the requested changes!

@gpshead: please review the changes made to this pull request.

@JuniorJPDJ JuniorJPDJ requested a review from gpshead April 22, 2022 12:48
@JuniorJPDJ
Copy link
Contributor Author

It looks like I fixed all issues you guys found! :D

@JuniorJPDJ
Copy link
Contributor Author

Oh and I again forgot to tag you guys: @gpshead, @serhiy-storchaka

@JuniorJPDJ
Copy link
Contributor Author

I have made the requested changes; please review again

@bedevere-bot
Copy link

Thanks for making the requested changes!

@gpshead: please review the changes made to this pull request.

@JuniorJPDJ
Copy link
Contributor Author

Guys, what's wrong. It's over a year here now, I fixed everything you mentioned. Can I do something more to make this faster?

Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this could use more explicit testing around all possible seek edge cases, but that is true in general of the zipfile module and not specific to this PR. It appears to do the right thing and the existing tests that hopefully represent the more likely usages continue to pass.

@gpshead
Copy link
Member

gpshead commented Jun 15, 2022

If I haven't merged this within a few days, please ping me again. I'm giving Serhiy some time to respond if desired.

@JuniorJPDJ
Copy link
Contributor Author

JuniorJPDJ commented Jun 22, 2022

If I haven't merged this within a few days, please ping me again.

@gpshead ;)

Also I've a question - merging it now to main branch means this changes will be available from python 3.12, not 3.11?

@JuniorJPDJ
Copy link
Contributor Author

If I haven't merged this within a few days, please ping me again.

@gpshead ;)

@JuniorJPDJ
Copy link
Contributor Author

ping @gpshead

@JuniorJPDJ
Copy link
Contributor Author

ping @gpshead

@JuniorJPDJ
Copy link
Contributor Author

bump @gpshead

@gpshead gpshead merged commit 330f1d5 into python:main Aug 6, 2022
@gpshead
Copy link
Member

gpshead commented Aug 6, 2022

yep, this is in 3.12. the feature freeze window for a release closes upon the first beta.

@gpshead
Copy link
Member

gpshead commented Aug 6, 2022

thanks for the improvement!

@JuniorJPDJ JuniorJPDJ deleted the JuniorJPDJ/zipfile-store-seek branch August 7, 2022 18:43
iritkatriel pushed a commit to iritkatriel/cpython that referenced this pull request Aug 11, 2022
…le.ZipExtFile (pythonGH-27737)

Avoid reading all of the intermediate data in uncompressed items in a zip file when the user seeks forward.

Contributed by: @JuniorJPDJ
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants