Excessive memory usage, part 2 #265
Related is pypa/pip#9549, where someone tried to cache something bigger than the msgpack message size limit. Solution 2 in the issue description would solve that too.
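For context, the size ceiling that pip hit in that issue is baked into msgpack's wire format. A quick sketch of the relevant framing (not cachecontrol code):

```python
import struct

# The cap pypa/pip#9549 ran into comes from msgpack's largest byte-string
# container, bin 32: a 0xc6 marker followed by a 4-byte big-endian length,
# so no single msgpack byte string can exceed 2**32 - 1 bytes.
MAX_BIN_LEN = 2**32 - 1

# The bin 32 header for a 3-byte payload is the five bytes c6 00 00 00 03.
header = struct.pack(">BI", 0xC6, 3)
```

Anything larger than `MAX_BIN_LEN` simply cannot be represented as one msgpack byte string, regardless of available memory.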
I will think about the best/easiest approach and submit a PR at some point soon, but if you have a preference/suggestion/other ideas, let me know.
@itamarst I don't have a strong opinion offhand. I suspect trying to get changes into `msgpack` upstream could take a while. The other data format that I'm familiar with is Protobuf, but after some quick research it doesn't have support for streaming. There is Thrift to look into as well, I suppose, but I'm not really sure if there are others to be considered.
Oh, I hadn't read the part of the code where it actually constructs a dict with metadata. So it's not just the body in there...
There's... pickle. But that's a problematic format in many ways.
Updating as I learn: CacheControl 0.12.6 memory usage was 3× the size of the cached value; 0.12.7 and later are 2×-ish. So we went from 3× memory, 1× disk copies → 2× memory, 2× disk copies. Conceivably I could switch to 1× memory, 3× disk copies 😬 with another `mmap()`. But that starts feeling even more hacked together.
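The mmap trick mentioned here, in miniature: write the body to a file once, then map it, so the "in-memory" view is backed by the page cache rather than a second heap copy. This is an illustrative sketch, not cachecontrol's actual code:

```python
import mmap
import tempfile

# Write the body to a temporary file once...
f = tempfile.TemporaryFile()
f.write(b"cached response body")
f.flush()

# ...then mmap it: slicing the map reads pages on demand instead of
# requiring a second full copy of the body on the heap.
view = mmap.mmap(f.fileno(), 0)
first_word = bytes(view[:6])
view.close()
f.close()
```

The trade-off is exactly the one described above: each extra file-backed copy trades heap memory for disk I/O.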
Really we want constant memory usage; O(N) is a lot better than O(3N), but it's still pretty bad when N might be as high as 800MB (https://pypi.org/project/torch/#files). You can't get to constant (low) memory usage without changing the API design quite a lot, but perhaps it could be changed in a backwards-compatible way...
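Constant-memory caching essentially means copying the body through a fixed-size buffer. A minimal sketch, using a hypothetical helper (cachecontrol exposes no such API today):

```python
import io
import shutil

CHUNK = 64 * 1024  # 64 KiB buffer: peak memory is O(CHUNK), not O(body size)

def stream_body(source, sink):
    """Hypothetical streaming write: copy a file-like response body into
    a cache file chunk by chunk, so an 800MB wheel never sits in memory
    whole. Sketch only; not part of cachecontrol's public API."""
    shutil.copyfileobj(source, sink, length=CHUNK)

body = io.BytesIO(b"x" * 200_000)   # stand-in for a network response
cache_file = io.BytesIO()           # stand-in for an on-disk cache entry
stream_body(body, cache_file)
```

The catch, as noted above, is that today's serializer interface hands around complete byte strings, so plumbing a file-like object through is the hard part.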
This is an interesting design problem 😢. |
OK I think I have a design that works; there is a legacy cache API and a new cache API, and the new cache API stores metadata and body separately. |
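The separate-storage idea could look something like the following. The file layout, names, and use of JSON for metadata are all assumptions for illustration, not the actual new cache API:

```python
import json
import os
import tempfile

def write_entry(directory, key, metadata, body_chunks):
    """Sketch: metadata in one small file, raw body in another, so
    serializing metadata never requires copying the body."""
    with open(os.path.join(directory, key + ".meta"), "w") as f:
        json.dump(metadata, f)
    with open(os.path.join(directory, key + ".body"), "wb") as f:
        for chunk in body_chunks:  # body is streamed, never held whole
            f.write(chunk)

cache_dir = tempfile.mkdtemp()
write_entry(cache_dir, "entry", {"status": 200}, [b"part1", b"part2"])
```

Because the body file holds raw bytes, reading it back can also be zero-copy (e.g. via `mmap`), which a single combined msgpack blob can't offer.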
I wonder if it's worth supporting existing on-disk caches, though, or just repopulating... It would add a bunch of complexity, given the change in file format is so significant.
I guess I will have to preserve the old...
@itamarst Just to think aloud for a sec: the problem seems to be that by the time we're caching, we've already parsed and dumped the body. I wonder if there could be a lazy cache that accepts a callable as the data, which could then be smart about how things get serialized on the fly. If you went this route, you'd likely need to refactor the base cache class.

The general strategy I'm thinking about is: how can you inject the behavior you want without risking breaking the interface? While I can't think of another style of cache, that doesn't mean someone else won't over time, and this might make the practice of optimizing for memory vs. disk vs. ??? easier for others to deal with.

Just throwing it out there! Thanks for looking at this!
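The callable idea above could be sketched like this. Purely illustrative: in cachecontrol's real base cache class, `set()` takes the already-serialized bytes, not a callable:

```python
class LazyCache:
    """Sketch of a 'lazy' cache: set() takes a zero-argument callable
    yielding chunks, so each backend decides how (and whether) the body
    is materialized."""

    def __init__(self):
        self._store = {}

    def set(self, key, get_chunks):
        # A memory cache has to join the chunks; a file-backed cache
        # could instead write each chunk straight to disk as it arrives.
        self._store[key] = b"".join(get_chunks())

    def get(self, key):
        return self._store.get(key)

cache = LazyCache()
cache.set("wheel", lambda: (c for c in (b"meta", b"data")))
```

The point of the indirection is exactly the one raised here: new storage strategies plug in behind the same interface without the caller changing.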
I'm testing the new cachecontrol 0.12.8 with pip, and it only partially solves the excess memory usage: memory usage goes down from 1300MB to 900MB when doing `pip install tensorflow`.

The next bottleneck is the fact that `msgpack` doesn't have a streaming interface, so packing always makes a copy of the data in memory. (I believe something like this was mentioned as a potential issue in the previous issue #145, but unfortunately the test script I was using didn't use that code path.)

Some potential approaches:

1. Fix `msgpack` upstream to support streaming writes, perhaps coupled with a streaming API on the cachecontrol side.
2. Stop using `msgpack`, with perhaps no public API changes, assuming more `mmap` hackery.
3. Hand-roll the `msgpack` serialization just for this use case; essentially the code only packs a byte string, so we just need to write the appropriate header beforehand. Depending on what APIs are public this may be the least intrusive option; hacky, but it'll work.