Excessive memory usage, part 2 #265

Closed
itamarst opened this issue Nov 3, 2021 · 13 comments · Fixed by #271

itamarst commented Nov 3, 2021

I'm testing the new cachecontrol 0.12.8 with pip, and it only partially solves the excess memory usage: memory usage goes down from 1300MB to 900MB when doing pip install tensorflow.

The next bottleneck is the fact that msgpack doesn't have a streaming interface, so packing always makes a copy of the data in memory. (I believe something like this was mentioned as a potential issue in the previous issue #145, but unfortunately the test script I was using didn't use that code path.)
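
To make the problem concrete, here is a minimal sketch (the dict layout and sizes are made up for illustration, not cachecontrol's actual serialization schema):

```python
# Packing a dict that holds a large response body forces msgpack to build a
# second, equally large bytes object, because packb() has no streaming output.
import msgpack

body = b"x" * (100 * 1024 * 1024)  # stand-in for a large downloaded wheel
cached = {"response": {"body": body, "status": 200}}

packed = msgpack.packb(cached, use_bin_type=True)  # another ~100MB held in memory
```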

Some potential approaches:

  1. Modify msgpack upstream to support streaming writes, perhaps coupled with a streaming API on the cachecontrol side.
  2. Switch to a new data format where, instead of one giant bytestring, the msgpack payload is a series of bytestrings. This requires no changes to msgpack, and perhaps no public API changes assuming more mmap hackery.
  3. Re-implement msgpack serialization just for this use case; essentially the code only packs a bytestring, so we just need to write the appropriate header beforehand (see the sketch after this list). Depending on what APIs are public this may be the least intrusive option; hacky, but it'll work.
  4. No doubt others (e.g. some other data format).
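
A rough sketch of what option 3 could look like, relying on the msgpack bin 32 wire format (0xc6 marker followed by a 4-byte big-endian length); the function and argument names are hypothetical, not cachecontrol or msgpack APIs:

```python
import struct
from typing import IO, Iterable

def write_streamed_bin(out: IO[bytes], chunks: Iterable[bytes], total_len: int) -> None:
    """Emit a single msgpack bin 32 object by writing the header ourselves and
    then streaming the payload in chunks, so the whole value never sits in memory."""
    out.write(b"\xc6" + struct.pack(">I", total_len))  # bin 32 header: marker + length
    written = 0
    for chunk in chunks:
        out.write(chunk)
        written += len(chunk)
    assert written == total_len, "declared length must match the streamed payload"
```

For a payload larger than 64KB this should produce the same bytes as msgpack.packb(payload, use_bin_type=True), while only ever holding one chunk in memory at a time.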

itamarst commented Nov 3, 2021

Related is pypa/pip#9549, where someone tried to cache something bigger than the msgpack message size limit. Solution 2 above would solve that too.
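
Something along these lines could work for solution 2 (the chunk size and helper name are made up): since a single msgpack bin object is capped at 2**32 - 1 bytes (about 4GB), writing the body as a series of smaller bin chunks sidesteps the limit and keeps memory bounded by the chunk size.

```python
import msgpack

CHUNK_SIZE = 64 * 1024 * 1024  # any size comfortably under the per-object limit

def pack_body_chunked(body_file, out_file):
    """Append a series of msgpack bin objects to out_file, one per chunk read
    from body_file, so neither msgpack nor we ever hold the full body at once."""
    packer = msgpack.Packer(use_bin_type=True)
    while True:
        chunk = body_file.read(CHUNK_SIZE)
        if not chunk:
            break
        out_file.write(packer.pack(chunk))
```

Reading back could use msgpack.Unpacker over the file, which yields the chunks one at a time.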

itamarst commented Nov 3, 2021

I will think about the best/easiest approach and submit a PR at some point soon, but if you have a preference/suggestion/other ideas, let me know.

ionrock commented Nov 3, 2021

@itamarst I don't have a strong opinion offhand. I suspect trying to get some changes into msgpack might be a long process. I'd wonder if there is a way to keep the metadata separate and just maintain a file manually to support the streaming?
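
To make the "maintain a file manually" idea concrete, a rough sketch under assumed names (the .meta/.body layout and store_response are inventions for illustration only):

```python
import shutil
import msgpack

def store_response(cache_dir: str, key: str, metadata: dict, body_file) -> None:
    # The metadata dict is small, so msgpack's in-memory copy is harmless here.
    with open(f"{cache_dir}/{key}.meta", "wb") as f:
        f.write(msgpack.packb(metadata, use_bin_type=True))
    # The body is streamed file-to-file in fixed-size chunks: constant memory.
    with open(f"{cache_dir}/{key}.body", "wb") as f:
        shutil.copyfileobj(body_file, f)
```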

The other data format that I'm familiar with is Protobuf, but after some quick research it doesn't seem to support streaming either. There is Thrift to look into as well, I suppose, but I'm not really sure what other formats are worth considering.

itamarst commented Nov 4, 2021

Oh I hadn't read the part of the code where it actually constructs a dict with metadata. So it's not just the body in there...

itamarst commented Nov 4, 2021

There's... pickle. But that's a problematic format in many ways.

itamarst commented Nov 4, 2021

Updating as I learn:

CacheControl 0.12.6 memory usage was 3× the size of the cached value. 0.12.7 and later are 2×-ish: the msgpack dump, and then an extra copy to add the version prefix. I say "ish" because there's an extra copy in a temporary file, and some of that may be in memory and some of it may be dumped to disk if there is memory pressure.

So we went from 3× memory, 1× disk copies → 2× memory, 2× disk copies.

Conceivably I could switch to 1× memory, 3× disk copies 😬 with another mmap(). But that starts feeling even more hacked together.
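
For what it's worth, one of the in-memory copies counted above comes from concatenating the version prefix onto the packed data; in principle it could be avoided by writing the two pieces separately (the exact prefix bytes below are illustrative):

```python
VERSION_PREFIX = b"cc=4,"  # illustrative; whatever the serializer prepends

def write_with_copy(f, packed: bytes) -> None:
    f.write(VERSION_PREFIX + packed)  # concatenation builds a second full-size bytes object

def write_without_copy(f, packed: bytes) -> None:
    f.write(VERSION_PREFIX)  # two writes: no concatenation, no extra copy
    f.write(packed)
```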

itamarst commented Nov 4, 2021

Really we want constant memory usage; O(N) is a lot better than O(3N), but it is still pretty bad when N might be as high as 800MB (https://pypi.org/project/torch/#files). You can't get to constant (low) memory usage without changing the API design quite a lot, though perhaps it could be changed in a backwards-compatible way...

itamarst commented Nov 5, 2021

This is an interesting design problem 😢.

itamarst commented Nov 5, 2021

OK, I think I have a design that works: keep the legacy cache API, and add a new cache API that stores metadata and the body separately.
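
Roughly the kind of shape this could take (class and method names below are only a sketch, not the actual interface from the eventual PR):

```python
from typing import IO, Optional

class SeparateBodyCache:
    """New-style cache: metadata and body are stored under the same key but via
    separate calls, so the (large) body can be streamed instead of serialized."""

    def get(self, key: str) -> Optional[bytes]:
        raise NotImplementedError

    def set(self, key: str, value: bytes) -> None:
        raise NotImplementedError

    def get_body(self, key: str) -> Optional[IO[bytes]]:
        """Return a file-like object for the cached body, or None if missing."""
        raise NotImplementedError

    def set_body(self, key: str, body: IO[bytes]) -> None:
        """Store the body from a file-like object without reading it all into memory."""
        raise NotImplementedError
```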

itamarst commented Nov 5, 2021

I wonder if it's worth supporting existing on-disk caches, though, or just letting them repopulate... Supporting them would add a bunch of complexity given how significant the change in file format is.

itamarst commented Nov 5, 2021

I guess I will have to preserve the old FileCache, at least for backwards compatibility at the Python API level, and FileCacheV2 will not support existing cached on-disk data.

ionrock commented Nov 6, 2021

@itamarst Just to think aloud for a sec: the problem seems to be that by the time we're caching, we've already parsed and dumped the body. I wonder if there could be a lazy cache that accepts a callable as the data and can be smart about how things get serialized on the fly.

If you went this route, you'd likely need to refactor the base CacheController to allow some configuration on how you call cache.set, but you could pass in that LazyCacheController when instantiating the adapter. Similarly, you could couple that with a LazyCache interface that effectively expects something other than a string as the data to store in the cache.
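
Something like this, maybe (purely a sketch with made-up signatures; the chunk size and file layout are arbitrary):

```python
from typing import Callable, IO

class LazyCache:
    """Cache whose set() takes a zero-argument callable returning a file-like
    object, so the data is only produced (and serialized) while writing to disk."""

    def __init__(self, directory: str) -> None:
        self.directory = directory

    def set(self, key: str, get_data: Callable[[], IO[bytes]]) -> None:
        source = get_data()  # data is materialized lazily, right before writing
        with open(f"{self.directory}/{key}", "wb") as f:
            while True:
                chunk = source.read(64 * 1024)
                if not chunk:
                    break
                f.write(chunk)
```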

The general strategy I'm thinking about is how to inject the behavior you want without risking breaking the interface. While I can't think of another style of cache, that doesn't mean someone else won't over time, and this might make optimizing for memory vs. disk vs. ??? easier for others to deal with.

Just throwing it out there! Thanks for looking at this!
