Excessive memory usage, part 2 #265

Closed
itamarst opened this issue Nov 3, 2021 · 13 comments · Fixed by #271

itamarst commented Nov 3, 2021

I'm testing the new cachecontrol 0.12.8 with pip, and it only partially solves the excess memory usage: memory usage goes down from 1300MB to 900MB when doing pip install tensorflow.

The next bottleneck is the fact that msgpack doesn't have a streaming interface, so packing always makes a copy of the data in memory. (I believe something like this was mentioned as a potential issue in the previous issue #145, but unfortunately the test script I was using didn't use that code path.)
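
To make the problem concrete, here is a minimal sketch (the dict layout and sizes are made up for illustration, not cachecontrol's actual serialization schema):

```python
# Packing a dict that holds a large response body forces msgpack to build a
# second, equally large bytes object, because packb() has no streaming output.
import msgpack

body = b"x" * (100 * 1024 * 1024)  # stand-in for a large downloaded wheel
cached = {"response": {"body": body, "status": 200}}

packed = msgpack.packb(cached, use_bin_type=True)  # another ~100MB held in memory
```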

Some potential approaches:

  1. Modify msgpack upstream to support streaming writes, perhaps coupled with a streaming API on the cachecontrol side.
  2. Switch to a new data format where, instead of one giant bytestring, the msgpack payload is a series of bytestrings. This requires no changes to msgpack, and perhaps no public API changes assuming more mmap hackery.
  3. Re-implement msgpack serialization just for this use case; essentially the code only packs a bytestring, so we just need to write the appropriate header beforehand (see the sketch after this list). Depending on what APIs are public this may be the least intrusive option; hacky, but it'll work.
  4. No doubt others (e.g. some other data format).
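
A rough sketch of what option 3 could look like, relying on the msgpack bin 32 wire format (0xc6 marker followed by a 4-byte big-endian length); the function and argument names are hypothetical, not cachecontrol or msgpack APIs:

```python
import struct
from typing import IO, Iterable

def write_streamed_bin(out: IO[bytes], chunks: Iterable[bytes], total_len: int) -> None:
    """Emit a single msgpack bin 32 object by writing the header ourselves and
    then streaming the payload in chunks, so the whole value never sits in memory."""
    out.write(b"\xc6" + struct.pack(">I", total_len))  # bin 32 header: marker + length
    written = 0
    for chunk in chunks:
        out.write(chunk)
        written += len(chunk)
    assert written == total_len, "declared length must match the streamed payload"
```

For a payload larger than 64KB this should produce the same bytes as msgpack.packb(payload, use_bin_type=True), while only ever holding one chunk in memory at a time.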

itamarst commented Nov 3, 2021

Related is pypa/pip#9549, where someone tried to cache something bigger than the msgpack message size limit. Solution 2 above would solve that too.
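
Something along these lines could work for solution 2 (the chunk size and helper name are made up): since a single msgpack bin object is capped at 2**32 - 1 bytes (about 4GB), writing the body as a series of smaller bin chunks sidesteps the limit and keeps memory bounded by the chunk size.

```python
import msgpack

CHUNK_SIZE = 64 * 1024 * 1024  # any size comfortably under the per-object limit

def pack_body_chunked(body_file, out_file):
    """Append a series of msgpack bin objects to out_file, one per chunk read
    from body_file, so neither msgpack nor we ever hold the full body at once."""
    packer = msgpack.Packer(use_bin_type=True)
    while True:
        chunk = body_file.read(CHUNK_SIZE)
        if not chunk:
            break
        out_file.write(packer.pack(chunk))
```

Reading back could use msgpack.Unpacker over the file, which yields the chunks one at a time.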

itamarst commented Nov 3, 2021

I will think about the best/easiest approach and submit a PR at some point soon, but if you have a preference/suggestion/other ideas, let me know.

ionrock commented Nov 3, 2021

@itamarst I don't have a strong opinion offhand. I suspect trying to get some changes into msgpack might be a long process. I'd wonder if there is a way to keep the metadata separate and just maintain a file manually to support the streaming?
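
To make the "maintain a file manually" idea concrete, a rough sketch under assumed names (the .meta/.body layout and store_response are inventions for illustration only):

```python
import shutil
import msgpack

def store_response(cache_dir: str, key: str, metadata: dict, body_file) -> None:
    # The metadata dict is small, so msgpack's in-memory copy is harmless here.
    with open(f"{cache_dir}/{key}.meta", "wb") as f:
        f.write(msgpack.packb(metadata, use_bin_type=True))
    # The body is streamed file-to-file in fixed-size chunks: constant memory.
    with open(f"{cache_dir}/{key}.body", "wb") as f:
        shutil.copyfileobj(body_file, f)
```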

The other data format that I'm familiar with is Protobuf, but after some quick research it doesn't seem to support streaming either. There is Thrift to look into as well, I suppose, but I'm not really sure what other formats are worth considering.

itamarst commented Nov 4, 2021

Oh I hadn't read the part of the code where it actually constructs a dict with metadata. So it's not just the body in there...

itamarst commented Nov 4, 2021

There's... pickle. But that's a problematic format in many ways.

itamarst commented Nov 4, 2021

Updating as I learn:

CacheControl 0.12.6 memory usage was 3× the size of the cached value. 0.12.7 and later are 2×-ish: the msgpack dump, and then an extra copy to add the version prefix. I say "ish" because there's an extra copy in a temporary file, and some of that may be in memory and some of it may be dumped to disk if there is memory pressure.

So we went from 3× memory, 1× disk copies → 2× memory, 2× disk copies.

Conceivably I could switch to 1× memory, 3× disk copies 😬 with another mmap(). But that starts feeling even more hacked together.
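
For what it's worth, one of the in-memory copies counted above comes from concatenating the version prefix onto the packed data; in principle it could be avoided by writing the two pieces separately (the exact prefix bytes below are illustrative):

```python
VERSION_PREFIX = b"cc=4,"  # illustrative; whatever the serializer prepends

def write_with_copy(f, packed: bytes) -> None:
    f.write(VERSION_PREFIX + packed)  # concatenation builds a second full-size bytes object

def write_without_copy(f, packed: bytes) -> None:
    f.write(VERSION_PREFIX)  # two writes: no concatenation, no extra copy
    f.write(packed)
```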

itamarst commented Nov 4, 2021

Really we want constant memory usage; O(N) is a lot better than O(3N), but it is still pretty bad when N might be as high as 800MB (https://pypi.org/project/torch/#files). You can't get to constant (low) memory usage without changing the API design quite a lot, though perhaps it could be changed in a backwards-compatible way...

itamarst commented Nov 5, 2021

This is an interesting design problem 😢.

itamarst commented Nov 5, 2021

OK, I think I have a design that works: keep the legacy cache API, and add a new cache API that stores metadata and the body separately.
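
Roughly the kind of shape this could take (class and method names below are only a sketch, not the actual interface from the eventual PR):

```python
from typing import IO, Optional

class SeparateBodyCache:
    """New-style cache: metadata and body are stored under the same key but via
    separate calls, so the (large) body can be streamed instead of serialized."""

    def get(self, key: str) -> Optional[bytes]:
        raise NotImplementedError

    def set(self, key: str, value: bytes) -> None:
        raise NotImplementedError

    def get_body(self, key: str) -> Optional[IO[bytes]]:
        """Return a file-like object for the cached body, or None if missing."""
        raise NotImplementedError

    def set_body(self, key: str, body: IO[bytes]) -> None:
        """Store the body from a file-like object without reading it all into memory."""
        raise NotImplementedError
```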

itamarst commented Nov 5, 2021

I wonder if it's worth supporting existing on-disk caches, though, or just letting them repopulate... Supporting them would add a bunch of complexity given how significant the change in file format is.

itamarst commented Nov 5, 2021

I guess I will have to preserve the old FileCache, at least for backwards compatibility at the Python API level, and FileCacheV2 will not support existing cached on-disk data.

ionrock commented Nov 6, 2021

@itamarst Just to think aloud for a sec: the problem seems to be that by the time we're caching, we've already parsed and dumped the body. I wonder if there could be a lazy cache that accepts a callable as the data and can be smart about how things get serialized on the fly.

If you went this route, you'd likely need to refactor the base CacheController to allow some configuration on how you call cache.set, but you could pass in that LazyCacheController when instantiating the adapter. Similarly, you could couple that with a LazyCache interface that effectively expects something other than a string as the data to store in the cache.
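
Something like this, maybe (purely a sketch with made-up signatures; the chunk size and file layout are arbitrary):

```python
from typing import Callable, IO

class LazyCache:
    """Cache whose set() takes a zero-argument callable returning a file-like
    object, so the data is only produced (and serialized) while writing to disk."""

    def __init__(self, directory: str) -> None:
        self.directory = directory

    def set(self, key: str, get_data: Callable[[], IO[bytes]]) -> None:
        source = get_data()  # data is materialized lazily, right before writing
        with open(f"{self.directory}/{key}", "wb") as f:
            while True:
                chunk = source.read(64 * 1024)
                if not chunk:
                    break
                f.write(chunk)
```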

The general strategy I'm thinking about is how to inject the behavior you want without risking breaking the interface. While I can't think of another style of cache, that doesn't mean someone else won't over time, and this might make optimizing for memory vs. disk vs. ??? easier for others to deal with.

Just throwing it out there! Thanks for looking at this!
