Faster cache deserialization #3456
Comments
Random idea: Could we use a different format than JSON, perhaps marshal?
JSON serialization itself is not the bottleneck. The json stdlib module uses highly optimized C code when (de)serializing using default options, which we are careful to use unless you pass …
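A quick way to sanity-check claims like these is a micro-benchmark. Here is a hypothetical sketch comparing stdlib `json` and `marshal` deserialization on a cache-shaped payload (the payload shape and sizes are made up for illustration):

```python
# Hypothetical micro-benchmark: compare stdlib json and marshal round-trips
# on a payload loosely shaped like a mypy cache file.
import json
import marshal
import timeit

payload = {"names": {f"sym{i}": {".class": "Var", "line": i} for i in range(1000)}}

json_blob = json.dumps(payload)
marshal_blob = marshal.dumps(payload)

print("json.loads:   ", timeit.timeit(lambda: json.loads(json_blob), number=1000))
print("marshal.loads:", timeit.timeit(lambda: marshal.loads(marshal_blob), number=1000))
```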
Another random idea (probably I have mentioned it before): use … (Also, I think stdlib …)
Hm, but the …
Yes, but on the other hand, some of the methods could be simpler, since intermediate …
Just another random idea. I have heard from a friend who works on a project that has a feature very similar to our … I understand that the bottleneck is probably not parsing JSON but creating actual instances, but I have heard that SQL supports self-referential data structures, so maybe we could also save some time in …
Decreasing priority for now. We'll be focusing on fine-grained incremental checking first.
Is this still a concern? I believe we now have a SQL store for the cache as an option. If it is still a concern, I do like the idea of using …
We now have an option of using sqlite for storing cache information, which is helpful. Deserialization performance is still an issue for large builds, though. Using …

Another idea not discussed above is to deserialize cache files lazily. For example, consider a module …
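To illustrate the lazy idea, here is a minimal sketch, assuming we keep each module's raw JSON text around and only parse it on first access; `LazyModuleCache` and the example module are hypothetical:

```python
# Hypothetical sketch of lazy cache loading: keep the raw JSON text per
# module and only parse it the first time the module's data is accessed.
import json
from typing import Any


class LazyModuleCache:
    def __init__(self, raw_json: str) -> None:
        self._raw: str | None = raw_json
        self._data: dict[str, Any] | None = None

    @property
    def data(self) -> dict[str, Any]:
        # Modules whose symbols are never looked at never pay the parse cost.
        if self._data is None:
            assert self._raw is not None
            self._data = json.loads(self._raw)
            self._raw = None  # let the raw text be garbage collected
        return self._data


caches = {"ctypes": LazyModuleCache('{"names": {"c_int": {".class": "Var"}}}')}
# Nothing is parsed until a name from ctypes is actually needed:
c_int = caches["ctypes"].data["names"]["c_int"]
```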
Our use of SQL is totally half-assed, though. We have a table with filename, data, mtime columns and we store our json blobs in it.
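For reference, that amounts to roughly the following; the `files` table name is an assumption, the columns are the ones from the comment above:

```python
# Minimal sketch of the cache table described above: one row per cache
# file, with the JSON blob stored as-is.
import sqlite3

conn = sqlite3.connect("cache.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS files (filename TEXT PRIMARY KEY, data BLOB, mtime REAL)"
)
conn.execute(
    "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
    ("foo.data.json", b'{"names": {}}', 1234567890.0),
)
conn.commit()
data, mtime = conn.execute(
    "SELECT data, mtime FROM files WHERE filename = ?", ("foo.data.json",)
).fetchone()
```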
Hi, is there any way to implement custom serialization/deserialization like pickle? In our case loading caches from disk takes almost as long as generating them anew.
Another option is to write a custom binary serializer, but it's kind of awkward to implement right now. Once mypyc adds better support for encoding and decoding binary data, this should be feasible, and it would be a real-world use case for the relevant, though currently still hypothetical, mypyc functionality. This would have to be designed carefully to avoid complicating the maintenance of mypy serialization code, if we decide to take this route.

Faster cache deserialization could help once we have parallel type checking (#933), at least if parallel workers would use the mypy cache to share data with each other.

Cache size reduction is a related issue (#15731).
A faster, custom binary cache format is a promising way to speed up deserialization. It can be combined with some of the other ideas discussed above.

Constructing objects is likely a big bottleneck. A JSON parser constructs a string object for each dictionary key, and these are discarded afterwards. Also, many of the list and dictionary objects produced during JSON parsing are discarded during deserialization. A custom deserialization format would let us avoid most of this overhead.

I did a small experiment, which suggests that JSON parsing likely accounts for roughly half of the time spent in deserialization during self checking. Another interesting observation is that most of the strings in the data occur at least twice per file.

I will explain my proposal for the binary format next.

### Binary cache format

Each serialized value can be encoded as an 8-bit type tag followed by type-specific binary data. Values can be either primitives (int, float, str, bool, None, list, or dict) or instances (e.g. TupleType).

Each primitive type has a separate encoding. For example, int would be encoded as a variable-length sequence of bytes: if the high bit is set, additional bytes follow.

Instance data has a common encoding as (name, value) fields that represent attributes. Attributes are encoded as a single byte tag representing the name, followed by the value. The value uses the general value encoding. The (name, value) fields must be in a specific order, but some may be omitted, depending on the values of earlier attributes. The field name tags are shared between instance types, so that each instance type with a …

### String encoding

Since most string values tend to have duplicates in the same cache file, sharing string objects would improve efficiency. Each string would either be encoded as a length followed by contents, or as a reference to a previously decoded string. Using references to strings lets us share string objects, which would both speed up decoding and reduce memory use, at the cost of slightly slower encoding.

It may also be worth it to have an efficient encoding of shared module name prefixes of fullnames, since many fullnames have long -- but shared -- module name prefixes. This doesn't seem essential, though.

### Analysis

This format has some nice features:

…
There are some drawbacks and limitations as well:

…
### Implementation sketch

Module `fields` (imported below from `mypy.serialize`) would define the field tags:

```python
# Field tags (must be unique and in range 0..255)
name: Final = 0
line: Final = 1
items: Final = 2
...
```

Module `types` would define the type tags:

```python
# Type tags (must be unique and in range 0..255)
instance: Final = 0    # Instance
tuple_type: Final = 1  # TupleType
...
```

Some type tags would be reserved for primitives, such as …

Serialization and deserialization methods could look like this:

```python
from mypy.serialize import fields, types

...

class TupleType:
    ...

    def serialize(self, w: Writer) -> None:
        w.write_type_tag(types.tuple_type)
        w.write_type_list_field(fields.items, self.items)
        w.write_instance_field(fields.partial_fallback, self.partial_fallback)
        w.write_bool_field(fields.implicit, self.implicit)

    @classmethod
    def deserialize(cls, r: Reader) -> TupleType:
        r.read_type_tag(types.tuple_type)
        return TupleType(
            r.read_type_list_field(fields.items),
            r.read_instance_field(fields.partial_fallback),
            implicit=r.read_bool_field(fields.implicit),
        )
```

### Performance with mypyc

Currently, performance with the binary format could probably be better than JSON when compiled with mypyc, but worse without compilation. With further mypyc improvements, we can probably squeeze out a fair amount of additional performance. It's too early to give precise estimates, but I expect that we can make deserialization 2x to 3x faster with a binary format (assuming some new mypyc features).
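To make the format above more concrete, here is a minimal hypothetical sketch of how a `Writer`/`Reader` pair could implement the variable-length int encoding and the shared-string encoding described earlier. All names and framing details are illustrative, not an actual mypy API:

```python
# Hypothetical sketch: LEB128-style varints plus back-references for
# repeated strings, as described in the proposal above.

class Writer:
    def __init__(self) -> None:
        self.buf = bytearray()
        self.strings: dict[str, int] = {}  # string -> reference index

    def write_int(self, value: int) -> None:
        # Low 7 bits per byte; a set high bit means another byte follows.
        assert value >= 0
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                self.buf.append(byte | 0x80)
            else:
                self.buf.append(byte)
                break

    def write_str(self, s: str) -> None:
        ref = self.strings.get(s)
        if ref is not None:
            self.buf.append(1)  # tag: reference to an earlier string
            self.write_int(ref)
        else:
            self.strings[s] = len(self.strings)
            data = s.encode("utf-8")
            self.buf.append(0)  # tag: inline string (length + contents)
            self.write_int(len(data))
            self.buf += data


class Reader:
    def __init__(self, buf: bytes) -> None:
        self.buf = buf
        self.pos = 0
        self.strings: list[str] = []  # previously decoded strings

    def read_int(self) -> int:
        result = 0
        shift = 0
        while True:
            byte = self.buf[self.pos]
            self.pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result
            shift += 7

    def read_str(self) -> str:
        tag = self.buf[self.pos]
        self.pos += 1
        if tag == 1:
            # Back-reference: reuse the shared str object (faster, less memory).
            return self.strings[self.read_int()]
        length = self.read_int()
        s = self.buf[self.pos : self.pos + length].decode("utf-8")
        self.pos += length
        self.strings.append(s)
        return s


w = Writer()
for name in ["builtins.int", "builtins.int", "builtins.str"]:
    w.write_str(name)
r = Reader(bytes(w.buf))
first, second, third = (r.read_str() for _ in range(3))
assert (first, second, third) == ("builtins.int", "builtins.int", "builtins.str")
assert first is second  # the duplicate decodes to the very same object
```

Note how the back-reference path is where both wins come from: decoding skips UTF-8 work for repeated strings, and every occurrence shares one str object.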
For `mypy -c 'import torch'`, the cache load time goes from 0.44s to 0.25s, as measured by the manager's data_json_load_time. If I time dump times specifically, they go from 0.65s to 0.07s. Overall, a pretty reasonable perf win -- should we make it a required dependency? See also #3456
Deserialization of cache files is a major bottleneck in incremental mode, so any performance improvements there would be valuable.
At least @pkch has previously worked on this, and it's unclear if any significant wins are achievable without a lot of work. However, there seems to be some redundant information in the cache files that we haven't looked at eliminating yet.
Here are some random ideas:

- … `int` and `str`.
- … `"defaults": true` to mark that we can skip processing non-default values altogether (see the sketch at the end of this issue).

Idea for how to move forward:

…
In any case, this issue isn't important enough to spend a lot of time on. We have other plans to speed up mypy, but they will take more effort, so some quick wins here would be nice.
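To make the `"defaults": true` idea from the list above concrete, here is a minimal hypothetical sketch; `Var` and its fields are illustrative stand-ins, not mypy's actual serialization schema:

```python
# Hypothetical sketch of the '"defaults": true' idea: omit fields that hold
# their default values and record that fact, so deserialization can skip
# per-field processing entirely for the common case.
from typing import Any


class Var:
    def __init__(self, name: str, line: int = -1, is_ready: bool = True) -> None:
        self.name = name
        self.line = line
        self.is_ready = is_ready

    def serialize(self) -> dict[str, Any]:
        data: dict[str, Any] = {".class": "Var", "name": self.name}
        if self.line == -1 and self.is_ready:
            data["defaults"] = True  # every optional field holds its default
        else:
            data["line"] = self.line
            data["is_ready"] = self.is_ready
        return data

    @classmethod
    def deserialize(cls, data: dict[str, Any]) -> "Var":
        if data.get("defaults"):
            return cls(data["name"])  # skip non-default processing altogether
        return cls(data["name"], data["line"], data["is_ready"])
```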