Optimize class creation #132042

Open
JelleZijlstra opened this issue Apr 3, 2025 · 6 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage type-feature A feature request or enhancement

Comments

@JelleZijlstra
Member

JelleZijlstra commented Apr 3, 2025

Currently, creating an empty class is about 70x slower than creating an empty function in my profiling. Classes are much more complex and it makes sense that they're slower to create, but 70x feels excessive. (Related: #118761.)

I ran some profiling on my Mac with a sample script that just made empty classes in a loop:

[Image: profiler output for the sample script]
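For anyone who wants to reproduce a comparison of this kind, here is a minimal sketch using `timeit`. It is not the script used for the profile above; the helper names `make_function` and `make_class` and the iteration count `N` are assumptions made for illustration, and the ratio it prints will vary by machine and Python version.

```python
import timeit


def make_function():
    # Creating a function object reuses an already-compiled code object.
    def f():
        pass
    return f


def make_class():
    # Creating a class runs the class body and builds a new type object.
    class C:
        pass
    return C


N = 20_000  # assumed iteration count, chosen only to keep the run short

func_time = timeit.timeit(make_function, number=N)
class_time = timeit.timeit(make_class, number=N)

print(f"function: {func_time / N * 1e9:.0f} ns each")
print(f"class:    {class_time / N * 1e9:.0f} ns each")
print(f"ratio:    {class_time / func_time:.1f}x")
```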

A few things stood out:

  • A lot of time is spent updating slot definitions, i.e. filling in all of the tp_*, nb_*, etc. functions in the C struct for the type. We do this by iterating over all the slots, then looking up the function name (e.g., __add__) in the MRO and placing it in the slot for this class.
  • Significant time is spent in resolve_slotdups, which has a comment "XXX Maybe this could be optimized more -- but is it worth it?". Sounds promising. It helps deal with cases where one name maps to multiple slots (e.g. __add__ is both nb_add and sq_concat), and does that by iterating over all the slotdefs and finding other slots with the same name. It does that using some scratch space in the interpreter state, which does not seem thread-safe. I feel we could precompute the data instead, so we don't have to figure it out at runtime. For example, the slotdef struct could grow a new member to indicate whether or not the name is unique.
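The precomputation idea in the second bullet can be modelled in a few lines of Python. This is only a sketch: `SLOTDEFS` is a hypothetical, heavily shortened stand-in for the C slotdefs table, and `SLOTDEFS_WITH_FLAG` and `needs_dup_resolution` are illustrative names, not anything that exists in CPython.

```python
from collections import Counter

# Hypothetical, shortened stand-in for the slotdefs table:
# (special method name, slot it maps to).
SLOTDEFS = [
    ("__add__", "nb_add"),
    ("__add__", "sq_concat"),
    ("__mul__", "nb_multiply"),
    ("__mul__", "sq_repeat"),
    ("__len__", "sq_length"),
    ("__len__", "mp_length"),
    ("__iter__", "tp_iter"),
    ("__hash__", "tp_hash"),
]

# Computed once, up front, instead of on every class creation.
name_counts = Counter(name for name, _ in SLOTDEFS)

# Each entry grows an "is this name unique?" flag, analogous to the
# extra struct member suggested above.
SLOTDEFS_WITH_FLAG = [
    (name, slot, name_counts[name] == 1) for name, slot in SLOTDEFS
]


def needs_dup_resolution(name):
    """Return True if `name` maps to more than one slot."""
    return name_counts[name] > 1


for name, slot, unique in SLOTDEFS_WITH_FLAG:
    print(f"{name:10} -> {slot:12} unique={unique}")
```

With a flag like this, the expensive duplicate scan would only be needed for names such as `__add__`, and could be skipped for names such as `__iter__`.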

Most types will define very few of these slots, so it makes sense to try to look for an approach that does less work for slots without changes. I think something like this should work:

  • First fill in the slots table with all the slots from the first base class.
  • Then collect all slots for which we may need changes: either slots that have a non-NULL value in the second or later base, or slots the name of which appears in the new class's __dict__. For those slots only, perform an update.
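The two steps above can be sketched as a small Python model. Slot tables are represented as plain dicts here, and `plan_slot_updates`, `SLOT_NAMES`, and the string "implementations" are all hypothetical; the real change would live in the C type machinery and would need to handle details this sketch ignores.

```python
def plan_slot_updates(bases_slots, class_dict, slot_names):
    """Model of the proposed approach.

    bases_slots: one dict per base class (in order), slot -> impl or None.
    class_dict:  the namespace of the class being created.
    slot_names:  slot -> special method name that fills it.
    Returns the initial table and the set of slots that need a real update.
    """
    # Step 1: start from a copy of the first base's slot table.
    table = dict(bases_slots[0])

    # Step 2: collect only the slots that might differ from that copy.
    candidates = set()
    for later_base in bases_slots[1:]:
        candidates.update(
            slot for slot, impl in later_base.items() if impl is not None
        )
    candidates.update(
        slot for slot, name in slot_names.items() if name in class_dict
    )
    return table, candidates


SLOT_NAMES = {
    "nb_add": "__add__",
    "tp_iter": "__iter__",
    "tp_hash": "__hash__",
    "sq_length": "__len__",
}
first_base = {"nb_add": None, "tp_iter": None,
              "tp_hash": "object_hash", "sq_length": None}
second_base = {"nb_add": None, "tp_iter": "mixin_iter",
               "tp_hash": "object_hash", "sq_length": None}
new_class_dict = {"__len__": "C.__len__", "helper": "C.helper"}

table, candidates = plan_slot_updates(
    [first_base, second_base], new_class_dict, SLOT_NAMES
)
print(sorted(candidates))
```

In this example `nb_add` is never touched: it is NULL in every later base and `__add__` is not in the new class's namespace, so the copied value from the first base stands.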

This should make it possible to make class creation something like 2x faster. I haven't started working on implementing this and I may not have time to do it; if you see this and are interested, feel free to pick it up!


@JelleZijlstra JelleZijlstra added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Apr 3, 2025
@sergey-miryanov
Contributor

I would try if @AA-Turner doesn't pick this up :)

@markshannon
Member

I think the whole concept of slots (the tp_... slots, not __slots__ or PyType_Slot) as an optimization is the root of the problem: they slow down class creation and don't help performance, because they complicate the real optimizations that we perform.

We should view the tp_slots as doing two distinct things:

  1. Specifying the behavior of operations when the struct _typeobject is passed to PyType_Ready
  2. A backwards-compatible way for C extensions to access operations, e.g. iter(i) as Py_TYPE(i)->tp_iter

For pure Python objects, all slots can be filled in with a function that does the dynamic lookup, which should be very quick.
Once resolved, we can overwrite the slot with a more direct version.

For classes defined by struct _typeobject we can just replace the NULLs with the dynamic lookup function.
For classes defined from PyType_Spec we fill in the defined slots and then replace the NULLs with the dynamic lookup function.
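As a rough Python model of the "dynamic lookup first, direct version once resolved" idea: the `LazySlot` class below is hypothetical and only illustrates the control flow. It ignores invalidation when a class is later mutated, which a real implementation would have to handle.

```python
class LazySlot:
    """Hypothetical model of a slot that starts as a generic dynamic lookup."""

    def __init__(self, owner, dunder):
        self.owner = owner
        self.dunder = dunder
        self.lookups = 0
        # Initially every slot points at the same generic lookup routine.
        self.impl = self._dynamic_lookup

    def _dynamic_lookup(self, obj, *args):
        self.lookups += 1
        for klass in self.owner.__mro__:
            if self.dunder in vars(klass):
                # Once resolved, overwrite the slot with the direct function.
                self.impl = vars(klass)[self.dunder]
                return self.impl(obj, *args)
        raise TypeError(f"{self.owner.__name__} does not define {self.dunder}")

    def __call__(self, obj, *args):
        return self.impl(obj, *args)


class Vec:
    def __init__(self, x):
        self.x = x

    def __add__(self, other):
        return Vec(self.x + other.x)


add_slot = LazySlot(Vec, "__add__")
first = add_slot(Vec(1), Vec(2))    # goes through the MRO walk
second = add_slot(Vec(10), Vec(5))  # calls Vec.__add__ directly
print(first.x, second.x, add_slot.lookups)
```

Creating the slot costs almost nothing in this model; the MRO walk is paid only for operations that are actually used.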

Also, see faster-cpython/ideas#146 (comment)

@markshannon
Member

The bytecode for creating classes is also a bit of a mess. We seem to be creating code objects, just to create functions just to call them, to do things that could easily be done inline.
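The current behaviour can be inspected with the standard `dis` module. The sketch below compiles a one-line class statement and lists the opcodes and the nested code object; exact opcode names differ between Python versions, so treat the printed output as indicative only.

```python
import dis
import types

source = "class C:\n    pass\n"
module_code = compile(source, "<example>", "exec")

# The class body is compiled to its own code object, stored as a constant.
nested = [c for c in module_code.co_consts if isinstance(c, types.CodeType)]

# Opcodes the module executes in order to build the class.
opnames = [ins.opname for ins in dis.get_instructions(module_code)]

print("nested code objects:", [c.co_name for c in nested])
print("module opcodes:", opnames)
```

The module-level code wraps the body's code object in a function and passes that function to the class-building helper, which calls it once.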

There is also a fair bit of machinery about finding the metaclass and the base class tuple. We should compute those in the interpreter and pass them into the class creation machinery.

E.g., given a BUILD_CLASS instruction that expects name, meta, bases, dict, for class C: ... we get:

    LOAD_CONST "C"
    LOAD_CONST object
    LOAD_CONST ()
    // create method dictionary
    BUILD_CLASS

For class C(D): ... we get:

    LOAD_CONST "C"
    LOAD_NAME "D"
    COPY 1
    LOAD_ATTR __class__ # Get the metaclass
    SWAP 2
    BUILD_TUPLE 1
    // create method dictionary
    COPY 2
    LOAD_ATTR "__prepare__"
    SWAP 2 
    CALL 1   # meta.__prepare__(method_dict)

For multiple inheritance, LOAD_ATTR __class__ becomes CALL_INTRINSIC 2 calculate_metaclass, and
if the metaclass is explicit (metaclass=expr), then COPY 1; LOAD_ATTR __class__ becomes just expr.

I'm missing the code for setting __orig_bases__ and __class__, but I think those come after BUILD_CLASS.
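The metaclass and `__prepare__` steps that the sketched bytecode spells out have a pure-Python counterpart in the standard library, `types.prepare_class`, which can help when checking what the inlined instructions would need to compute. The classes `Meta`, `A`, and `B` below are made up for illustration.

```python
import types


class Meta(type):
    @classmethod
    def __prepare__(mcls, name, bases, **kwds):
        # Return the namespace the class body will execute in.
        return {"prepared_by": mcls.__name__}


class A(metaclass=Meta):
    pass


class B:
    pass


# Multiple inheritance: the most derived metaclass among the bases is chosen.
meta, namespace, kwds = types.prepare_class("C", (A, B))
print(meta.__name__, namespace, kwds)

# Explicit metaclass: taken from the keywords instead of from the bases.
meta2, namespace2, kwds2 = types.prepare_class("D", (B,), {"metaclass": Meta})
print(meta2.__name__, namespace2, kwds2)

# Plain class: the metaclass is simply type.
meta3, namespace3, _ = types.prepare_class("E", (B,))
print(meta3.__name__, namespace3)
```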

@vstinner
Member

vstinner commented Apr 3, 2025

> A lot of time is spent updating slot definitions, i.e. filling in all of the tp_*, nb_*, etc. functions in the C struct for the type. We do this by iterating over all the slots, then looking up the function name (e.g., __add__) in the MRO and placing it in the slot for this class.

Previous attempt in 2017: #76527

@picnixz picnixz added the type-feature A feature request or enhancement label Apr 4, 2025
@sergey-miryanov
Contributor

I have added some test results - #132156 (comment)
I want to dig deeper, but if you are interested in some numbers - please take a look. And maybe stop me from further research.

@sergey-miryanov
Contributor

Ok, I got rid of resolve_slotdups and believe it is ready for review. Please take a look.
