Skip to content

Fix N+1 problem for one-to-one and many-to-one relationships #253

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Nov 18, 2019

Conversation

jnak
Copy link
Collaborator

@jnak jnak commented Oct 29, 2019

Hi everyone,

This PR fixes the N+1 problem for one-to-one and many-to-one relationships. If we think this approach makes sense, I'll send a separate PR to address one-to-many and many-to-many relationships.

Design goals

  • Be completely seamless to end-users. You should not need to modify your SQLAlchemy models or your SQLAlchemyObjectType models to benefit from this.

  • Work with any SQLAlchemy relationships, including ones that have complex primaryjoin and secondaryjoin conditions.

  • Avoid generating one large SQL query using nested JOIN in the resolver of the root field of the query. For non-trivial queries, the joined SQL statement is complex with hard-to-predict performances. See SQLAlchemy's comparison of selectin with joined loading for more context.

  • Fully work in schemas that include non-SQLAlchemyObjectType Graphene types. This batching optimization should compose well with other types, including those that use Dataloader under the hood.

  • Avoid building SQL queries manually. It's non-trivial to generate correct SQL statements that support complex joins and composite primary keys across a variety of DBs. We should lean as much as possible on SQLAlchemy to do the heavy lifting.

Non-goals

  • Only select the columns / fields that are being queried. This will be addressed in a separate PR.

  • Caching records so we don't fetch the same record twice from the DB. This is tricky to get right. We can revisit this later if there is a need.

Solution

The idea here is to have SQLAlchemyObjectType automatically generate batch resolvers for relationships. The resolvers use Dataloader to collect all the similar relationship property accesses. From there, we ask SQLAlchemy to load all the relationships of those parents as one SQL query.

SQLAlchemy does not have a public API to do this but we can piggyback on some internal APIs of the selectin eager loading strategy. It's a bit hacky but I think it's better than re-implementing and maintaining a big chunk of the selectin loader logic ourselves. You can find more details about this approach in the docstring of the batch loader. Hopefully we can get some feedback from the SQLAlchemy maintainers (cc @zzzeek).

If we decide to move forward with this PR, I'll make sure to add more test cases so we can catch regressions in the event the internal SQLALchemy APIs we rely on change.

Cheers,
J

@coveralls
Copy link

coveralls commented Oct 29, 2019

Coverage Status

Coverage increased (+0.1%) to 97.464% when pulling d7d90f0 on jnak:batch-sql into 89c3726 on graphql-python:master.

Copy link

@zzzeek zzzeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function here does not seem to be participating within the query loading process because there's no context being passed in, therefore it would be easier to maintain if you were to call straight into SelectInLoader._load_for_path() directly, sending in the most minimal state that this single function needs in order to proceed, rather than trying to reconstruct a harness around it using lots of other private APIs that can change.

Based on that, if you were to instead go to SQLAlchemy and propose a distilled utility function that calls into SelectInLoad._load_for_path given simple external arguments like Session, list of instances, and a target relationship, that can include test coverage on the SQLAlchemy side such that this single function continues to work in the same way publicly going forward, and you would no longer rely on private arrangements within SQLAlchemy and I wouldn't have to worry about rearranging internals.

class NonListRelationshipLoader(dataloader.DataLoader):
cache = False

def batch_load_fn(self, parents): # pylint: disable=method-hidden
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the purpose of this code is to get to the SelectInLoader._load_for_path method alone; the rest of the private API access here is to try to call into that function in the way that it is usually called, but it appears that this ends up loading in lots of private SQLAlchemy geometries that aren't guaranteed to stay the same. it is much eaiser for me to provide a public API version of a single call like SelectnLoader._load_for_path (with a fixed single-token path as is the case here) than it is for me to worry that the overall structure of objects like PostLoad / post_load_paths / PostLoad.loaders etc., which are actually all fairly recent additions to the internals, are going to impact external projects if I change them around. so below i will try to remark on private APIs that don't seem to be needed based on the inputs to this function.

assert parent not in session.dirty

load = Load(parent_mapper.entity).selectinload(model_attr)
query = session.query(parent_mapper.entity).options(load)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we are starting with a brand new Query here and it has no state or joins at all. That is, there's no paths or options coming in. this means a lot of what happens below is much simpler.


# Taken from orm.query.Query.__iter__
# https://git.io/JeuBi
context = query._compile_context()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we are making a QueryContext because it eventually gets passed into _load_for_path, where it is used to determine from the original query what the other MapperOptions on that query are, which above we know they are none, if the Query has _populated_existing on it, which we know it does not, what the current Session is, which we have right here, and that seems to be about it. So we don't actually need a fully compiled context, we just need an object with basically the Session on it. You could make a QueryContext directly, or just a short object that has the attributes that the loader needs. that is still making some private API assumptions, but it would be local to just this one loader method and if i provide a public function it wouldn't need these things.


# Taken from orm.loading.instances
# https://git.io/JeuBR
context.post_load_paths = {}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used by PostLoader but if you weren't using PostLoader you wouldn't need to know about this.


# Taken from orm.strategies.SelectInLoader.__init__
# https://git.io/JeuBd
selectin_strategy = getattr(parent_mapper.entity, model_attr).property._get_strategy(load.strategy)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can more directly just instantiate a SelectInLoader object, which means some assumptions about the constructor, but fewer than are needed than calling _get_strategy(load.strategy).


# Taken from orm.loading._instance_processor._instance
# https://git.io/JeuBq
post_load = PostLoad()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PostLoad is likely where I'm most uncomfortable with the private API stuff. this is very esoteric internals that look as strange as they do because they are trying to do as much work up front as is possible, waiting for thousands of rows to come in where we want to minimize the call overhead, so they are hard to follow and they change all the time. Going straight to _load_for_path, if possible, likely less maintenance, even though that function could change too.

# https://git.io/Jeu4j
context.partials = {}
for parent in parents:
post_load.add_state(parent._sa_instance_state, True)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_sa_instance_state is accessed publicly by using sqlalchemy.inspect(parent).

# Taken from orm.loading._instance_processor._instance
# https://git.io/JeuBn
# https://git.io/Jeu4j
context.partials = {}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more scary internal stuff that I don't look at unless I have to, I'm not sure if one of these functions hit upon this, I'm not seeing the codepath that does, but this is a longer codepath than I think you need here.


# Taken from orm.strategies.SelectInLoader.create_row_processor
# https://git.io/Jeu4F
selectin_path = context.query._current_path + parent_mapper._path_registry
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context.query._current_path is always going to be the "root" path since you created the Query above with no "path", this is PathRegistry.root, but then you are adding parent_mapper._path_registry, but this whole expression can be shortened to parent_mapper._path_registry.

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I can add anything beyond @zzzeek's comments.

# TODO Batch many-to-many and one-to-many relationships
return _get_attr_resolver(obj_type, model_attr, model_attr)

class NonListRelationshipLoader(dataloader.DataLoader):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why create a new class for each call to _get_relationship_resolver()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because each relationship is different. The selectin loader expects a specific relationship (vs a specific child and / or parent).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you pass the relevant local variables into the constructor rather than use them as closure variables?

zzzeek
zzzeek previously approved these changes Oct 30, 2019
Copy link

@zzzeek zzzeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great. you still have some exposure if I change that method for the moment, but I think overall this is more easy to adapt towards if that were to happen. public API function in SQLAlchemy should be pursued.

# For our purposes, the query_context will only used to get the session
query_context = QueryContext(session.query(parent_mapper.entity))

loader._load_for_path(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, so you still have some private API here but it's way more contained and we can always make a public function out of this, feel free to propose to SQLAlchemy.

Copy link
Collaborator Author

@jnak jnak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zzzeek Thank you so much for your feedback. That's super helpful. If we decide to go forward with this overall approach, I'll definitely pursue making a public utility in SQLAlchemy repo. Thanks!

# TODO Batch many-to-many and one-to-many relationships
return _get_attr_resolver(obj_type, model_attr, model_attr)

class NonListRelationshipLoader(dataloader.DataLoader):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because each relationship is different. The selectin loader expects a specific relationship (vs a specific child and / or parent).

@jnak jnak mentioned this pull request Oct 30, 2019
Nabellaleen
Nabellaleen previously approved these changes Nov 12, 2019
Copy link
Collaborator

@Nabellaleen Nabellaleen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find you PR very interesting and the help of @zzzeek is really priceful !

--

Just the build to fix :)

@jnak
Copy link
Collaborator Author

jnak commented Nov 12, 2019

@Nabellaleen Great that you're on board with the approach. I fixed the tests by disabling batching for SQLAlchemy versions < 1.2 since selectin was introduced in 1.2. Can you re-approve the diff so I can merge it?

I'm going to send a separate PR soon for batching one-to-many and many-to-many relationships.

@jnak jnak merged commit 98e6fe7 into graphql-python:master Nov 18, 2019
@cglacet
Copy link

cglacet commented Jan 16, 2020

Will this be in the next release?

@jnak
Copy link
Collaborator Author

jnak commented Jan 22, 2020

@cglacet Most likely yes. But we're currently blocked by #254 and few other dependent changes. Would you mind taking a look at it? That would be really helpful to push this over the finish line

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants