-
Notifications
You must be signed in to change notification settings - Fork 227
Fix N+1 problem for one-to-one and many-to-one relationships #253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function here does not seem to be participating within the query loading process because there's no context being passed in, therefore it would be easier to maintain if you were to call straight into SelectInLoader._load_for_path() directly, sending in the most minimal state that this single function needs in order to proceed, rather than trying to reconstruct a harness around it using lots of other private APIs that can change.
Based on that, if you were to instead go to SQLAlchemy and propose a distilled utility function that calls into SelectInLoad._load_for_path given simple external arguments like Session, list of instances, and a target relationship, that can include test coverage on the SQLAlchemy side such that this single function continues to work in the same way publicly going forward, and you would no longer rely on private arrangements within SQLAlchemy and I wouldn't have to worry about rearranging internals.
class NonListRelationshipLoader(dataloader.DataLoader): | ||
cache = False | ||
|
||
def batch_load_fn(self, parents): # pylint: disable=method-hidden |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC the purpose of this code is to get to the SelectInLoader._load_for_path method alone; the rest of the private API access here is to try to call into that function in the way that it is usually called, but it appears that this ends up loading in lots of private SQLAlchemy geometries that aren't guaranteed to stay the same. it is much eaiser for me to provide a public API version of a single call like SelectnLoader._load_for_path (with a fixed single-token path as is the case here) than it is for me to worry that the overall structure of objects like PostLoad / post_load_paths / PostLoad.loaders etc., which are actually all fairly recent additions to the internals, are going to impact external projects if I change them around. so below i will try to remark on private APIs that don't seem to be needed based on the inputs to this function.
graphene_sqlalchemy/types.py
Outdated
assert parent not in session.dirty | ||
|
||
load = Load(parent_mapper.entity).selectinload(model_attr) | ||
query = session.query(parent_mapper.entity).options(load) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we are starting with a brand new Query here and it has no state or joins at all. That is, there's no paths or options coming in. this means a lot of what happens below is much simpler.
graphene_sqlalchemy/types.py
Outdated
|
||
# Taken from orm.query.Query.__iter__ | ||
# https://git.io/JeuBi | ||
context = query._compile_context() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we are making a QueryContext because it eventually gets passed into _load_for_path, where it is used to determine from the original query what the other MapperOptions on that query are, which above we know they are none, if the Query has _populated_existing on it, which we know it does not, what the current Session is, which we have right here, and that seems to be about it. So we don't actually need a fully compiled context, we just need an object with basically the Session on it. You could make a QueryContext directly, or just a short object that has the attributes that the loader needs. that is still making some private API assumptions, but it would be local to just this one loader method and if i provide a public function it wouldn't need these things.
graphene_sqlalchemy/types.py
Outdated
|
||
# Taken from orm.loading.instances | ||
# https://git.io/JeuBR | ||
context.post_load_paths = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used by PostLoader but if you weren't using PostLoader you wouldn't need to know about this.
graphene_sqlalchemy/types.py
Outdated
|
||
# Taken from orm.strategies.SelectInLoader.__init__ | ||
# https://git.io/JeuBd | ||
selectin_strategy = getattr(parent_mapper.entity, model_attr).property._get_strategy(load.strategy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can more directly just instantiate a SelectInLoader object, which means some assumptions about the constructor, but fewer than are needed than calling _get_strategy(load.strategy).
graphene_sqlalchemy/types.py
Outdated
|
||
# Taken from orm.loading._instance_processor._instance | ||
# https://git.io/JeuBq | ||
post_load = PostLoad() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PostLoad is likely where I'm most uncomfortable with the private API stuff. this is very esoteric internals that look as strange as they do because they are trying to do as much work up front as is possible, waiting for thousands of rows to come in where we want to minimize the call overhead, so they are hard to follow and they change all the time. Going straight to _load_for_path, if possible, likely less maintenance, even though that function could change too.
graphene_sqlalchemy/types.py
Outdated
# https://git.io/Jeu4j | ||
context.partials = {} | ||
for parent in parents: | ||
post_load.add_state(parent._sa_instance_state, True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_sa_instance_state is accessed publicly by using sqlalchemy.inspect(parent).
graphene_sqlalchemy/types.py
Outdated
# Taken from orm.loading._instance_processor._instance | ||
# https://git.io/JeuBn | ||
# https://git.io/Jeu4j | ||
context.partials = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more scary internal stuff that I don't look at unless I have to, I'm not sure if one of these functions hit upon this, I'm not seeing the codepath that does, but this is a longer codepath than I think you need here.
graphene_sqlalchemy/types.py
Outdated
|
||
# Taken from orm.strategies.SelectInLoader.create_row_processor | ||
# https://git.io/Jeu4F | ||
selectin_path = context.query._current_path + parent_mapper._path_registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
context.query._current_path is always going to be the "root" path since you created the Query above with no "path", this is PathRegistry.root, but then you are adding parent_mapper._path_registry, but this whole expression can be shortened to parent_mapper._path_registry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I can add anything beyond @zzzeek's comments.
# TODO Batch many-to-many and one-to-many relationships | ||
return _get_attr_resolver(obj_type, model_attr, model_attr) | ||
|
||
class NonListRelationshipLoader(dataloader.DataLoader): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why create a new class for each call to _get_relationship_resolver()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because each relationship is different. The selectin
loader expects a specific relationship (vs a specific child and / or parent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you pass the relevant local variables into the constructor rather than use them as closure variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great. you still have some exposure if I change that method for the moment, but I think overall this is more easy to adapt towards if that were to happen. public API function in SQLAlchemy should be pursued.
# For our purposes, the query_context will only used to get the session | ||
query_context = QueryContext(session.query(parent_mapper.entity)) | ||
|
||
loader._load_for_path( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, so you still have some private API here but it's way more contained and we can always make a public function out of this, feel free to propose to SQLAlchemy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zzzeek Thank you so much for your feedback. That's super helpful. If we decide to go forward with this overall approach, I'll definitely pursue making a public utility in SQLAlchemy
repo. Thanks!
# TODO Batch many-to-many and one-to-many relationships | ||
return _get_attr_resolver(obj_type, model_attr, model_attr) | ||
|
||
class NonListRelationshipLoader(dataloader.DataLoader): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because each relationship is different. The selectin
loader expects a specific relationship (vs a specific child and / or parent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find you PR very interesting and the help of @zzzeek is really priceful !
--
Just the build to fix :)
@Nabellaleen Great that you're on board with the approach. I fixed the tests by disabling batching for SQLAlchemy versions < 1.2 since I'm going to send a separate PR soon for batching |
Will this be in the next release? |
Hi everyone,
This PR fixes the N+1 problem for
one-to-one
andmany-to-one
relationships. If we think this approach makes sense, I'll send a separate PR to addressone-to-many
andmany-to-many
relationships.Design goals
Be completely seamless to end-users. You should not need to modify your
SQLAlchemy
models or yourSQLAlchemyObjectType
models to benefit from this.Work with any
SQLAlchemy
relationships, including ones that have complexprimaryjoin
andsecondaryjoin
conditions.Avoid generating one large SQL query using nested
JOIN
in the resolver of the root field of the query. For non-trivial queries, the joined SQL statement is complex with hard-to-predict performances. See SQLAlchemy's comparison ofselectin
withjoined
loading for more context.Fully work in schemas that include non-
SQLAlchemyObjectType
Graphene types. This batching optimization should compose well with other types, including those that useDataloader
under the hood.Avoid building SQL queries manually. It's non-trivial to generate correct SQL statements that support complex joins and composite primary keys across a variety of DBs. We should lean as much as possible on SQLAlchemy to do the heavy lifting.
Non-goals
Only select the columns / fields that are being queried. This will be addressed in a separate PR.
Caching records so we don't fetch the same record twice from the DB. This is tricky to get right. We can revisit this later if there is a need.
Solution
The idea here is to have
SQLAlchemyObjectType
automatically generate batch resolvers for relationships. The resolvers useDataloader
to collect all the similarrelationship
property accesses. From there, we askSQLAlchemy
to load all the relationships of those parents as one SQL query.SQLAlchemy
does not have a public API to do this but we can piggyback on some internal APIs of theselectin
eager loading strategy. It's a bit hacky but I think it's better than re-implementing and maintaining a big chunk of theselectin
loader logic ourselves. You can find more details about this approach in the docstring of the batch loader. Hopefully we can get some feedback from theSQLAlchemy
maintainers (cc @zzzeek).If we decide to move forward with this PR, I'll make sure to add more test cases so we can catch regressions in the event the internal
SQLALchemy
APIs we rely on change.Cheers,
J