-
Notifications
You must be signed in to change notification settings - Fork 25
BUGFIX: NodeDataRepository uses PersistenceObjectIdentifier to improve query speed #49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUGFIX: NodeDataRepository uses PersistenceObjectIdentifier to improve query speed #49
Conversation
@kdambekalns @daniellienert Maybe you can have a look at this PR please. It would greatly improve the speed of filling the ES build queue. |
Hey @gjwnc - interesting approach! Didn't expect that seeking is more expensive then sorting but we have the same issues - I will give it a try. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a "POD"?
Should that be "POI" (Persistence Object Identifier?) Or do I need to learn a new term here? 🤷♂️
@kdambekalns Arrgh, you're right. I'm using this abbreviation since years and never noticed it 🤦 . I've renamed to |
@daniellienert Yeah, this should work, because the primary key is usually indexed as a B-Tree thus it is already sorted, and that's the reason it is faster. Behind the scenes, for the primary key, there is no need to sort since it is already indexed. The query tells the DB engine, to use a given primary key value to start with and keep the sorting already available from the primary key unique index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by reading…
@kdambekalns We had a short talk about this issue during the NEOS con. Maybe you can have a look at this again during the sprint, please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great performance boost!! Tested with 200k nodes.
This PR reduces runtime of the
buildCommand
by around 90%.On a larger project, with ~350.000 nodes in the nodedatatable, it takes around 50 minutes to fill the build queue using
./flow nodeindexqueue:build --workspace live
. With this PR, it takes around 4 minutes.What I did
Instead of using
OFFSET
, I use a condition on the persistence object identifier and sort the nodedata table using the poi. This method allows the DB to provide the results in constant time for every query.Why does it work
The previous query, uses OFFSET, which means the database needs to search through the table as the offset increases. That's why the build command is slowing down as the offset increases. The DB needs to search for the correct offset every time.
To avoid this problem, we can make use of the primary key instead of the offset and start right after the last used poi to search for the next batch.