Skip to content

BUGFIX: NodeDataRepository uses PersistenceObjectIdentifier to improve query speed #49

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

gjwnc
Copy link
Contributor

@gjwnc gjwnc commented Sep 15, 2021

This PR reduces runtime of the buildCommand by around 90%.

On a larger project, with ~350.000 nodes in the nodedatatable, it takes around 50 minutes to fill the build queue using ./flow nodeindexqueue:build --workspace live. With this PR, it takes around 4 minutes.

What I did

Instead of using OFFSET, I use a condition on the persistence object identifier and sort the nodedata table using the poi. This method allows the DB to provide the results in constant time for every query.

Why does it work

The previous query, uses OFFSET, which means the database needs to search through the table as the offset increases. That's why the build command is slowing down as the offset increases. The DB needs to search for the correct offset every time.
To avoid this problem, we can make use of the primary key instead of the offset and start right after the last used poi to search for the next batch.

@gjwnc
Copy link
Contributor Author

gjwnc commented Mar 30, 2022

@kdambekalns @daniellienert Maybe you can have a look at this PR please. It would greatly improve the speed of filling the ES build queue.

@daniellienert daniellienert self-requested a review March 30, 2022 18:21
@daniellienert
Copy link
Contributor

Hey @gjwnc - interesting approach! Didn't expect that seeking is more expensive then sorting but we have the same issues - I will give it a try.

Copy link
Member

@kdambekalns kdambekalns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a "POD"?

Should that be "POI" (Persistence Object Identifier?) Or do I need to learn a new term here? 🤷‍♂️

@gjwnc
Copy link
Contributor Author

gjwnc commented Mar 31, 2022

What is a "POD"?

Should that be "POI" (Persistence Object Identifier?) Or do I need to learn a new term here? man_shrugging

@kdambekalns Arrgh, you're right. I'm using this abbreviation since years and never noticed it 🤦 . I've renamed to POI

@gjwnc
Copy link
Contributor Author

gjwnc commented Mar 31, 2022

@daniellienert Yeah, this should work, because the primary key is usually indexed as a B-Tree thus it is already sorted, and that's the reason it is faster. Behind the scenes, for the primary key, there is no need to sort since it is already indexed. The query tells the DB engine, to use a given primary key value to start with and keep the sorting already available from the primary key unique index.

Copy link
Member

@kdambekalns kdambekalns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by reading…

@gjwnc
Copy link
Contributor Author

gjwnc commented May 3, 2022

@kdambekalns We had a short talk about this issue during the NEOS con. Maybe you can have a look at this again during the sprint, please?

Copy link
Contributor

@daniellienert daniellienert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great performance boost!! Tested with 200k nodes.

@daniellienert daniellienert merged commit bfc80c4 into Flowpack:master May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants