-
Notifications
You must be signed in to change notification settings - Fork 98
feat: Implementation of run partition query #1080
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
QUEUE_SIZE_PER_WORKER = 32 | ||
MAX_PARALLELISM = 100 | ||
METADATA_LOCK = Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now a global lock, right? Would it be possible to make it an instance variable for the MergedResultSet class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why we want it to be an instance of MergedResultSet because it is not used by MergedResultSet class but used just by static _set_metadata()
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that if you have multiple MergedResultSets open, then they will block each other, which is not necessary. So a more correct design would be to have the lock as an instance variable, and the _set_meta_data
as an instance method. The goal of this lock is to prevent multiple threads from setting/reading the metadata field of a specific MergedResultSet
at the same time, not to prevent different MergedResultSet
s from setting their respective metadata fields.
I agree that it is a bit theoretical, as it is unlikely that a user will have a large number of MergedResultSet
s open at the same time, but putting the lock where it is actually needed will make the code easier to read and understand. Now it seems like it would be a problem if two different MergedResultSet
s try to set their metadata at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG, Changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks generally good to me, with a couple of nits on the setting and locking for metadata
.
|
||
QUEUE_SIZE_PER_WORKER = 32 | ||
MAX_PARALLELISM = 100 | ||
METADATA_LOCK = Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that if you have multiple MergedResultSets open, then they will block each other, which is not necessary. So a more correct design would be to have the lock as an instance variable, and the _set_meta_data
as an instance method. The goal of this lock is to prevent multiple threads from setting/reading the metadata field of a specific MergedResultSet
at the same time, not to prevent different MergedResultSet
s from setting their respective metadata fields.
I agree that it is a bit theoretical, as it is unlikely that a user will have a large number of MergedResultSet
s open at the same time, but putting the lock where it is actually needed will make the code easier to read and understand. Now it seems like it would be a problem if two different MergedResultSet
s try to set their metadata at the same time.
if merged_result_set._metadata is None: | ||
_set_metadata(merged_result_set, results) | ||
except Exception as ex: | ||
self._queue.put(PartitionExecutorResult(exception=ex)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also call _set_metadata
here (if it has not already been set) to prevent the metadata
property from blocking indefinitely if someone tries to call that after an error has occurred. If for example the query fails for all partitions, then the user will get a MergedResultSet
that returns an error whenever you try to iterate over the rows, but that hangs forever if you try to call metadata
. The latter is always very hard to debug, so we should whenever possible return an error instead of block when something goes wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
No description provided.