Complex join fails with memory error #148
That is a perfectly valid use case and exactly what I am looking for to test the limits of the current implementation! Thanks for this - that is definitely something we can debug together :-)
Some background: complex joins are hard, because we basically need to join everything with everything to test the join condition (there might be some shortcuts for "<" and ">", but the optimization is not that good yet). After your issue I realized that I currently solve this by copying all data onto a single node - which hardly makes sense for big data, so I would like to reimplement it.
Here is the solution I am currently thinking of: if the first dataset has N partitions and the second one M partitions, we can create N x M partitions and join each partition with each partition. While doing so, we can already filter each partition pair separately with the join condition, which should reduce the data by a large amount. The only problem is that this will probably leave many empty partitions (which might be an overhead), so we probably want to re-partition afterwards to get max(N, M) partitions. Would you think that makes sense? I will start implementing this in dask-sql - could you maybe give it a test once it is done?
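As a rough illustration (a minimal sketch of the idea, not the actual dask-sql implementation; `condition` is a hypothetical predicate that takes the merged frame and returns a boolean mask):

```python
import dask.dataframe as dd

def cross_join_with_filter(left: dd.DataFrame, right: dd.DataFrame, condition):
    # Join every left partition with every right partition (N x M tasks),
    # applying the join condition as early as possible.
    pieces = []
    for i in range(left.npartitions):
        left_part = left.partitions[i].assign(_key=1)
        for j in range(right.npartitions):
            right_part = right.partitions[j].assign(_key=1)
            # Cross product of the two partitions via a constant helper key
            pair = left_part.merge(right_part, on="_key", suffixes=("_l", "_r"))
            # Filter immediately with the join condition to shrink the data
            pieces.append(pair[condition(pair)].drop(columns="_key"))
    result = dd.concat(pieces)
    # Collapse the many (possibly empty) partitions back to max(N, M)
    return result.repartition(npartitions=max(left.npartitions, right.npartitions))
```

For example, `condition` could be `lambda df: df["a_l"] < df["b_r"]` for a non-equi join.
|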
A broadcast merge might be helpful here:
dask/dask#7143
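For reference, a broadcast merge in dask looks roughly like this (a sketch, assuming a dask release that already includes the `broadcast=` flag from that PR):

```python
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(pd.DataFrame({"id": range(1000), "x": 1}), npartitions=8)
right = dd.from_pandas(pd.DataFrame({"id": range(10), "y": 2}), npartitions=1)

# The small, single-partition `right` side is broadcast to every partition
# of `left`, avoiding a full shuffle of the large side.
result = left.merge(right, on="id", how="inner", broadcast=True)
print(result.compute())
```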
|
@nils-braun: I think that solution sounds promising - happy to give it a test once it's done! |
Thanks, @quasiben, for the hint! Very interesting read. I chose a quite similar technique here (basically, the PR also introduces a nested loop over both sets of partitions and combines every partition with every partition). Unfortunately, there are some differences between the left/right/inner join from the PR and the cross join we need here, so I cannot reuse the exact same code.
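For illustration (a minimal sketch, not the PR's actual code): the cross join needed here can be emulated with an ordinary inner merge on a constant helper column.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"b": [10, 20]})

# Pair every row of `left` with every row of `right` (3 x 2 = 6 rows)
cross = (
    left.assign(_key=1)
    .merge(right.assign(_key=1), on="_key")
    .drop(columns="_key")
)
print(cross)
```
|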
@timhdesilva, I added a first draft implementation in #150 - would you be able to test it? It might still be a bit rough around the edges, but for my very limited test cases (on my local machine) it worked. |
@nils-braun, thanks for the response! What is the best way to update dask-sql so I can try your change? I installed the package with conda, so I am not sure how to get the updated version onto my local machine. Thanks (sorry, I am a GitHub beginner)! |
Great question @timhdesilva! You can follow the instructions from the documentation if you want. The only thing you need to change:
If that does not help, please contact me again! I am very happy that you are trying it out! |
Hi @timhdesilva - did you have a chance to look into the issue? No pressure; just did not want to be the one you are waiting for :-) |
Hi Nils,
Unfortunately I haven't gotten around to it yet - I came up with a slower solution that works. I'm hoping to have some time to try it out later this week, but I'm not sure - do keep me updated if you make any further updates!
|
Ok, then I will merge the PR with my fix. At least in my tests it worked and also did not run into the memory error. |
I am closing this issue for now. If the problem comes up again, feel free to ping me or re-open.