DOC-#57585: Add Use Modin section on Scaling to large datasets page (#57586)
@@ -374,5 +374,101 @@ datasets.
You can see more Dask examples at https://examples.dask.org.
Use Modin
---------
Modin_ is a scalable dataframe library, which has a drop-in replacement API for pandas and
Review discussion:

jbrockmendel: "has a drop-in" -> "aims to be a drop-in" would be more accurate.

PR author: @jbrockmendel, could you please elaborate on what gave you this opinion? The user can proceed using the pandas API after replacing the import statement.

jbrockmendel: In my experience working on Modin, lots of behaviors didn't quite match. Recall I presented a ModinExtensionArray and about half of the tests could not be made to pass because of these.

PR author: I see, thanks! Yes, "aims to be a drop-in" would be more accurate. Fixed.
provides the ability to scale pandas workflows across available nodes and CPUs, and
to work with larger-than-memory datasets. To start working with Modin you just need
to replace a single line of code, namely, the import statement.
.. ipython:: python

   # import pandas as pd
   import modin.pandas as pd
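The single-line swap above is the whole migration step. As a rough sketch (the `USE_MODIN` flag is hypothetical, and the Modin branch assumes Modin is installed), downstream code written against the pandas API stays identical either way:

```python
# Hypothetical flag to toggle the backend; only the import line differs.
USE_MODIN = False

if USE_MODIN:
    import modin.pandas as pd  # assumption: modin is installed
else:
    import pandas as pd

# Downstream code is identical for both backends.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
total = df["a"].sum()
print(total)  # -> 6
```

This is the sense in which Modin positions itself as a drop-in replacement: the call sites do not change, only the import.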
After you have changed the import statement, you can proceed using the well-known pandas API
to scale computation. Modin distributes computation across the available nodes and CPUs, utilizing
the execution engine it runs on. As of Modin 0.27.0, the following execution engines are supported:
Ray_, Dask_, `MPI through unidist`_, and HDK_. Modin partitions a DataFrame along both columns and rows,
which gives it flexibility and scalability in both the number of columns and the number of rows.
Let's take a look at how we can read data from a CSV file with Modin the same way as with pandas,
and perform a simple operation on the data.
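Modin's internal partition objects are not part of its public API, but the row-and-column blocking described above can be sketched with plain NumPy (`partition_2d` and the block counts below are made up for illustration):

```python
import numpy as np

def partition_2d(arr, row_parts, col_parts):
    """Split a 2-D array into a grid of blocks along both axes,
    mimicking a row-and-column partitioning scheme."""
    row_blocks = np.array_split(arr, row_parts, axis=0)
    return [np.array_split(rb, col_parts, axis=1) for rb in row_blocks]

data = np.arange(24).reshape(6, 4)
grid = partition_2d(data, row_parts=3, col_parts=2)

# A 3 x 2 grid of partitions, each 2 rows x 2 columns here,
# which independent workers could process in parallel.
print(len(grid), len(grid[0]), grid[0][0].shape)  # -> 3 2 (2, 2)
```

Because the split happens along both axes, the scheme scales whether the DataFrame grows in rows, in columns, or in both.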
.. ipython:: python

   import pandas
   import modin.pandas as pd
   import numpy as np

   array = np.random.uniform(low=0.1, high=1.0, size=(2 ** 20, 2 ** 8))
   filename = "example.csv"
   np.savetxt(filename, array, delimiter=",")

   %time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
   CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s
   Wall time: 52.5 s
   %time pandas_df = pandas_df.map(lambda x: x + 0.01)
   CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s
   Wall time: 56.5 s
Review discussion:

Reviewer: I'm guessing you mean for this to be a static code block - use [...] In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to [...]

PR author: Applied. I understand that the binary operation is much more efficient than applying a function with [...]

Reviewer: It seems to me that a reader could easily come away from the existing example thinking "Modin can speed up simple arithmetic operations by 25x over pandas", a statement which is not true. In fact, I'm seeing 0.26 seconds when doing [...] I see a few ways forward. I think you are trying to convey the benefit of using arbitrary Python functions with Modin; this could be stated up front explicitly, with a note at the end benchmarking the more performant pandas version. Alternatively, you could use a Python function for which there is no simple pandas equivalent.

PR author: Grabbed an example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html to use a custom function for applying on the data. Hope it is okay for now.
   %time modin_df = pd.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
   CPU times: user 9.49 s, sys: 2.72 s, total: 12.2 s
   Wall time: 17.5 s
   %time modin_df = modin_df.map(lambda x: x + 0.01)
   CPU times: user 5.74 s, sys: 1e+03 ms, total: 6.74 s
   Wall time: 2.54 s
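As suggested in the review discussion above, the kind of workload where distributing `map` pays off is an arbitrary Python function with no simple vectorized equivalent. A minimal sketch in plain pandas (the `shorten` rule is invented for illustration; with Modin only the import line would change):

```python
import pandas as pd

def shorten(text):
    """Arbitrary Python logic with no one-line vectorized equivalent:
    keep the first two words and mark any truncation."""
    words = text.split()
    return " ".join(words[:2]) + ("..." if len(words) > 2 else "")

s = pd.Series(["hawk, bald eagle", "a very long description of a bird"])
shortened = s.map(shorten)
print(shortened.tolist())  # -> ['hawk, bald...', 'a very...']
```

Per-element callables like this run in the Python interpreter, so they benefit far more from distributed execution than simple arithmetic, which pandas already vectorizes.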
We can see that Modin has been able to perform the operations much faster than pandas by distributing execution.
Even though Modin aims to speed up every single pandas operation, there are cases where pandas still outperforms it.
This can happen when the data size is relatively small, or when Modin has not yet implemented a certain operation
in an efficient way. Also, for-loops are an antipattern for Modin, since it was initially designed to efficiently handle
heavy tasks rather than a large number of small ones. Still, Modin is actively working on eliminating all these drawbacks.
Review discussion:

PR author: Good catch! I meant [...]

Reviewer: nit: "big number" sounded off to me for some reason. A couple of quick searches show very consistent recommendations to use "large number" instead, e.g. Cambridge Dictionary.

PR author: Fixed, thanks!
What you can do for now in such cases is to use pandas where it is more beneficial than Modin.
.. ipython:: python

   from modin.pandas.io import to_pandas

   %%time
   pandas_subset = pandas_df.iloc[:100000]
   for col in pandas_subset.columns:
       pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
   CPU times: user 210 ms, sys: 84.4 ms, total: 294 ms
   Wall time: 293 ms

   %%time
   modin_subset = modin_df.iloc[:100000]
   for col in modin_subset.columns:
       modin_subset[col] = modin_subset[col] / modin_subset[col].sum()
   CPU times: user 18.2 s, sys: 2.35 s, total: 20.5 s
   Wall time: 20.9 s

   %%time
   pandas_subset = to_pandas(modin_df.iloc[:100000])
   for col in pandas_subset.columns:
       pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
   CPU times: user 566 ms, sys: 279 ms, total: 845 ms
   Wall time: 731 ms
You could also rewrite this code slightly to get the same result with much less execution time.
.. ipython:: python

   %%time
   modin_subset = modin_df.iloc[:100000]
   modin_subset = modin_subset / modin_subset.sum(axis=0)
   CPU times: user 531 ms, sys: 134 ms, total: 666 ms
   Wall time: 374 ms
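The same rewrite is also the idiomatic form in plain pandas. A small self-contained sketch (sizes and column names are arbitrary) confirming that the per-column loop and the single vectorized expression produce identical results:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 4)), columns=list("abcd"))

# Per-column Python loop, as in the example above.
looped = df.copy()
for col in looped.columns:
    looped[col] = looped[col] / looped[col].sum()

# Single vectorized expression: one division, no Python-level loop.
# df.sum(axis=0) is a Series indexed by column, so division aligns on columns.
vectorized = df / df.sum(axis=0)

assert np.allclose(looped, vectorized)
```

Both forms normalize each column to sum to 1; the vectorized form replaces many small operations with one large one, which is exactly the shape of work Modin handles well.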
Review discussion:

Reviewer: This looks like it could be misleading, but is in fact not. It seems to me a reader might think that the same change to the pandas code would also decrease its execution time, but pandas remains roughly the same. I think it'd be clearer to add timing for the pandas equivalent here too: [...]

PR author: Added, thanks!

Reviewer: I'm not seeing the same thing on my machine. L438: you report 293ms, I get 52ms. Can you check your timings again? If they hold up, what version of pandas/NumPy are you running?

PR author: The times might differ between my runs and yours. I have an Intel(R) Xeon(R) Platinum 8468. Reran again and updated the timings. I have pandas 2.2.0 and numpy 1.26.0 installed.
For more information refer to `Modin's documentation`_ or dip into `Modin's tutorials`_ right away
to start scaling pandas operations with Modin and an execution engine you like.
Review discussion:

Reviewer: Maybe I'm being picky, but since this is in the pandas docs, I'd remove the last part that sounds like the end of a sales pitch. ;) In my opinion, just: "For more information refer to [...]"

Reviewer: +1

PR author: Agree with you. Changed.
.. _Modin: https://github.com/modin-project/modin
.. _`Modin's documentation`: https://modin.readthedocs.io/en/latest
.. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution
.. _Ray: https://github.com/ray-project/ray
.. _Dask: https://dask.org
.. _`MPI through unidist`: https://github.com/modin-project/unidist
.. _HDK: https://github.com/intel-ai/hdk
.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html
Review discussion:

Reviewer: Just my two cents. I would simply remove both Dask and Modin here and add a link to pandas/web/pandas/community/ecosystem.md (line 344 in a0784d2). A lot of the information here is duplicated between Dask, Modin, and others too, such as Pandas API on Spark (Koalas). They all aim for drop-in replacement, they all do out-of-core execution across multiple nodes, partitioning, scalability.

Reviewer: Or, I would at least deduplicate them here if we want to keep those sections, e.g. [...]

PR author: I would argue against the statement "They all aim for drop-in replacement". Modin is supposed to work after changing the import statement import pandas as pd to import modin.pandas as pd and vice versa, while Dask or Koalas require the code to be rewritten, to the best of my knowledge. Anyway, if everyone thinks that the "Scaling to large datasets" page should only contain pandas-related stuff, I can go ahead and remove the Modin and Dask related content from that page and move the changes from this PR to the ecosystem page. Thoughts?

Reviewer: I'm not really a maintainer here and don't have a super strong opinion. I'll defer to pandas maintainers.