
DOC-#57585: Add Use Modin section on Scaling to large datasets page #57586


Merged
merged 8 commits into from
Mar 7, 2024
96 changes: 96 additions & 0 deletions doc/source/user_guide/scale.rst
@@ -374,5 +374,101 @@ datasets.

You can see more Dask examples at https://examples.dask.org.

Use Modin
Contributor:

Just my two cents. I would just simply remove both Dask and Modin here, add a link to

A lot of information here is duplicated between Dask, Modin, and others too, such as Pandas API on Spark (Koalas). They all aim to be drop-in replacements, they all run out-of-core across multiple nodes, with partitioning and scalability.

Contributor:

Or, I would at least deduplicate them here if we want to keep those sections. e.g.,

# Using third-party libraries

multiple nodes blah blah..

## Dask

oneliner explanation

Link

## Modin

oneliner explanation

Link

## PySpark: Pandas API on Spark

oneliner explanation

Link


Contributor Author:

A lot of information here is duplicated between Dask, Modin, and others too, such as Pandas API on Spark (Koalas). They all aim to be drop-in replacements, they all run out-of-core across multiple nodes, with partitioning and scalability.

I would argue against the statement They all aim drop-in replacement. Modin is supposed to work after changing the import statement import pandas as pd to import modin.pandas as pd and vice versa, while Dask or Koalas require the code to be rewritten to the best of my knowledge. Anyway, if everyone thinks that Scaling to large datasets page should only contain pandas related stuff, I can go ahead and remove Modin and Dask related stuff from that page and move the changes from this PR to the ecosystem page. Thoughts?

Contributor:

I'm not really a maintainer here and don't have a super strong opinion. I'll defer to pandas maintainers.

---------

Modin_ is a scalable dataframe library, which aims to be a drop-in replacement for pandas and
Member:

"has a drop-in" -> "aims to be a drop-in" would be more accurate

Contributor Author:

@jbrockmendel, could you please elaborate on what gave you this opinion? The user can proceed using pandas API after replacing the import statement.

Member:

In my experience working on Modin, lots of behaviors didn't quite match. Recall I presented a ModinExtensionArray and about half of the tests could not be made to pass because of these mismatches.

Contributor Author:

I see, thanks! Yes, "aims to be a drop-in" would be more accurate. Fixed.

provides the ability to scale pandas workflows across the available nodes and CPUs and
to work with larger-than-memory datasets. To start working with Modin you just need
to replace a single line of code, namely, the import statement.

.. code-block:: python

   # import pandas as pd
   import modin.pandas as pd

After you have changed the import statement, you can proceed using the familiar pandas API
to scale computation. Modin distributes computation across the available nodes and CPUs using
the execution engine it runs on. As of Modin 0.27.0, the following execution engines are
supported: Ray_, Dask_, `MPI through unidist`_, and HDK_. Modin partitions a DataFrame along
both columns and rows, which gives it flexibility and scalability in both the number of columns
and the number of rows. Let's take a look at how we can read data from a CSV file with Modin
the same way as with pandas and perform a simple operation on the data.
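As an aside, the two-axis partitioning idea can be sketched with a small helper (hypothetical code for illustration only, not Modin's actual API):

```python
def block_bounds(n, n_parts):
    """Split the range [0, n) into n_parts nearly equal contiguous slices."""
    step, rem = divmod(n, n_parts)
    bounds, start = [], 0
    for i in range(n_parts):
        # Spread the remainder over the first `rem` partitions.
        stop = start + step + (1 if i < rem else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

# A 6x4 frame split into 2 row partitions and 2 column partitions
# yields 4 blocks that can be processed independently.
row_parts = block_bounds(6, 2)   # [(0, 3), (3, 6)]
col_parts = block_bounds(4, 2)   # [(0, 2), (2, 4)]
blocks = [(r, c) for r in row_parts for c in col_parts]
```

Partitioning along both axes is what lets an engine scale row-wise operations (e.g. map) and column-wise operations (e.g. reductions) alike.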

.. code-block:: ipython

   import pandas
   import modin.pandas as pd
   import numpy as np

   array = np.random.uniform(low=0.1, high=1.0, size=(2 ** 20, 2 ** 8))
   filename = "example.csv"
   np.savetxt(filename, array, delimiter=",")

   %time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
   CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s
   Wall time: 52.5 s
   %time pandas_df = pandas_df.map(lambda x: x + 0.01)
   CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s
   Wall time: 56.5 s
Member:

I'm guessing you mean for this to be a static code block - use .. code-block:: ipython instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.

In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to pandas_df = pandas_df + 0.01

Contributor Author:

I'm guessing you mean for this to be a static code block - use .. code-block:: ipython instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.

Applied .. code-block:: ipython. Thanks.

In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to pandas_df = pandas_df + 0.01

I understand that the binary operation is much more efficient than applying a function with map. Since the user may apply any function on the data, I wanted to show how it performs with pandas and Modin. Do you mind if we leave this as is?

Member (@rhshadrach, Feb 26, 2024):

Do you mind we leave this as is?

It seems to me that a reader could easily come away from the existing example thinking "Modin can speed up simple arithmetic operations by 25x over pandas" - a statement which is not true. In fact, I'm seeing 0.26 seconds when doing pandas_df + 0.01 on my machine. Because of this - I do mind.

I see a few ways forward. I think you are trying to convey the benefit of using arbitrary Python functions with Modin - this could be stated up front explicitly, with a note at the end benchmarking the more performant pandas version. Alternatively, you could use a Python function for which there is no simple pandas equivalent.

Contributor Author:

Grabbed an example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html to use a custom function for applying on the data. Hope it is okay for now.


   %time modin_df = pd.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
   CPU times: user 9.49 s, sys: 2.72 s, total: 12.2 s
   Wall time: 17.5 s
   %time modin_df = modin_df.map(lambda x: x + 0.01)
   CPU times: user 5.74 s, sys: 1e+03 ms, total: 6.74 s
   Wall time: 2.54 s
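Note that the %time magics above only work inside IPython; outside of it, a rough stdlib equivalent can be sketched like this (a generic helper for illustration, not part of pandas or Modin):

```python
import time

def timed(fn, *args, **kwargs):
    # Rough stand-in for IPython's %time: wall-clock a single call.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Wall time: {elapsed:.3f} s")
    return result

# Usage, e.g.: df = timed(pd.read_csv, filename)
squares = timed(lambda: [i * i for i in range(1000)])
```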

We can see that Modin has been able to perform the operations much faster than pandas thanks to distributed execution.
Even though Modin aims to speed up every single pandas operation, there are cases when pandas outperforms it.
This may happen if the data size is relatively small, or if Modin hasn't yet implemented a certain operation
in an efficient way. Also, for-loops are an antipattern for Modin, since Modin was designed to efficiently handle
heavy tasks rather than a large number of small ones. Yet, Modin is actively working on eliminating all these drawbacks.
Member:

rather than a number of small ones I think?

Contributor Author:

Good catch! I meant rather a big number of small ones. Fixed.

Member:

nit: "big number" sounded off to me for some reason. A couple of quick searches I'm seeing very consistent recommendations to use "large number" instead, e.g. Cambridge Dictionary.

Contributor Author:

Fixed, thanks!

What you can do for now in such cases is to use pandas where it is more beneficial than Modin.

.. code-block:: ipython

   from modin.pandas.io import to_pandas

   %%time
   pandas_subset = pandas_df.iloc[:100000]
   for col in pandas_subset.columns:
       pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
   CPU times: user 210 ms, sys: 84.4 ms, total: 294 ms
   Wall time: 293 ms

   %%time
   modin_subset = modin_df.iloc[:100000]
   for col in modin_subset.columns:
       modin_subset[col] = modin_subset[col] / modin_subset[col].sum()
   CPU times: user 18.2 s, sys: 2.35 s, total: 20.5 s
   Wall time: 20.9 s

   %%time
   pandas_subset = to_pandas(modin_df.iloc[:100000])
   for col in pandas_subset.columns:
       pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum()
   CPU times: user 566 ms, sys: 279 ms, total: 845 ms
   Wall time: 731 ms

You could also rewrite this code a bit to get the same result with much less execution time.

.. code-block:: ipython

   %%time
   modin_subset = modin_df.iloc[:100000]
   modin_subset = modin_subset / modin_subset.sum(axis=0)
   CPU times: user 531 ms, sys: 134 ms, total: 666 ms
   Wall time: 374 ms
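For comparison, the plain-pandas equivalent of this vectorized rewrite can be sketched as below (using a small random stand-in frame for illustration; the original example uses a much larger CSV-backed DataFrame):

```python
import numpy as np
import pandas as pd

# Small stand-in data; the original example reads a 2**20 x 2**8 CSV.
pandas_df = pd.DataFrame(np.random.rand(1000, 8),
                         columns=[f"col{i}" for i in range(8)])

pandas_subset = pandas_df.iloc[:100]
# Divide every column by its sum in one vectorized operation,
# replacing the per-column for-loop above.
pandas_subset = pandas_subset / pandas_subset.sum()
```

After the division, each column of `pandas_subset` sums to 1.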
Member:

This looks like it could be misleading, but is in fact not. It seems to me a reader might think that the same change to the pandas code would also decrease its execution time but pandas remains roughly the same. I think it'd be clearer to add timing for the pandas equivalent here too:

pandas_subset = pandas_df.iloc[:100000]
pandas_subset = pandas_subset / pandas_subset.sum()

Contributor Author:

Added, thanks!

Member:

I'm not seeing the same thing on my machine.

L438: You report 293ms, I get 52ms
L462: You report 72.2ms, I get 69ms

Can you check your timings again? If they hold up, what version of pandas/NumPy are you running?

Contributor Author:

The times may differ between my runs and yours. I have an Intel(R) Xeon(R) Platinum 8468. I reran and updated the timings. I have these pandas and numpy versions installed.

pandas                    2.2.0
numpy                     1.26.0


For more information refer to `Modin's documentation`_ or `Modin's tutorials`_.
Member:

Maybe I'm being picky, but since this is in the pandas docs, I'd remove the last part that sounds like the end of a sales pitch. ;)

In my opinion, just: "For more information refer to `Modin's documentation`_ or `Modin's tutorials`_." would feel more appropriate.

Member:

+1

Contributor Author:

Agree with you. Changed.


.. _Modin: https://github.com/modin-project/modin
.. _`Modin's documentation`: https://modin.readthedocs.io/en/latest
.. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution
.. _Ray: https://github.com/ray-project/ray
.. _Dask: https://dask.org
.. _`MPI through unidist`: https://github.com/modin-project/unidist
.. _HDK: https://github.com/intel-ai/hdk
.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html