Skip to content

DOC-#57585: Add Use Modin section on Scaling to large datasets page #57586

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 7, 2024
29 changes: 29 additions & 0 deletions doc/source/user_guide/scale.rst
Original file line number Diff line number Diff line change
Expand Up @@ -374,5 +374,34 @@ datasets.

You see more dask examples at https://examples.dask.org.

Use Modin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my two cents. I would just simply remove both Dask and Modin here, add a link to

A lot of information here is duplicated with Dask vs Modin and others too such as Pandas API on Spark (Koalas). They all aim drop-in replacement, they all out-of-core in multiple nodes, partitioning, scalability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, I would at least deduplicate them here if we want to keep those sections. e.g.,

# Using thirdparty libraries

multiple nodes blah blah..

## Dask

oneliner explanation

Link

## Modin

oneliner explanation

Link

## PySpark: Pandas API on Spark

oneliner explanation

Link


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of information here is duplicated with Dask vs Modin and others too such as Pandas API on Spark (Koalas). They all aim drop-in replacement, they all out-of-core in multiple nodes, partitioning, scalability.

I would argue against the statement They all aim drop-in replacement. Modin is supposed to work after changing the import statement import pandas as pd to import modin.pandas as pd and vice versa, while Dask or Koalas require the code to be rewritten to the best of my knowledge. Anyway, if everyone thinks that Scaling to large datasets page should only contain pandas related stuff, I can go ahead and remove Modin and Dask related stuff from that page and move the changes from this PR to the ecosystem page. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a maintainer here and don't have a super strong opinion. I'll defer to pandas maintainers.

---------

Modin_ is a scalable dataframe library, which aims to be a drop-in replacement API for pandas and
provides the ability to scale pandas workflows across nodes and CPUs available. It is also able
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thing, but maybe you can add the part about execution engines here, since in the next paragraph you say Modin distributes computation across nodes and CPUs available which seems to pretty much repeat the same thing again to conextualize that part?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to rephrase the first and the second paragraphs by following your comment. To me the current state of paragraphs is pretty concise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, this intro paragraph aims to be a high level description of Modin. With that in mind, getting into execution engines here seems undesirable.

to work with larger than memory datasets. To start working with Modin you just need
to replace a single line of code, namely, the import statement.

.. code-block:: ipython

# import pandas as pd
import modin.pandas as pd

After you have changed the import statement, you can proceed using the well-known pandas API
to scale computation. Modin distributes computation across nodes and CPUs available utilizing
an execution engine it runs on. At the time of Modin 0.27.0 the following execution engines are supported
in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of a Modin DataFrame partitions it
along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and
the number of rows.
Comment on lines +393 to +395
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it's just me, but the part about partitioning seems too technical and out of context for this short intro. Feel free to disagree if you think it's key for users to know this asap when considering Modin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this page is not a getting started, I think it is fine to say some words here about partitioning, which is flexible along columns and row.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think being both columnar and row-wise is an important feature to mention (since some things only scale well in one axis).


For more information refer to `Modin's documentation`_ or dip into `Modin's tutorials`_ right away
to start scaling pandas operations with Modin and an execution engine you like.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm being picky, but since this is in the pandas docs, I'd remove the last part that sounds like the end of a sales pitch. ;)

In my opinion, just: "For more information refer to Modin's documentation_ or the Modin's tutorials_. " would feel more appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with you. Changed.


.. _Modin: https://github.com/modin-project/modin
.. _`Modin's documentation`: https://modin.readthedocs.io/en/latest
.. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution
.. _Ray: https://github.com/ray-project/ray
.. _Dask: https://dask.org
.. _`MPI through unidist`: https://github.com/modin-project/unidist
.. _HDK: https://github.com/intel-ai/hdk
.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html