DOC-#57585: Add Use Modin section on Scaling to large datasets page #57586


Merged: 8 commits merged into pandas-dev:main on Mar 7, 2024

Conversation

@YarShev (Contributor) commented Feb 23, 2024

@YarShev YarShev force-pushed the dev/yigoshev-issue57585 branch from 4f981cb to 4580197 Compare February 23, 2024 15:04
@YarShev (Contributor, Author) commented Feb 23, 2024:

pre-commit.ci autofix

@rhshadrach (Member) left a comment:

Thanks for the PR!

Comment on lines 404 to 413
array = np.random.uniform(low=0.1, high=1.0, size=(2 ** 20, 2 ** 8))
filename = "example.csv"
np.savetxt(filename, array, delimiter=",")

%time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s
Wall time: 52.5 s
%time pandas_df = pandas_df.map(lambda x: x + 0.01)
CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s
Wall time: 56.5 s
Member:

I'm guessing you mean for this to be a static code block - use .. code-block:: ipython instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.
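For reference, a static block with the suggested directive would look roughly like this (the timings shown are the ones quoted above, not re-measured):

```rst
.. code-block:: ipython

   In [1]: %time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
   CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s
   Wall time: 52.5 s
```

Unlike `.. ipython:: python`, this directive renders the text verbatim and executes nothing at doc-build time.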

In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to pandas_df = pandas_df + 0.01

YarShev (Contributor, Author):

> I'm guessing you mean for this to be a static code block - use .. code-block:: ipython instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.

Applied .. code-block:: ipython. Thanks.

> In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to pandas_df = pandas_df + 0.01

I understand that the binary operation is much more efficient than applying a function with map. Since the user may apply any function to the data, I wanted to show how that performs with pandas and Modin. Do you mind if we leave this as is?

@rhshadrach (Member) commented Feb 26, 2024:

> Do you mind we leave this as is?

It seems to me that a reader could easily come away from the existing example thinking "Modin can speed up simple arithmetic operations by 25x over pandas" - a statement which is not true. In fact, I'm seeing 0.26 seconds when doing pandas_df + 0.01 on my machine. Because of this - I do mind.

I see a few ways forward. I think you are trying to convey the benefit of using arbitrary Python functions with Modin - this could be stated up front explicitly and a note at the end benchmarking the more performant pandas version. Alternatively, you could use a Python function for which there is no simple pandas equivalent.

YarShev (Contributor, Author):

Grabbed an example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html to use a custom function for applying on the data. Hope it is okay for now.

Even though Modin aims to speed up each single pandas operation, there are cases when pandas outperforms it.
This may happen if the data size is relatively small or if Modin hasn't yet implemented a certain operation
in an efficient way. Also, for-loops are an antipattern for Modin since Modin has initially been designed to efficiently handle
heavy tasks, rather a small number of small ones. Yet, Modin is actively working on eliminating all these drawbacks.
Member:

rather than a number of small ones I think?

YarShev (Contributor, Author):

Good catch! I meant rather a big number of small ones. Fixed.

Member:

nit: "big number" sounded off to me for some reason. A couple of quick searches I'm seeing very consistent recommendations to use "large number" instead, e.g. Cambridge Dictionary.

YarShev (Contributor, Author):

Fixed, thanks!

Comment on lines 454 to 462
You could also rewrite this code a bit to get the same result with much less execution time.

.. ipython:: python

%%time
modin_subset = modin_df.iloc[:100000]
modin_subset = modin_subset / modin_subset.sum(axis=0)
CPU times: user 531 ms, sys: 134 ms, total: 666 ms
Wall time: 374 ms
Member:

This looks like it could be misleading, but is in fact not. It seems to me a reader might think that the same change to the pandas code would also decrease its execution time but pandas remains roughly the same. I think it'd be clearer to add timing for the pandas equivalent here too:

pandas_subset = pandas_df.iloc[:100000]
pandas_subset = pandas_subset / pandas_subset.sum()
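As a sanity check of the suggested pandas equivalent: dividing a subset by its column sums normalizes each column to sum to 1 (the data here is hypothetical and small; the PR benchmarks a 2**20 x 2**8 frame):

```python
# Normalize the first 100 rows column-wise; sum() defaults to axis=0.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pandas_df = pd.DataFrame(rng.uniform(0.1, 1.0, size=(1000, 4)),
                         columns=[f"col{i}" for i in range(4)])

pandas_subset = pandas_df.iloc[:100]
pandas_subset = pandas_subset / pandas_subset.sum()

# every column of the result now sums to 1
print(pandas_subset.sum().round(6).tolist())
```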

YarShev (Contributor, Author):

Added, thanks!

Member:

I'm not seeing the same thing on my machine.

L438: You report 293ms, I get 52ms
L462: You report 72.2ms, I get 69ms

Can you check your timings again? If they hold up, what version of pandas/NumPy are you running?

YarShev (Contributor, Author):

The timings may well differ between my machine and yours. I have an Intel(R) Xeon(R) Platinum 8468. I reran and updated the timings. These are the pandas and numpy versions I have installed:

pandas                    2.2.0
numpy                     1.26.0

Signed-off-by: Igoshev, Iaroslav <[email protected]>
@YarShev (Contributor, Author) commented Feb 26, 2024:

@rhshadrach, could you shed some light on why Doc Build is failing and what we should do to fix it?

Use Modin
---------

Modin_ is a scalable dataframe library, which has a drop-in replacement API for pandas and
@jbrockmendel (Member):

"has a drop-in" -> "aims to be a drop-in" would be more accurate

YarShev (Contributor, Author):

@jbrockmendel, could you please elaborate on what gave you this opinion? The user can proceed using pandas API after replacing the import statement.
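The drop-in claim boils down to swapping a single import. A sketch of what that looks like (it assumes modin and an execution engine such as Ray or Dask are installed; the fallback to plain pandas is added here only so the snippet runs either way):

```python
# Hypothetical illustration of the import swap; everything below the
# import is ordinary pandas-API code and stays unchanged.
try:
    import modin.pandas as pd  # the drop-in module path, if Modin is available
except ImportError:
    import pandas as pd        # fallback so this sketch still runs

df = pd.DataFrame({"a": [1, 2, 3]})
total = int(df["a"].sum())
print(total)
```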

@jbrockmendel (Member):

In my experience working on Modin, lots of behaviors didn't quite match. Recall I presented a ModinExtensionArray and about half of the tests could not be made to pass because of these mismatches.

YarShev (Contributor, Author):

I see, thanks! Yes, "aims to be a drop-in" would be more accurate. Fixed.

@rhshadrach (Member):

> @rhshadrach, could you shed some light on why Doc Build is failing and what we should do to fix it?

L474, documentation is spelled incorrectly.

Signed-off-by: Igoshev, Iaroslav <[email protected]>
@YarShev (Contributor, Author) commented Feb 27, 2024:

> > @rhshadrach, could you shed some light on why Doc Build is failing and what we should do to fix it?
>
> L474, documentation is spelled incorrectly.

Oh, I see, thanks! Fixed.

@datapythonista (Member):

I see the Dask section was added 5 years ago, when there were probably no other alternatives. It's surely valuable for users to know about other tools that distribute work with a pandas API, but I don't think our user guide is the place to host and maintain a short tutorial of Dask, Modin, PySpark/Koalas, and any other system that provides a distributed pandas-like API.

What do you think about each project having, in its own documentation, a page for pandas users that we can simply link to here? @phofl @YarShev @HyukjinKwon

@YarShev (Contributor, Author) commented Feb 28, 2024:

> I see the Dask section was added 5 years ago, when there were probably no other alternatives. It's surely valuable for users to know about other tools that distribute work with a pandas API, but I don't think our user guide is the place to host and maintain a short tutorial of Dask, Modin, PySpark/Koalas, and any other system that provides a distributed pandas-like API.
>
> What do you think about each project having, in its own documentation, a page for pandas users that we can simply link to here? @phofl @YarShev @HyukjinKwon

While other projects can have a documentation page comparing their tool with pandas and explaining which library to use and when, I think mentioning alternatives to pandas in the pandas documentation itself is valuable to users, since pandas is much more popular than the alternatives and many more people will see concrete options for efficient data processing.

@datapythonista (Member):

I don't disagree with you, and I think it's valuable for users to know the other packages in the ecosystem to scale what pandas can't handle.

My point is that, in my opinion, the pandas codebase is not where so much information about Dask or Modin should live. I think we should have a paragraph so users know about the tools and what to expect from them, and all the details could go in a landing page for pandas users that you maintain in your docs and keep updated; we would simply have a link here next to the summary of what each tool does.

Just my personal opinion, but this should be just as good for users while being easier to maintain and keep up to date.

@YarShev (Contributor, Author) commented Feb 28, 2024:

I tried to make the changes as concise as possible, giving some pros and cons of Modin and when it should be used. I don't see anything in the changes that could be hard to maintain in the future. If you have any improvements we could introduce in the paragraph, please let me know. It would also be great to hear opinions from other reviewers. cc @rhshadrach, @jbrockmendel

@rhshadrach (Member):

Building on what @datapythonista said - I don't see much advantage in having the documentation as it is in this PR vs just the intro paragraph ending with a line like See more information in [modin's documentation](https://modin.readthedocs.io/en/stable/getting_started/quickstart.html#example-instant-scalability-with-no-extra-effort). (or a better landing spot if there is one).

> I don't see anything in the changes that could be hard to maintain in future.

Changes to any of the third-party packages currently listed would necessitate a change to the pandas documentation, which is therefore quite likely to go stale. In addition, for each new tool that comes along we have to review another PR. It is all around less maintenance for us to have a summary paragraph and a link to each project's documentation where they compare themselves to pandas.

What is the disadvantage of having a link rather than the documentation living in pandas itself?

@HyukjinKwon (Contributor) commented Feb 29, 2024:

@datapythonista, what about having pandas here point to the tutorial of each individual project?

Basically I support:

> the intro paragraph ending with a line like See more information in modin's documentation. (or a better landing spot if there is one).

We won't have to duplicate the tutorials between the pandas page and each individual project's page.

Apache Spark does something similar: https://spark.apache.org/third-party-projects.html

FWIW, Koalas has been migrated into PySpark as the pandas-API-on-Spark component, so I have to update that in any event.

@datapythonista (Member):

> @datapythonista, what about having pandas here point to the tutorial of each individual project?

Sounds great. My idea is that users reaching our page about scaling should know about the relevant options, but without pandas having lots of details in our docs when we can simply link to each project's docs. I don't know the exact page to link to; maybe a dedicated landing page for users coming from pandas, maybe the home of the Koalas documentation. No strong preference from my side.

Also worth mentioning that we already do this on the Ecosystem page of our website. Maybe for now I can just copy what's there in a separate PR, and you all can review it or follow up with any changes if needed. What do you think?

Signed-off-by: Igoshev, Iaroslav <[email protected]>
@YarShev (Contributor, Author) commented Feb 29, 2024:

@datapythonista, @HyukjinKwon, @rhshadrach, all your thoughts make sense to me now. We should probably move the detailed information about Modin and its joint usage with pandas into a single flow in Modin's documentation. I left three small paragraphs about Modin in this PR. How does it look to you now?

@datapythonista (Member) left a comment:

Thanks for the flexibility and for updating this @YarShev. I added a couple of very opinionated comments that in my opinion would make this a bit more concise and to the point. Feel free to disagree if you don't like them.

---------

Modin_ is a scalable dataframe library, which aims to be a drop-in replacement API for pandas and
provides the ability to scale pandas workflows across nodes and CPUs available. It is also able
@datapythonista (Member):

Small thing, but maybe you can add the part about execution engines here, since the next paragraph says Modin distributes computation across available nodes and CPUs, which seems to pretty much repeat the same thing, and mentioning the engines would contextualize that part?

YarShev (Contributor, Author):

I wasn't able to rephrase the first and second paragraphs following your comment. To me, the current state of the paragraphs is pretty concise.

Member:

To me, this intro paragraph aims to be a high level description of Modin. With that in mind, getting into execution engines here seems undesirable.

Comment on lines +393 to +395
in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of a Modin DataFrame partitions it
along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and
the number of rows.
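For intuition only, 2D block partitioning in the spirit described above can be sketched with NumPy (`block_partition` is a made-up helper for illustration, not Modin's actual API; Modin's real partitioning is internal):

```python
# Split an array into a grid of blocks along both axes, so work can be
# distributed by rows, by columns, or by individual blocks.
import numpy as np

def block_partition(arr, row_parts, col_parts):
    """Split an array into a row_parts x col_parts grid of blocks."""
    return [np.array_split(row_chunk, col_parts, axis=1)
            for row_chunk in np.array_split(arr, row_parts, axis=0)]

data = np.arange(12 * 8).reshape(12, 8)
grid = block_partition(data, row_parts=3, col_parts=2)
print(len(grid), len(grid[0]), grid[0][0].shape)
```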
Member:

Not sure if it's just me, but the part about partitioning seems too technical and out of context for this short intro. Feel free to disagree if you think it's key for users to know this asap when considering Modin.

YarShev (Contributor, Author):

Since this page is not a getting-started guide, I think it is fine to say a few words here about partitioning, which is flexible along both columns and rows.

Member:

I think being both columnar and row-wise is an important feature to mention (since some things only scale well in one axis).

the number of rows.

For more information refer to `Modin's documentation`_ or dip into `Modin's tutorials`_ right away
to start scaling pandas operations with Modin and an execution engine you like.
Member:

Maybe I'm being picky, but since this is in the pandas docs, I'd remove the last part that sounds like the end of a sales pitch. ;)

In my opinion, just "For more information refer to Modin's documentation_ or Modin's tutorials_." would feel more appropriate.

Member:

+1

YarShev (Contributor, Author):

Agree with you. Changed.

Signed-off-by: Igoshev, Iaroslav <[email protected]>
@YarShev (Contributor, Author) commented Mar 1, 2024:

Ready for the next round of review. Hope it is fine now 😄

@rhshadrach (Member) left a comment:

Thanks for the changes @YarShev - this looks good to me pending the current discussions.

@YarShev (Contributor, Author) commented Mar 6, 2024:

@datapythonista, @HyukjinKwon, just a friendly reminder for review.

@@ -374,5 +374,33 @@ datasets.

You see more dask examples at https://examples.dask.org.

Use Modin
@HyukjinKwon (Contributor):

Just my two cents. I would just simply remove both Dask and Modin here, add a link to

A lot of the information here is duplicated between Dask, Modin, and others such as the pandas API on Spark (Koalas). They all aim to be drop-in replacements; they all do out-of-core execution across multiple nodes, partitioning, and scaling.

@HyukjinKwon (Contributor):

Or, I would at least deduplicate them here if we want to keep those sections, e.g.:

# Using thirdparty libraries

multiple nodes blah blah..

## Dask

oneliner explanation

Link

## Modin

oneliner explanation

Link

## PySpark: Pandas API on Spark

oneliner explanation

Link


YarShev (Contributor, Author):

> A lot of the information here is duplicated between Dask, Modin, and others such as the pandas API on Spark (Koalas). They all aim to be drop-in replacements; they all do out-of-core execution across multiple nodes, partitioning, and scaling.

I would argue against the statement that they all aim to be drop-in replacements. Modin is supposed to work after changing the import statement import pandas as pd to import modin.pandas as pd and vice versa, while Dask or Koalas require the code to be rewritten, to the best of my knowledge. Anyway, if everyone thinks the Scaling to large datasets page should only contain pandas-related content, I can go ahead and remove the Modin and Dask material from that page and move the changes from this PR to the ecosystem page. Thoughts?

@HyukjinKwon (Contributor):

I'm not really a maintainer here and don't have a super strong opinion. I'll defer to pandas maintainers.

@datapythonista (Member) left a comment:

I agree with @HyukjinKwon about the one-liners. As I mentioned earlier, I find the current phrasing too verbose and a bit repetitive considering we want something concise, but no big deal; I'm ok with getting this merged as is, and we can iterate further once we reduce the Dask section and add PySpark/Koalas.

@YarShev (Contributor, Author) commented Mar 7, 2024:

@datapythonista, sounds good! Could you please tag me in the future PRs that reduce the Dask section and add PySpark/Koalas? I would like the Modin section to be in line with those libraries' sections.

@mroeschke mroeschke added this to the 3.0 milestone Mar 7, 2024
@mroeschke mroeschke merged commit c9f876c into pandas-dev:main Mar 7, 2024
18 checks passed
@mroeschke (Member):

Thanks @YarShev. Agreed that eventually this should just link to the ecosystem docs, but this is good as is.

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…tasets` page (pandas-dev#57586)

* DOC-pandas-dev#57585: Add `Use Modin` section on `Scaling to large datasets` page

Signed-off-by: Igoshev, Iaroslav <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address comments

Signed-off-by: Igoshev, Iaroslav <[email protected]>

* Address comments

Signed-off-by: Igoshev, Iaroslav <[email protected]>

* Revert some changes

Signed-off-by: Igoshev, Iaroslav <[email protected]>

* Address comments

Signed-off-by: Igoshev, Iaroslav <[email protected]>

---------

Signed-off-by: Igoshev, Iaroslav <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Marc Garcia <[email protected]>
Successfully merging this pull request may close these issues.

DOC: Add Use Modin section on Scaling to large datasets page
6 participants