DOC-#57585: Add `Use Modin` section on `Scaling to large datasets` page (#57586)
Conversation
Force-pushed 4f981cb to 4580197

…tasets` page Signed-off-by: Igoshev, Iaroslav <[email protected]>
pre-commit.ci autofix - for more information, see https://pre-commit.ci
Thanks for the PR!
doc/source/user_guide/scale.rst (Outdated)
array = np.random.randint(low=0.1, high=1.0, size=(2 ** 20, 2 ** 8))
filename = "example.csv"
np.savetxt(filename, array, delimiter=",")

%time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s
Wall time: 52.5 s
%time pandas_df = pandas_df.map(lambda x: x + 0.01)
CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s
Wall time: 56.5 s
I'm guessing you mean for this to be a static code block - use `.. code-block:: ipython` instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.
In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to `pandas_df = pandas_df + 0.01`.
> I'm guessing you mean for this to be a static code block - use `.. code-block:: ipython` instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.

Applied `.. code-block:: ipython`. Thanks.
> In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to `pandas_df = pandas_df + 0.01`.

I understand that the binary operation is much more efficient than applying a function with `map`. Since the user may apply any function on the data, I wanted to show how it performs with pandas and Modin. Do you mind if we leave this as is?
> Do you mind if we leave this as is?

It seems to me that a reader could easily come away from the existing example thinking "Modin can speed up simple arithmetic operations by 25x over pandas" - a statement which is not true. In fact, I'm seeing 0.26 seconds when doing `pandas_df + 0.01` on my machine. Because of this - I do mind.
I see a few ways forward. I think you are trying to convey the benefit of using arbitrary Python functions with Modin - this could be stated up front explicitly and a note at the end benchmarking the more performant pandas version. Alternatively, you could use a Python function for which there is no simple pandas equivalent.
Grabbed an example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html to use a custom function for applying on the data. Hope it is okay for now.
doc/source/user_guide/scale.rst (Outdated)
Even though Modin aims to speed up each single pandas operation, there are the cases when pandas outperforms.
It might be a case if the data size is relatively small or Modin hasn't implemented yet a certain operation
in an efficient way. Also, for-loops is an antipattern for Modin since Modin has initially been designed to efficiently handle
heavy tasks, rather a small number of small ones. Yet, Modin is actively working on eliminating all these drawbacks.
"rather than a number of small ones", I think?
Good catch! I meant "rather a big number of small ones". Fixed.
nit: "big number" sounded off to me for some reason. A couple of quick searches I'm seeing very consistent recommendations to use "large number" instead, e.g. Cambridge Dictionary.
Fixed, thanks!
doc/source/user_guide/scale.rst (Outdated)
You could also rewrite this code a bit to get the same result with much less execution time.

.. ipython:: python

   %%time
   modin_subset = modin_df.iloc[:100000]
   modin_subset = modin_subset / modin_subset.sum(axis=0)

CPU times: user 531 ms, sys: 134 ms, total: 666 ms
Wall time: 374 ms
This looks like it could be misleading, but is in fact not. It seems to me a reader might think that the same change to the pandas code would also decrease its execution time, but pandas remains roughly the same. I think it'd be clearer to add timing for the pandas equivalent here too:

pandas_subset = pandas_df.iloc[:100000]
pandas_subset = pandas_subset / pandas_subset.sum()
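As a self-contained sketch of that pandas-side timing (synthetic data and hypothetical column names standing in for the PR's CSV-loaded frame):

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the frame read from CSV in the PR.
rng = np.random.default_rng(0)
pandas_df = pd.DataFrame(
    rng.random((200_000, 8)), columns=[f"col{i}" for i in range(8)]
)

start = time.perf_counter()
pandas_subset = pandas_df.iloc[:100000]
pandas_subset = pandas_subset / pandas_subset.sum()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"pandas subset normalization: {elapsed_ms:.1f} ms")

# Dividing by the column sums normalizes each column to sum to 1.
assert np.allclose(pandas_subset.sum(), 1.0)
```

Absolute timings will of course vary with hardware; the point of the review comment is only that pandas stays roughly the same before and after the rewrite.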
Added, thanks!
I'm not seeing the same thing on my machine.
L438: You report 293ms, I get 52ms
L462: You report 72.2ms, I get 69ms
Can you check your timings again? If they hold up, what version of pandas/NumPy are you running?
The timings may differ between my machine and yours. I have an Intel(R) Xeon(R) Platinum 8468. Reran again and updated the timings. I have these pandas and numpy versions installed:
pandas 2.2.0
numpy 1.26.0
Signed-off-by: Igoshev, Iaroslav <[email protected]>
@rhshadrach, could you shed some light on why …
doc/source/user_guide/scale.rst (Outdated)

Use Modin
---------

Modin_ is a scalable dataframe library, which has a drop-in replacement API for pandas and
"has a drop-in" -> "aims to be a drop-in" would be more accurate
@jbrockmendel, could you please elaborate on what gave you this opinion? The user can proceed using pandas API after replacing the import statement.
In my experience working on modin, lots of behaviors didn't quite match. Recall I presented a ModinExtensionArray and about half of the tests could not be made to pass because of these mismatches.
I see, thanks! Yes, "aims to be a drop-in" would be more accurate. Fixed.
L474: "documentation" is spelled incorrectly.
Signed-off-by: Igoshev, Iaroslav <[email protected]>
Oh, I see, thanks! Fixed.
I see the Dask section was added 5 years ago, when there were probably no other alternatives. I think it's surely valuable for users to know about other tools to distribute work with a pandas API, but I don't think our user guide is the place to have, and have to maintain, a short tutorial of Dask, Modin, PySpark/Koalas and any other system that provides a distributed pandas-like API. What do you think about having in each project's documentation a page for pandas users that we can simply link here? @phofl @YarShev @HyukjinKwon
While other projects can have a documentation page comparing their tool with pandas as to which library to use and when, I think mentioning other alternatives to pandas in the pandas documentation itself is valuable to users, since pandas is much more popular than the alternatives, and many more people will be able to see concrete options for efficient data processing.
I don't disagree with you, and I think it's valuable for users to know the other packages in the ecosystem that scale what pandas can't handle. My point is that in my opinion the pandas codebase is not where so much information about Dask or Modin should live. I think we should have a paragraph so users know about the tools and what to expect from them, and all the details could be in a landing page for pandas users that you maintain in your docs and keep updated, and that we simply link here next to the summary of what each tool does. Just my personal opinion, but this should be just as good for the users, and easier to maintain and keep updated.
I tried to make the changes as concise as possible, giving some pros and cons of Modin and when it should be used. I don't see anything in the changes that could be hard to maintain in the future. If you have any improvements we could introduce in the paragraph, please let me know. It would also be great to hear opinions from other reviewers. cc @rhshadrach, @jbrockmendel
Building on what @datapythonista said - I don't see much advantage in having the documentation as it is in this PR vs just the intro paragraph ending with a line like
Changes to any of the third party packages currently listed would necessitate a change to the pandas documentation and are therefore quite likely to go stale. In addition, for each new tool that comes along we have to review another PR. It is all around less maintenance for us to have a summary paragraph and a link to the project's documentation where they compare to pandas. What is the disadvantage of having a link rather than the documentation living in pandas itself?
@datapythonista what about having pandas here point to the tutorial of each individual project? Basically I support:
We won't have to duplicate the tutorials in the pandas page and in each individual project's page. Apache Spark is sort of doing it in the same way: https://spark.apache.org/third-party-projects.html FWIW, Koalas has been migrated to PySpark as the pandas API on Spark component, so I have to update it in any event.
Sounds great. My idea is that users reaching our page about scaling should know about the relevant options, but without pandas having lots of details in our docs when we can simply link to each project's docs. The exact page to link to I don't know, maybe a dedicated landing page for users coming from pandas, maybe the home of the Koalas documentation, no strong preference from my side. Also worth mentioning that we already do this on the Ecosystem page of our website. Maybe for now I can just copy what's there in a separate PR, and you all can review it or follow up with any change if needed. What do you think?
Signed-off-by: Igoshev, Iaroslav <[email protected]>
@datapythonista, @HyukjinKwon, @rhshadrach, all your thoughts make sense to me now. We should probably move the detailed information about Modin and its joint usage with pandas into Modin's documentation as a single flow. I left three small paragraphs about Modin in this PR. How does it look to you now?
Thanks for the flexibility and for updating this @YarShev. I added a couple of very opinionated comments that in my opinion would make this a bit more concise and to the point. Feel free to disagree if you don't like my comments.
---------

Modin_ is a scalable dataframe library, which aims to be a drop-in replacement API for pandas and
provides the ability to scale pandas workflows across nodes and CPUs available. It is also able
Small thing, but maybe you can add the part about execution engines here, since in the next paragraph you say Modin distributes computation across nodes and CPUs available, which seems to pretty much repeat the same thing again to contextualize that part?
I wasn't able to rephrase the first and second paragraphs following your comment. To me the current state of the paragraphs is pretty concise.
To me, this intro paragraph aims to be a high level description of Modin. With that in mind, getting into execution engines here seems undesirable.
in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of a Modin DataFrame partitions it
along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and
the number of rows.
Not sure if it's just me, but the part about partitioning seems too technical and out of context for this short intro. Feel free to disagree if you think it's key for users to know this asap when considering Modin.
Since this page is not a getting started guide, I think it is fine to say some words here about partitioning, which is flexible along both columns and rows.
I think being both columnar and row-wise is an important feature to mention (since some things only scale well in one axis).
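As a toy illustration of the 2-D partitioning being discussed (NOT Modin's actual internals - just `numpy.array_split` applied along both axes to show why a grid scales in both directions):

```python
import numpy as np

data = np.arange(24).reshape(6, 4)

# Split rows into 2 bands, then split each band into 2 column blocks,
# producing a 2 x 2 grid of partitions.
row_bands = np.array_split(data, 2, axis=0)
blocks = [np.array_split(band, 2, axis=1) for band in row_bands]

# Each block is (3, 2); together they cover the frame with no loss or overlap.
assert blocks[0][0].shape == (3, 2)
assert sum(int(b.sum()) for row in blocks for b in row) == int(data.sum())
```

Row-only or column-only partitioning scales work in just one axis; a grid like this can parallelize both row-wise reductions and column-wise ones.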
doc/source/user_guide/scale.rst
Outdated
the number of rows. | ||
|
||
For more information refer to `Modin's documentation`_ or dip into `Modin's tutorials`_ right away | ||
to start scaling pandas operations with Modin and an execution engine you like. |
Maybe I'm being picky, but since this is in the pandas docs, I'd remove the last part that sounds like the end of a sales pitch. ;)
In my opinion, just: "For more information refer to `Modin's documentation`_ or `Modin's tutorials`_." would feel more appropriate.
+1
Agree with you. Changed.
Signed-off-by: Igoshev, Iaroslav <[email protected]>
Ready for the next round of review. Hope it is fine now 😄
Thanks for the changes @YarShev - this looks good to me pending the current discussions.
@datapythonista, @HyukjinKwon, just a friendly reminder for review.
@@ -374,5 +374,33 @@ datasets.

You see more dask examples at https://examples.dask.org.

Use Modin
Just my two cents. I would just simply remove both Dask and Modin here, and add a link to pandas/web/pandas/community/ecosystem.md (line 344 in a0784d2, the "Out-of-core" section).
A lot of information here is duplicated between Dask, Modin and others too, such as Pandas API on Spark (Koalas). They all aim to be drop-in replacements, they all do out-of-core processing across multiple nodes, partitioning, scalability.
Or, I would at least deduplicate them here if we want to keep those sections. e.g.,

# Using third-party libraries
multiple nodes blah blah..

## Dask
one-liner explanation
Link

## Modin
one-liner explanation
Link

## PySpark: Pandas API on Spark
one-liner explanation
Link
> A lot of information here is duplicated between Dask, Modin and others too, such as Pandas API on Spark (Koalas). They all aim to be drop-in replacements, they all do out-of-core processing across multiple nodes, partitioning, scalability.

I would argue against the statement "They all aim to be drop-in replacements". Modin is supposed to work after changing the import statement `import pandas as pd` to `import modin.pandas as pd` and vice versa, while Dask or Koalas require the code to be rewritten, to the best of my knowledge. Anyway, if everyone thinks that the `Scaling to large datasets` page should only contain pandas-related stuff, I can go ahead and remove the Modin and Dask related stuff from that page and move the changes from this PR to the ecosystem page. Thoughts?
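The import swap being described, sketched with a fallback so the snippet also runs where Modin isn't installed (whether every downstream behavior matches is exactly the point debated above):

```python
# The advertised migration path is a one-line change:
#     import pandas as pd  ->  import modin.pandas as pd
# Hedged here with a plain-pandas fallback for environments without Modin.
try:
    import modin.pandas as pd  # distributed execution behind a pandas-like API
except ImportError:
    import pandas as pd  # plain pandas fallback

# Downstream code is written once against the shared API surface.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
total = int(df["a"].sum())
print(total)  # -> 6
```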
I'm not really a maintainer here and don't have a super strong opinion. I'll defer to pandas maintainers.
I agree with @HyukjinKwon about the one-liner, as I mentioned earlier I find the way things are phrased now too verbose and a bit repetitive considering we want something concise, but no big deal, I'm ok to get this merged as is, and we can iterate further, once we reduce the Dask section and add PySpark/Koalas.
@datapythonista, sounds good! Could you please tag me in the future PRs where you reduce the Dask section and add PySpark/Koalas? I would like Modin to be in line with those libraries' sections.
Thanks @YarShev. Agreed that eventually this should just link to the ecosystem docs, but this is good as is.
…tasets` page (pandas-dev#57586)

* DOC-pandas-dev#57585: Add `Use Modin` section on `Scaling to large datasets` page
* [pre-commit.ci] auto fixes from pre-commit.com hooks - for more information, see https://pre-commit.ci
* Address comments
* Address comments
* Revert some changes
* Address comments

Signed-off-by: Igoshev, Iaroslav <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Marc Garcia <[email protected]>
* Closes #57585: `Use Modin` section on `Scaling to large datasets` page (Replace xxxx with the GitHub issue number)
* Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.