DOC-#57585: Add `Use Modin` section on `Scaling to large datasets` page (#57586)
Conversation
Force-pushed 4f981cb to 4580197

…tasets` page Signed-off-by: Igoshev, Iaroslav <[email protected]>
pre-commit.ci autofix - for more information, see https://pre-commit.ci
Thanks for the PR!
doc/source/user_guide/scale.rst (Outdated)
array = np.random.randint(low=0.1, high=1.0, size=(2 ** 20, 2 ** 8))
filename = "example.csv"
np.savetxt(filename, array, delimiter=",")

%time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)])
CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s
Wall time: 52.5 s
%time pandas_df = pandas_df.map(lambda x: x + 0.01)
CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s
Wall time: 56.5 s
I'm guessing you mean for this to be a static code block - use `.. code-block:: ipython` instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.
In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to `pandas_df = pandas_df + 0.01`.
> I'm guessing you mean for this to be a static code block - use `.. code-block:: ipython` instead. Otherwise this will be run when building the docs, leaving behind a large file and adding a significant amount of build time.

Applied `.. code-block:: ipython`. Thanks.
> In addition, I think comparing to an ill-performant pandas version is misleading. A proper comparison here is to `pandas_df = pandas_df + 0.01`.

I understand that the binary operation is much more efficient than applying a function with `map`. Since the user may apply any function on the data, I wanted to show how it performs with pandas and Modin. Do you mind if we leave this as is?
> Do you mind if we leave this as is?

It seems to me that a reader could easily come away from the existing example thinking "Modin can speed up simple arithmetic operations by 25x over pandas" - a statement which is not true. In fact, I'm seeing 0.26 seconds when doing `pandas_df + 0.01` on my machine. Because of this - I do mind.
I see a few ways forward. I think you are trying to convey the benefit of using arbitrary Python functions with Modin - this could be stated up front explicitly and a note at the end benchmarking the more performant pandas version. Alternatively, you could use a Python function for which there is no simple pandas equivalent.
Grabbed an example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html to use a custom function for applying on the data. Hope it is okay for now.
doc/source/user_guide/scale.rst (Outdated)
Even though Modin aims to speed up each single pandas operation, there are the cases when pandas outperforms.
It might be a case if the data size is relatively small or Modin hasn't implemented yet a certain operation
in an efficient way. Also, for-loops is an antipattern for Modin since Modin has initially been designed to efficiently handle
heavy tasks, rather a small number of small ones. Yet, Modin is actively working on eliminating all these drawbacks.
"rather than a number of small ones", I think?
Good catch! I meant "rather a big number of small ones". Fixed.
nit: "big number" sounded off to me for some reason. A couple of quick searches I'm seeing very consistent recommendations to use "large number" instead, e.g. Cambridge Dictionary.
Fixed, thanks!
doc/source/user_guide/scale.rst (Outdated)
You could also rewrite this code a bit to get the same result with much less execution time.

.. ipython:: python

   %%time
   modin_subset = modin_df.iloc[:100000]
   modin_subset = modin_subset / modin_subset.sum(axis=0)

CPU times: user 531 ms, sys: 134 ms, total: 666 ms
Wall time: 374 ms
This looks like it could be misleading, but is in fact not. It seems to me a reader might think that the same change to the pandas code would also decrease its execution time, but pandas remains roughly the same. I think it'd be clearer to add timing for the pandas equivalent here too:

pandas_subset = pandas_df.iloc[:100000]
pandas_subset = pandas_subset / pandas_subset.sum()
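As a self-contained sketch of that pandas-side timing (synthetic data and hypothetical column names standing in for the PR's CSV-loaded frame):

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the frame read from CSV in the PR.
rng = np.random.default_rng(0)
pandas_df = pd.DataFrame(
    rng.random((200_000, 8)), columns=[f"col{i}" for i in range(8)]
)

start = time.perf_counter()
pandas_subset = pandas_df.iloc[:100000]
pandas_subset = pandas_subset / pandas_subset.sum()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"pandas subset normalization: {elapsed_ms:.1f} ms")

# Dividing by the column sums normalizes each column to sum to 1.
assert np.allclose(pandas_subset.sum(), 1.0)
```

Absolute timings will of course vary with hardware; the point of the review comment is only that pandas stays roughly the same before and after the rewrite.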
Added, thanks!
I'm not seeing the same thing on my machine.
L438: You report 293ms, I get 52ms
L462: You report 72.2ms, I get 69ms
Can you check your timings again? If they hold up, what version of pandas/NumPy are you running?
The timings may differ between my machine and yours. I have an Intel(R) Xeon(R) Platinum 8468. Reran again and updated the timings. I have these pandas and numpy versions installed:
pandas 2.2.0
numpy 1.26.0
Signed-off-by: Igoshev, Iaroslav <[email protected]>
@rhshadrach, could you shed some light on why …
doc/source/user_guide/scale.rst (Outdated)

Use Modin
---------

Modin_ is a scalable dataframe library, which has a drop-in replacement API for pandas and
"has a drop-in" -> "aims to be a drop-in" would be more accurate
@jbrockmendel, could you please elaborate on what gave you this opinion? The user can proceed using pandas API after replacing the import statement.
In my experience working on modin, lots of behaviors didn't quite match. Recall I presented a ModinExtensionArray and about half of the tests could not be made to pass because of these mismatches.
I see, thanks! Yes, "aims to be a drop-in" would be more accurate. Fixed.
L474: "documentation" is spelled incorrectly.
Signed-off-by: Igoshev, Iaroslav <[email protected]>
Oh, I see, thanks! Fixed.
I see the Dask section was added 5 years ago, when there were probably no other alternatives. I think it's surely valuable for users to know about other tools to distribute work with a pandas API, but I don't think our user guide is the place to have, and have to maintain, a short tutorial of Dask, Modin, PySpark/Koalas and any other system that provides a distributed pandas-like API. What do you think about having in each project's documentation a page for pandas users that we can simply link here? @phofl @YarShev @HyukjinKwon
While other projects can have a documentation page comparing their tool with pandas as to which library to use and when, I think mentioning other alternatives to pandas in the pandas documentation itself is valuable to users, since pandas is much more popular than the alternatives, and many more people will be able to see concrete options for efficient data processing.
I don't disagree with you, and I think it's valuable for users to know the other packages in the ecosystem that scale what pandas can't handle. My point is that in my opinion the pandas codebase is not where so much information about Dask or Modin should live. I think we should have a paragraph so users know about the tools and what to expect from them, and all the details could be in a landing page for pandas users that you maintain in your docs and keep updated, and that we simply link here next to the summary of what each tool does. Just my personal opinion, but this should be just as good for the users, and easier to maintain and keep updated.
I tried to make the changes as concise as possible, giving some pros and cons of Modin and when it should be used. I don't see anything in the changes that could be hard to maintain in the future. If you have any improvements we could introduce in the paragraph, please let me know. It would also be great to hear opinions from other reviewers. cc @rhshadrach, @jbrockmendel
Building on what @datapythonista said - I don't see much advantage in having the documentation as it is in this PR vs just the intro paragraph ending with a line like
Changes to any of the third party packages currently listed would necessitate a change to the pandas documentation and are therefore quite likely to go stale. In addition, for each new tool that comes along we have to review another PR. It is all around less maintenance for us to have a summary paragraph and a link to the project's documentation where they compare to pandas. What is the disadvantage of having a link rather than the documentation living in pandas itself?
@datapythonista what about having pandas here point to the tutorial of each individual project? Basically I support:
We won't have to duplicate the tutorials in the pandas page and in each individual project's page. Apache Spark is sort of doing it in the same way: https://spark.apache.org/third-party-projects.html FWIW, Koalas has been migrated to PySpark as the pandas API on Spark component, so I have to update it in any event.
Sounds great. My idea is that users reaching our page about scaling should know about the relevant options, but without pandas having lots of details in our docs when we can simply link to each project's docs. The exact page to link to I don't know, maybe a dedicated landing page for users coming from pandas, maybe the home of the Koalas documentation, no strong preference from my side. Also worth mentioning that we already do this on the Ecosystem page of our website. Maybe for now I can just copy what's there in a separate PR, and you all can review it or follow up with any change if needed. What do you think?
Signed-off-by: Igoshev, Iaroslav <[email protected]>
@datapythonista, @HyukjinKwon, @rhshadrach, all your thoughts make sense to me now. We should probably move the detailed information about Modin and its joint usage with pandas into Modin's documentation as a single flow. I left three small paragraphs about Modin in this PR. How does it look to you now?
Thanks for the flexibility and for updating this @YarShev. I added a couple of very opinionated comments that in my opinion would make this a bit more concise and to the point. Feel free to disagree if you don't like my comments.
---------

Modin_ is a scalable dataframe library, which aims to be a drop-in replacement API for pandas and
provides the ability to scale pandas workflows across nodes and CPUs available. It is also able
Small thing, but maybe you can add the part about execution engines here, since in the next paragraph you say Modin distributes computation across nodes and CPUs available, which seems to pretty much repeat the same thing again to contextualize that part?
I wasn't able to rephrase the first and second paragraphs following your comment. To me the current state of the paragraphs is pretty concise.
To me, this intro paragraph aims to be a high level description of Modin. With that in mind, getting into execution engines here seems undesirable.
in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of a Modin DataFrame partitions it
along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and
the number of rows.
Not sure if it's just me, but the part about partitioning seems too technical and out of context for this short intro. Feel free to disagree if you think it's key for users to know this asap when considering Modin.
Since this page is not a getting started guide, I think it is fine to say some words here about partitioning, which is flexible along both columns and rows.
I think being both columnar and row-wise is an important feature to mention (since some things only scale well in one axis).
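As a toy illustration of the 2-D partitioning being discussed (NOT Modin's actual internals - just `numpy.array_split` applied along both axes to show why a grid scales in both directions):

```python
import numpy as np

data = np.arange(24).reshape(6, 4)

# Split rows into 2 bands, then split each band into 2 column blocks,
# producing a 2 x 2 grid of partitions.
row_bands = np.array_split(data, 2, axis=0)
blocks = [np.array_split(band, 2, axis=1) for band in row_bands]

# Each block is (3, 2); together they cover the frame with no loss or overlap.
assert blocks[0][0].shape == (3, 2)
assert sum(int(b.sum()) for row in blocks for b in row) == int(data.sum())
```

Row-only or column-only partitioning scales work in just one axis; a grid like this can parallelize both row-wise reductions and column-wise ones.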
doc/source/user_guide/scale.rst
Outdated
the number of rows. | ||
|
||
For more information refer to `Modin's documentation`_ or dip into `Modin's tutorials`_ right away | ||
to start scaling pandas operations with Modin and an execution engine you like. |
Maybe I'm being picky, but since this is in the pandas docs, I'd remove the last part that sounds like the end of a sales pitch. ;)
In my opinion, just: "For more information refer to `Modin's documentation`_ or `Modin's tutorials`_." would feel more appropriate.
+1
Agree with you. Changed.
Signed-off-by: Igoshev, Iaroslav <[email protected]>
Ready for the next round of review. Hope it is fine now 😄
Thanks for the changes @YarShev - this looks good to me pending the current discussions.
@datapythonista, @HyukjinKwon, just a friendly reminder for review.
@@ -374,5 +374,33 @@ datasets.

You see more dask examples at https://examples.dask.org.

Use Modin
Just my two cents. I would just simply remove both Dask and Modin here, and add a link to pandas/web/pandas/community/ecosystem.md (line 344 in a0784d2, the "Out-of-core" section).
A lot of information here is duplicated between Dask, Modin and others too, such as Pandas API on Spark (Koalas). They all aim to be drop-in replacements, they all do out-of-core processing across multiple nodes, partitioning, scalability.
Or, I would at least deduplicate them here if we want to keep those sections. e.g.,

# Using third-party libraries
multiple nodes blah blah..

## Dask
one-liner explanation
Link

## Modin
one-liner explanation
Link

## PySpark: Pandas API on Spark
one-liner explanation
Link
> A lot of information here is duplicated between Dask, Modin and others too, such as Pandas API on Spark (Koalas). They all aim to be drop-in replacements, they all do out-of-core processing across multiple nodes, partitioning, scalability.

I would argue against the statement "They all aim to be drop-in replacements". Modin is supposed to work after changing the import statement `import pandas as pd` to `import modin.pandas as pd` and vice versa, while Dask or Koalas require the code to be rewritten, to the best of my knowledge. Anyway, if everyone thinks that the `Scaling to large datasets` page should only contain pandas-related stuff, I can go ahead and remove the Modin and Dask related stuff from that page and move the changes from this PR to the ecosystem page. Thoughts?
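The import swap being described, sketched with a fallback so the snippet also runs where Modin isn't installed (whether every downstream behavior matches is exactly the point debated above):

```python
# The advertised migration path is a one-line change:
#     import pandas as pd  ->  import modin.pandas as pd
# Hedged here with a plain-pandas fallback for environments without Modin.
try:
    import modin.pandas as pd  # distributed execution behind a pandas-like API
except ImportError:
    import pandas as pd  # plain pandas fallback

# Downstream code is written once against the shared API surface.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
total = int(df["a"].sum())
print(total)  # -> 6
```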
I'm not really a maintainer here and don't have a super strong opinion. I'll defer to pandas maintainers.
I agree with @HyukjinKwon about the one-liner, as I mentioned earlier I find the way things are phrased now too verbose and a bit repetitive considering we want something concise, but no big deal, I'm ok to get this merged as is, and we can iterate further, once we reduce the Dask section and add PySpark/Koalas.
@datapythonista, sounds good! Could you please tag me in the future PRs where you reduce the Dask section and add PySpark/Koalas? I would like Modin to be in line with those libraries' sections.
Thanks @YarShev. Agreed that eventually this should just link to the ecosystem docs, but this is good as is.
…tasets` page (pandas-dev#57586)

* DOC-pandas-dev#57585: Add `Use Modin` section on `Scaling to large datasets` page
* [pre-commit.ci] auto fixes from pre-commit.com hooks - for more information, see https://pre-commit.ci
* Address comments
* Address comments
* Revert some changes
* Address comments

Signed-off-by: Igoshev, Iaroslav <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Marc Garcia <[email protected]>
* Closes #57585: `Use Modin` section on `Scaling to large datasets` page (Replace xxxx with the GitHub issue number)
* Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.