From 45801970e88ed07aea70213cc752d7e96d4cf632 Mon Sep 17 00:00:00 2001 From: "Igoshev, Iaroslav" Date: Fri, 23 Feb 2024 14:50:07 +0000 Subject: [PATCH 1/6] DOC-#57585: Add `Use Modin` section on `Scaling to large datasets` page Signed-off-by: Igoshev, Iaroslav --- doc/source/user_guide/scale.rst | 96 +++++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index b262de5d71439..51a08437572c4 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -374,5 +374,101 @@ datasets. You see more dask examples at https://examples.dask.org. +Use Modin +--------- + +Modin_ is a scalable dataframe library, which has a drop-in replacement API for pandas and +provides the ability to scale pandas workflows across nodes and CPUs available and +to work with larger than memory datasets. To start working with Modin you just need +to replace a single line of code, namely, the import statement. + +.. ipython:: python + + # import pandas as pd + import modin.pandas as pd + +After you have changed the import statement, you can proceed using the well-known pandas API +to scale computation. Modin distributes computation across nodes and CPUs available utilizing +an execution engine it runs on. At the time of Modin 0.27.0 the following execution engines are supported +in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of a Modin DataFrame partitions it +along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and +the number of rows. Let's take a look at how we can read the data from a CSV file with Modin the same way as with pandas +and perform a simple operation on the data. + +.. 
ipython:: python + + import pandas + import modin.pandas as pd + import numpy as np + + array = np.random.randint(low=0.1, high=1.0, size=(2 ** 20, 2 ** 8)) + filename = "example.csv" + np.savetxt(filename, array, delimiter=",") + + %time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)]) + CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s + Wall time: 52.5 s + %time pandas_df = pandas_df.map(lambda x: x + 0.01) + CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s + Wall time: 56.5 s + + %time modin_df = pd.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)]) + CPU times: user 9.49 s, sys: 2.72 s, total: 12.2 s + Wall time: 17.5 s + %time modin_df = modin_df.map(lambda x: x + 0.01) + CPU times: user 5.74 s, sys: 1e+03 ms, total: 6.74 s + Wall time: 2.54 s + +We can see that Modin has been able to perform the operations much faster than pandas due to distributing execution. +Even though Modin aims to speed up each single pandas operation, there are the cases when pandas outperforms. +It might be a case if the data size is relatively small or Modin hasn't implemented yet a certain operation +in an efficient way. Also, for-loops is an antipattern for Modin since Modin has initially been designed to efficiently handle +heavy tasks, rather a small number of small ones. Yet, Modin is actively working on eliminating all these drawbacks. +What you can do for now in such a case is to use pandas for the cases where it is more beneficial than Modin. + +.. 
ipython:: python + + from modin.pandas.io import to_pandas + + %%time + pandas_subset = pandas_df.iloc[:100000] + for col in pandas_subset.columns: + pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum() + CPU times: user 210 ms, sys: 84.4 ms, total: 294 ms + Wall time: 293 ms + + %%time + modin_subset = modin_df.iloc[:100000] + for col in modin_subset.columns: + modin_subset[col] = modin_subset[col] / modin_subset[col].sum() + CPU times: user 18.2 s, sys: 2.35 s, total: 20.5 s + Wall time: 20.9 s + + %%time + pandas_subset = to_pandas(modin_df.iloc[:100000]) + for col in pandas_subset.columns: + pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum() + CPU times: user 566 ms, sys: 279 ms, total: 845 ms + Wall time: 731 ms + +You could also rewrite this code a bit to get the same result with much less execution time. + +.. ipython:: python + + %%time + modin_subset = modin_df.iloc[:100000] + modin_subset = modin_subset / modin_subset.sum(axis=0) + CPU times: user 531 ms, sys: 134 ms, total: 666 ms + Wall time: 374 ms + +For more information refer to `Modin's documentation`_ or dip into `Modin's tutorials`_ right away +to start scaling pandas operations with Modin and an execution engine you like. + +.. _Modin: https://github.com/modin-project/modin +.. _`Modin's documetation`: https://modin.readthedocs.io/en/latest +.. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution +.. _Ray: https://github.com/ray-project/ray .. _Dask: https://dask.org +.. _`MPI through unidist`: https://github.com/modin-project/unidist +.. _HDK: https://github.com/intel-ai/hdk .. 
_dask.dataframe: https://docs.dask.org/en/latest/dataframe.html From be9f1e7229729b93d6540a7a3d2b8fcc23ecd8ac Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Fri, 23 Feb 2024 15:12:50 +0000 Subject: [PATCH 2/6] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- doc/source/user_guide/scale.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index 51a08437572c4..2b48ebe807cb0 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -411,7 +411,7 @@ and perform a simple operation on the data. %time pandas_df = pandas_df.map(lambda x: x + 0.01) CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s Wall time: 56.5 s - + %time modin_df = pd.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)]) CPU times: user 9.49 s, sys: 2.72 s, total: 12.2 s Wall time: 17.5 s From f0d6d07ede6142d1b3a0972401ac4303a09592e6 Mon Sep 17 00:00:00 2001 From: "Igoshev, Iaroslav" Date: Mon, 26 Feb 2024 13:56:12 +0000 Subject: [PATCH 3/6] Address comments Signed-off-by: Igoshev, Iaroslav --- doc/source/user_guide/scale.rst | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index 2b48ebe807cb0..cedf7229924ef 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -378,11 +378,11 @@ Use Modin --------- Modin_ is a scalable dataframe library, which has a drop-in replacement API for pandas and -provides the ability to scale pandas workflows across nodes and CPUs available and +provides the ability to scale pandas workflows across nodes and CPUs available. It is also able to work with larger than memory datasets. To start working with Modin you just need to replace a single line of code, namely, the import statement. -.. ipython:: python +.. 
code-block:: ipython # import pandas as pd import modin.pandas as pd @@ -395,7 +395,7 @@ along both columns and rows because it gives Modin flexibility and scalability i the number of rows. Let's take a look at how we can read the data from a CSV file with Modin the same way as with pandas and perform a simple operation on the data. -.. ipython:: python +.. code-block:: ipython import pandas import modin.pandas as pd @@ -423,10 +423,10 @@ We can see that Modin has been able to perform the operations much faster than p Even though Modin aims to speed up each single pandas operation, there are the cases when pandas outperforms. It might be a case if the data size is relatively small or Modin hasn't implemented yet a certain operation in an efficient way. Also, for-loops is an antipattern for Modin since Modin has initially been designed to efficiently handle -heavy tasks, rather a small number of small ones. Yet, Modin is actively working on eliminating all these drawbacks. +heavy tasks rather than a big number of small ones. Yet, Modin is actively working on eliminating all these drawbacks. What you can do for now in such a case is to use pandas for the cases where it is more beneficial than Modin. -.. ipython:: python +.. code-block:: ipython from modin.pandas.io import to_pandas @@ -453,7 +453,13 @@ What you can do for now in such a case is to use pandas for the cases where it i You could also rewrite this code a bit to get the same result with much less execution time. -.. ipython:: python +.. 
code-block:: ipython + + %%time + pandas_subset = pandas_df.iloc[:100000] + pandas_subset = pandas_subset / pandas_subset.sum(axis=0) + CPU times: user 105 ms, sys: 97.5 ms, total: 202 ms + Wall time: 72.2 ms %%time modin_subset = modin_df.iloc[:100000] From a6521ca7468597f3ff0b800c33d6a948eb34e5b7 Mon Sep 17 00:00:00 2001 From: "Igoshev, Iaroslav" Date: Tue, 27 Feb 2024 10:49:05 +0000 Subject: [PATCH 4/6] Address comments Signed-off-by: Igoshev, Iaroslav --- doc/source/user_guide/scale.rst | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index cedf7229924ef..f7ca87ad9b8fe 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -377,7 +377,7 @@ You see more dask examples at https://examples.dask.org. Use Modin --------- -Modin_ is a scalable dataframe library, which has a drop-in replacement API for pandas and +Modin_ is a scalable dataframe library, which aims to be a drop-in replacement API for pandas and provides the ability to scale pandas workflows across nodes and CPUs available. It is also able to work with larger than memory datasets. To start working with Modin you just need to replace a single line of code, namely, the import statement. @@ -393,7 +393,7 @@ an execution engine it runs on. At the time of Modin 0.27.0 the following execut in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of a Modin DataFrame partitions it along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and the number of rows. Let's take a look at how we can read the data from a CSV file with Modin the same way as with pandas -and perform a simple operation on the data. +and perform a custom operation on the data. .. code-block:: ipython @@ -408,22 +408,22 @@ and perform a simple operation on the data. 
%time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)]) CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s Wall time: 52.5 s - %time pandas_df = pandas_df.map(lambda x: x + 0.01) - CPU times: user 48.7 s, sys: 7.8 s, total: 56.5 s - Wall time: 56.5 s + %time pandas_df_map = pandas_df.map(lambda x: len(str(x))) + CPU times: user 1min 46s, sys: 3.73 s, total: 1min 50s + Wall time: 1min 50s %time modin_df = pd.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)]) CPU times: user 9.49 s, sys: 2.72 s, total: 12.2 s Wall time: 17.5 s - %time modin_df = modin_df.map(lambda x: x + 0.01) - CPU times: user 5.74 s, sys: 1e+03 ms, total: 6.74 s - Wall time: 2.54 s + %time modin_df_map = modin_df.map(lambda x: len(str(x))) + CPU times: user 7.45 s, sys: 1.2 s, total: 8.64 s + Wall time: 3.38 s We can see that Modin has been able to perform the operations much faster than pandas due to distributing execution. Even though Modin aims to speed up each single pandas operation, there are the cases when pandas outperforms. It might be a case if the data size is relatively small or Modin hasn't implemented yet a certain operation in an efficient way. Also, for-loops is an antipattern for Modin since Modin has initially been designed to efficiently handle -heavy tasks rather than a big number of small ones. Yet, Modin is actively working on eliminating all these drawbacks. +heavy tasks rather than a large number of small ones. Yet, Modin is actively working on eliminating all these drawbacks. What you can do for now in such a case is to use pandas for the cases where it is more beneficial than Modin. .. 
code-block:: ipython @@ -434,8 +434,8 @@ What you can do for now in such a case is to use pandas for the cases where it i pandas_subset = pandas_df.iloc[:100000] for col in pandas_subset.columns: pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum() - CPU times: user 210 ms, sys: 84.4 ms, total: 294 ms - Wall time: 293 ms + CPU times: user 189 ms, sys: 76.4 ms, total: 266 ms + Wall time: 264 ms %%time modin_subset = modin_df.iloc[:100000] @@ -458,8 +458,8 @@ You could also rewrite this code a bit to get the same result with much less exe %%time pandas_subset = pandas_df.iloc[:100000] pandas_subset = pandas_subset / pandas_subset.sum(axis=0) - CPU times: user 105 ms, sys: 97.5 ms, total: 202 ms - Wall time: 72.2 ms + CPU times: user 104 ms, sys: 3.42 ms, total: 107 ms + Wall time: 58.9 ms %%time modin_subset = modin_df.iloc[:100000] @@ -471,7 +471,7 @@ For more information refer to `Modin's documentation`_ or dip into `Modin's tuto to start scaling pandas operations with Modin and an execution engine you like. .. _Modin: https://github.com/modin-project/modin -.. _`Modin's documetation`: https://modin.readthedocs.io/en/latest +.. _`Modin's documentation`: https://modin.readthedocs.io/en/latest .. _`Modin's tutorials`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution .. _Ray: https://github.com/ray-project/ray .. _Dask: https://dask.org From 0e45ac461baa20cc5e7702194520372476c80c87 Mon Sep 17 00:00:00 2001 From: "Igoshev, Iaroslav" Date: Thu, 29 Feb 2024 13:16:30 +0000 Subject: [PATCH 5/6] Revert some changes Signed-off-by: Igoshev, Iaroslav --- doc/source/user_guide/scale.rst | 75 +-------------------------------- 1 file changed, 1 insertion(+), 74 deletions(-) diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index f7ca87ad9b8fe..c9ba6aa7e850c 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -392,80 +392,7 @@ to scale computation. 
Modin distributes computation across nodes and CPUs availa an execution engine it runs on. At the time of Modin 0.27.0 the following execution engines are supported in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of a Modin DataFrame partitions it along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and -the number of rows. Let's take a look at how we can read the data from a CSV file with Modin the same way as with pandas -and perform a custom operation on the data. - -.. code-block:: ipython - - import pandas - import modin.pandas as pd - import numpy as np - - array = np.random.randint(low=0.1, high=1.0, size=(2 ** 20, 2 ** 8)) - filename = "example.csv" - np.savetxt(filename, array, delimiter=",") - - %time pandas_df = pandas.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)]) - CPU times: user 48.3 s, sys: 4.23 s, total: 52.5 s - Wall time: 52.5 s - %time pandas_df_map = pandas_df.map(lambda x: len(str(x))) - CPU times: user 1min 46s, sys: 3.73 s, total: 1min 50s - Wall time: 1min 50s - - %time modin_df = pd.read_csv(filename, names=[f"col{i}" for i in range(2 ** 8)]) - CPU times: user 9.49 s, sys: 2.72 s, total: 12.2 s - Wall time: 17.5 s - %time modin_df_map = modin_df.map(lambda x: len(str(x))) - CPU times: user 7.45 s, sys: 1.2 s, total: 8.64 s - Wall time: 3.38 s - -We can see that Modin has been able to perform the operations much faster than pandas due to distributing execution. -Even though Modin aims to speed up each single pandas operation, there are the cases when pandas outperforms. -It might be a case if the data size is relatively small or Modin hasn't implemented yet a certain operation -in an efficient way. Also, for-loops is an antipattern for Modin since Modin has initially been designed to efficiently handle -heavy tasks rather than a large number of small ones. Yet, Modin is actively working on eliminating all these drawbacks. 
-What you can do for now in such a case is to use pandas for the cases where it is more beneficial than Modin. - -.. code-block:: ipython - - from modin.pandas.io import to_pandas - - %%time - pandas_subset = pandas_df.iloc[:100000] - for col in pandas_subset.columns: - pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum() - CPU times: user 189 ms, sys: 76.4 ms, total: 266 ms - Wall time: 264 ms - - %%time - modin_subset = modin_df.iloc[:100000] - for col in modin_subset.columns: - modin_subset[col] = modin_subset[col] / modin_subset[col].sum() - CPU times: user 18.2 s, sys: 2.35 s, total: 20.5 s - Wall time: 20.9 s - - %%time - pandas_subset = to_pandas(modin_df.iloc[:100000]) - for col in pandas_subset.columns: - pandas_subset[col] = pandas_subset[col] / pandas_subset[col].sum() - CPU times: user 566 ms, sys: 279 ms, total: 845 ms - Wall time: 731 ms - -You could also rewrite this code a bit to get the same result with much less execution time. - -.. code-block:: ipython - - %%time - pandas_subset = pandas_df.iloc[:100000] - pandas_subset = pandas_subset / pandas_subset.sum(axis=0) - CPU times: user 104 ms, sys: 3.42 ms, total: 107 ms - Wall time: 58.9 ms - - %%time - modin_subset = modin_df.iloc[:100000] - modin_subset = modin_subset / modin_subset.sum(axis=0) - CPU times: user 531 ms, sys: 134 ms, total: 666 ms - Wall time: 374 ms +the number of rows. For more information refer to `Modin's documentation`_ or dip into `Modin's tutorials`_ right away to start scaling pandas operations with Modin and an execution engine you like. 
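Patch 5 above reverts the benchmark snippets, but the rewrite they illustrated, replacing a per-column Python loop with a single vectorized expression, is the key takeaway. Below is a pandas-only sketch of that comparison (Modin's `modin.pandas` mirrors this API, so the same code would run under Modin). Note that the reverted snippets generated data with `np.random.randint(low=0.1, high=1.0, ...)`, whose float bounds are truncated to integers; a `uniform` draw is presumably what was intended, and is used here.

```python
import numpy as np
import pandas as pd  # with Modin installed this would be: import modin.pandas as pd

# Small stand-in for the patches' 2**20 x 2**8 CSV benchmark data.
# uniform() draws real floats in [0.1, 1.0); randint() with these bounds
# would truncate them to 0 and 1 and yield an all-zero array.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.uniform(0.1, 1.0, size=(1_000, 8)),
                  columns=[f"col{i}" for i in range(8)])

# Column-by-column loop: many small operations, an antipattern for Modin.
looped = df.copy()
for col in looped.columns:
    looped[col] = looped[col] / looped[col].sum()

# Single vectorized expression: one operation an engine can distribute.
vectorized = df / df.sum(axis=0)

# Both normalizations produce columns that each sum to 1.
assert np.allclose(looped.to_numpy(), vectorized.to_numpy())
```

Under Modin the vectorized form is one distributed operation rather than one small task per column, which is consistent with the large speedup the reverted timings report for it.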
From f37354ac78d5dd6a5bff113f30c042e0a072513c Mon Sep 17 00:00:00 2001
From: "Igoshev, Iaroslav"
Date: Fri, 1 Mar 2024 09:21:05 +0000
Subject: [PATCH 6/6] Address comments

Signed-off-by: Igoshev, Iaroslav
---
 doc/source/user_guide/scale.rst | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst
index c9ba6aa7e850c..080f8484ce969 100644
--- a/doc/source/user_guide/scale.rst
+++ b/doc/source/user_guide/scale.rst
@@ -394,8 +394,7 @@ in Modin: Ray_, Dask_, `MPI through unidist`_, HDK_. The partitioning schema of
 along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and
 the number of rows.

-For more information refer to `Modin's documentation`_ or dip into `Modin's tutorials`_ right away
-to start scaling pandas operations with Modin and an execution engine you like.
+For more information refer to `Modin's documentation`_ or `Modin's tutorials`_.

 .. _Modin: https://github.com/modin-project/modin
 .. _`Modin's documentation`: https://modin.readthedocs.io/en/latest