From 59ecbfbf2833e9c769fba1aa08ed99ba7ecc29c4 Mon Sep 17 00:00:00 2001
From: Leif Walsh
Date: Fri, 26 Feb 2016 15:55:44 -0500
Subject: [PATCH 1/4] DOC: expanding comparison with R section

This is the beginning of a quick reference section. It's incomplete,
just did a rough translation of
http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/TomAugspurger/6e052140eaa5fdb6e8c0/raw/811585624e843f3f80b9b6fe89e18119d7d2d73c/dplyr_pandas.ipynb
into tables. Should try to get some R experts to comment, and it would
be nice to have the pandas versions link to docs for the functions
being used, but I'm terrible at reStructuredText and gave up for the
moment.
---
 doc/source/comparison_with_r.rst | 73 ++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index 0841f3354d160..a104cabae4ba7 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -31,6 +31,79 @@
 For transfer of ``DataFrame`` objects from ``pandas`` to R, one option is to
 use HDF5 files, see :ref:`io.external_compatibility` for an
 example.
+
+Quick Reference
+---------------
+
+We'll start off with a quick reference guide pairing some common R
+operations using `dplyr
+`__ with
+pandas equivalents.
+
+
+Querying, Filtering, Sampling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+===========================================  ===========================================
+R                                            pandas
+===========================================  ===========================================
+``dim(df)``                                  ``df.shape``
+``head(df)``                                 ``df.head()``
+``slice(df, 1:10)``                          ``df.iloc[:9]``
+``filter(df, col1 == 1, col2 == 1)``         ``df.query('col1 == 1 & col2 == 1')``
+``df[df$col1 == 1 & df$col2 == 1,]``         ``df[(df.col1 == 1) & (df.col2 == 1)]``
+``select(df, col1, col2)``                   ``df[['col1', 'col2']]``
+``select(df, col1:col3)``                    No one-line equivalent, but see [#select_range]_
+``select(df, -(col1:col3))``                 ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
+``distinct(select(df, col1))``               ``df.col1.unique()``
+``distinct(select(df, col1, col2))``         ``df[['col1', 'col2']].drop_duplicates()``
+``sample_n(df, 10)``                         ``df.loc[np.random.choice(df.index, 10)]``
+``sample_frac(df, 0.01)``                    ``df.iloc[np.random.randint(0, len(df), .01 * len(flights))]``
+===========================================  ===========================================
+
+.. [#select_range] R's shorthand for a subrange of columns
+                   (``select(df, col1:col3)``) can be approached
+                   cleanly in pandas, if you have the list of columns,
+                   for example ``df[cols[1:3]]`` or
+                   ``df.drop(cols[1:3])``, but doing this by column
+                   name is a bit messy.
+
+
+Sorting
+~~~~~~~
+
+===========================================  ===========================================
+R                                            pandas
+===========================================  ===========================================
+``arrange(df, col1, col2)``                  ``df.sort(['col1', 'col2'])``
+``arrange(df, desc(col1))``                  ``df.sort('col1', ascending=False)``
+===========================================  ===========================================
+
+Transforming
+~~~~~~~~~~~~
+
+===========================================  ===========================================
+R                                            pandas
+===========================================  ===========================================
+``select(df, col_one = col1)``               ``df.rename(columns={'col1': 'col_one'})['col_one']``
+``rename(df, col_one = col1)``               ``df.rename(columns={'col1': 'col_one'})``
+``mutate(df, c=a-b)``                        ``df.assign(c=df.a-df.b)``
+===========================================  ===========================================
+
+
+Grouping and Summarizing
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+==============================================  ===========================================
+R                                               pandas
+==============================================  ===========================================
+``summary(df)``                                 ``df.describe()``
+``gdf <- group_by(df, col1)``                   ``gdf = df.groupby('col1')``
+``summarise(gdf, avg=mean(col1, na.rm=TRUE))``  ``df.groupby('col1').agg({'col1': 'mean'})``
+``summarise(gdf, total=sum(col1))``             ``df.groupby('col1').sum()``
+==============================================  ===========================================
+
+
 
 Base R
 ------

From 2e1ed94759cf7340ec1726bcce4311cb76d2d9f9 Mon Sep 17 00:00:00 2001
From: Leif Walsh
Date: Fri, 26 Feb 2016 22:29:42 -0500
Subject: [PATCH 2/4] simplified sample thanks to @jorisvandenbossche

---
 doc/source/comparison_with_r.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index a104cabae4ba7..edcf92ce510f4 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -57,8 +57,8 @@ R pandas
 ``select(df, -(col1:col3))``                 ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
 ``distinct(select(df, col1))``               ``df.col1.unique()``
 ``distinct(select(df, col1, col2))``         ``df[['col1', 'col2']].drop_duplicates()``
-``sample_n(df, 10)``                         ``df.loc[np.random.choice(df.index, 10)]``
-``sample_frac(df, 0.01)``                    ``df.iloc[np.random.randint(0, len(df), .01 * len(flights))]``
+``sample_n(df, 10)``                         ``df.sample(n=10)``
+``sample_frac(df, 0.01)``                    ``df.sample(frac=0.01)``
 ===========================================  ===========================================
 
 .. [#select_range] R's shorthand for a subrange of columns

From f525ae7aba7a85553d2b2a50614261a1a7aa70f4 Mon Sep 17 00:00:00 2001
From: Leif Walsh
Date: Mon, 18 Apr 2016 11:31:58 -0400
Subject: [PATCH 3/4] addressed CR comments

- sort -> sort_values
- unique -> drop_duplicates
---
 doc/source/comparison_with_r.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index edcf92ce510f4..159b63d6c5551 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -55,7 +55,7 @@ R pandas
 ``select(df, col1, col2)``                   ``df[['col1', 'col2']]``
 ``select(df, col1:col3)``                    No one-line equivalent, but see [#select_range]_
 ``select(df, -(col1:col3))``                 ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
-``distinct(select(df, col1))``               ``df.col1.unique()``
+``distinct(select(df, col1))``               ``df[['col1']].drop_duplicates()``
 ``distinct(select(df, col1, col2))``         ``df[['col1', 'col2']].drop_duplicates()``
 ``sample_n(df, 10)``                         ``df.sample(n=10)``
 ``sample_frac(df, 0.01)``                    ``df.sample(frac=0.01)``
@@ -75,8 +75,8 @@ Sorting
 ===========================================  ===========================================
 R                                            pandas
 ===========================================  ===========================================
-``arrange(df, col1, col2)``                  ``df.sort(['col1', 'col2'])``
-``arrange(df, desc(col1))``                  ``df.sort('col1', ascending=False)``
+``arrange(df, col1, col2)``                  ``df.sort_values(['col1', 'col2'])``
+``arrange(df, desc(col1))``                  ``df.sort_values('col1', ascending=False)``
 ===========================================  ===========================================
 
 Transforming

From 808eba1554d07fdfb1c194c29af4aef5a278fd7a Mon Sep 17 00:00:00 2001
From: Leif Walsh
Date: Mon, 18 Apr 2016 11:55:25 -0400
Subject: [PATCH 4/4] added select range of columns

---
 doc/source/comparison_with_r.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index 159b63d6c5551..fad3d034c8d17 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -53,7 +53,7 @@ R pandas
 ``filter(df, col1 == 1, col2 == 1)``         ``df.query('col1 == 1 & col2 == 1')``
 ``df[df$col1 == 1 & df$col2 == 1,]``         ``df[(df.col1 == 1) & (df.col2 == 1)]``
 ``select(df, col1, col2)``                   ``df[['col1', 'col2']]``
-``select(df, col1:col3)``                    No one-line equivalent, but see [#select_range]_
+``select(df, col1:col3)``                    ``df.loc[:, 'col1':'col3']``
 ``select(df, -(col1:col3))``                 ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
 ``distinct(select(df, col1))``               ``df[['col1']].drop_duplicates()``
 ``distinct(select(df, col1, col2))``         ``df[['col1', 'col2']].drop_duplicates()``
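
As a rough sanity check of the pandas column as it stands after PATCH 4/4, the sketch below runs several of the listed calls against a small frame. The data and the ``col1``..``col4`` names are hypothetical placeholders invented for illustration, not part of the patch, and ``random_state`` is added only to make the sample repeatable. One detail it makes visible: ``.iloc`` excludes its stop position while a ``.loc`` label slice includes both ends, so ``df.iloc[:9]`` returns nine rows where ``slice(df, 1:10)`` returns ten.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the placeholders in the tables.
df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 3, 3],
    "col2": [1, 1, 1, 2, 1, 2],
    "col3": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "col4": list("abcdef"),
})

# Querying, filtering, sampling
shape = df.shape                                      # dim(df)
first_three = df.iloc[:3]                             # slice(df, 1:3); iloc's stop is exclusive
filtered = df.query("col1 == 1 & col2 == 1")          # filter(df, col1 == 1, col2 == 1)
masked = df[(df.col1 == 1) & (df.col2 == 1)]          # df[df$col1 == 1 & df$col2 == 1,]
col_range = df.loc[:, "col1":"col3"]                  # select(df, col1:col3); both ends included
cols_to_drop = list(col_range.columns)
without_range = df.drop(cols_to_drop, axis=1)         # select(df, -(col1:col3))
distinct_pairs = df[["col1", "col2"]].drop_duplicates()  # distinct(select(df, col1, col2))
sampled = df.sample(n=2, random_state=0)              # sample_n(df, 2)

# Sorting
by_col1_desc = df.sort_values("col1", ascending=False)   # arrange(df, desc(col1))

# Transforming
renamed = df.rename(columns={"col1": "col_one"})      # rename(df, col_one = col1)
with_diff = df.assign(c=df.col3 - df.col2)            # mutate(df, c = col3 - col2)

# Grouping and summarizing (numeric columns picked explicitly)
means = df.groupby("col1").agg({"col3": "mean"})
totals = df.groupby("col1")[["col2", "col3"]].sum()
```

Each call returns a new object and leaves ``df`` untouched, which matches how the dplyr verbs in the left-hand column behave.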