From 59ecbfbf2833e9c769fba1aa08ed99ba7ecc29c4 Mon Sep 17 00:00:00 2001
From: Leif Walsh <leif@twosigma.com>
Date: Fri, 26 Feb 2016 15:55:44 -0500
Subject: [PATCH 1/4] DOC: expanding comparison with R section

This is the beginning of a quick reference section.  It's incomplete:
so far it is just a rough translation of
http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/TomAugspurger/6e052140eaa5fdb6e8c0/raw/811585624e843f3f80b9b6fe89e18119d7d2d73c/dplyr_pandas.ipynb
into tables.  We should try to get some R experts to comment, and it
would be nice to have the pandas versions link to the docs for the
functions being used, but I'm terrible at reStructuredText and gave up
for the moment.
---
 doc/source/comparison_with_r.rst | 73 ++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index 0841f3354d160..a104cabae4ba7 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -31,6 +31,79 @@ For transfer of ``DataFrame`` objects from ``pandas`` to R, one option is to
 use HDF5 files, see :ref:`io.external_compatibility` for an
 example.
 
+
+Quick Reference
+---------------
+
+We'll start off with a quick reference guide pairing some common R
+operations using `dplyr
+<http://cran.r-project.org/web/packages/dplyr/index.html>`__ with
+pandas equivalents.
+
+
+Querying, Filtering, Sampling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+===========================================  ===========================================
+R                                            pandas
+===========================================  ===========================================
+``dim(df)``                                  ``df.shape``
+``head(df)``                                 ``df.head()``
+``slice(df, 1:10)``                          ``df.iloc[:10]``
+``filter(df, col1 == 1, col2 == 1)``         ``df.query('col1 == 1 & col2 == 1')``
+``df[df$col1 == 1 & df$col2 == 1,]``         ``df[(df.col1 == 1) & (df.col2 == 1)]``
+``select(df, col1, col2)``                   ``df[['col1', 'col2']]``
+``select(df, col1:col3)``                    No one-line equivalent, but see [#select_range]_
+``select(df, -(col1:col3))``                 ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
+``distinct(select(df, col1))``               ``df.col1.unique()``
+``distinct(select(df, col1, col2))``         ``df[['col1', 'col2']].drop_duplicates()``
+``sample_n(df, 10)``                         ``df.loc[np.random.choice(df.index, 10)]``
+``sample_frac(df, 0.01)``                    ``df.iloc[np.random.randint(0, len(df), .01 * len(flights))]``
+===========================================  ===========================================
+
+.. [#select_range] R's shorthand for a subrange of columns
+                   (``select(df, col1:col3)``) can be approached
+                   cleanly in pandas, if you have the list of columns,
+                   for example ``df[cols[1:3]]`` or
+                   ``df.drop(cols[1:3], axis=1)``, but doing this by
+                   column name is a bit messy.
+
+
+Sorting
+~~~~~~~
+
+===========================================  ===========================================
+R                                            pandas
+===========================================  ===========================================
+``arrange(df, col1, col2)``                  ``df.sort(['col1', 'col2'])``
+``arrange(df, desc(col1))``                  ``df.sort('col1', ascending=False)``
+===========================================  ===========================================
+
+Transforming
+~~~~~~~~~~~~
+
+===========================================  ===========================================
+R                                            pandas
+===========================================  ===========================================
+``select(df, col_one = col1)``               ``df.rename(columns={'col1': 'col_one'})['col_one']``
+``rename(df, col_one = col1)``               ``df.rename(columns={'col1': 'col_one'})``
+``mutate(df, c=a-b)``                        ``df.assign(c=df.a-df.b)``
+===========================================  ===========================================
+
+
+Grouping and Summarizing
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+==============================================  ===========================================
+R                                               pandas
+==============================================  ===========================================
+``summary(df)``                                 ``df.describe()``
+``gdf <- group_by(df, col1)``                   ``gdf = df.groupby('col1')``
+``summarise(gdf, avg=mean(col1, na.rm=TRUE))``  ``df.groupby('col1').agg({'col1': 'mean'})``
+``summarise(gdf, total=sum(col1))``             ``df.groupby('col1').sum()``
+==============================================  ===========================================
+
+
 Base R
 ------
 

From 2e1ed94759cf7340ec1726bcce4311cb76d2d9f9 Mon Sep 17 00:00:00 2001
From: Leif Walsh <leif.walsh@gmail.com>
Date: Fri, 26 Feb 2016 22:29:42 -0500
Subject: [PATCH 2/4] simplified sample thanks to @jorisvandenbossche

---
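Reviewer note (not part of the commit): a minimal sketch of what the two
new table entries do, assuming a toy frame whose column names and sizes are
made up for illustration. `random_state` is only there to make the run
repeatable; it is not in the table entries.

```python
import pandas as pd

# Toy frame, invented for this note only.
df = pd.DataFrame({"col1": range(200), "col2": range(200, 400)})

# Rough analogue of sample_n(df, 10): ten rows, drawn without replacement.
ten_rows = df.sample(n=10, random_state=0)

# Rough analogue of sample_frac(df, 0.01): 1% of the rows.
one_pct = df.sample(frac=0.01, random_state=0)

print(len(ten_rows), len(one_pct))
```

Unlike the `np.random.choice(df.index, 10)` form it replaces, which samples
with replacement by default, `df.sample` does not repeat rows unless
`replace=True` is passed.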
 doc/source/comparison_with_r.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index a104cabae4ba7..edcf92ce510f4 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -57,8 +57,8 @@ R                                            pandas
 ``select(df, -(col1:col3))``                 ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
 ``distinct(select(df, col1))``               ``df.col1.unique()``
 ``distinct(select(df, col1, col2))``         ``df[['col1', 'col2']].drop_duplicates()``
-``sample_n(df, 10)``                         ``df.loc[np.random.choice(df.index, 10)]``
-``sample_frac(df, 0.01)``                    ``df.iloc[np.random.randint(0, len(df), .01 * len(flights))]``
+``sample_n(df, 10)``                         ``df.sample(n=10)``
+``sample_frac(df, 0.01)``                    ``df.sample(frac=0.01)``
 ===========================================  ===========================================
 
 .. [#select_range] R's shorthand for a subrange of columns

From f525ae7aba7a85553d2b2a50614261a1a7aa70f4 Mon Sep 17 00:00:00 2001
From: Leif Walsh <leif.walsh@gmail.com>
Date: Mon, 18 Apr 2016 11:31:58 -0400
Subject: [PATCH 3/4] addressed CR comments

 - sort -> sort_values
 - unique -> drop_duplicates
---
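Reviewer note (not part of the commit): a small sketch of the two renames
above, on a toy frame invented for illustration. It shows the reason for
the `unique` -> `drop_duplicates` swap as I understand it: `distinct()` in
dplyr returns a data frame, and `drop_duplicates()` on a one-column frame
does too, whereas `Series.unique()` returns a bare array.

```python
import pandas as pd

# Toy frame, invented for this note only.
df = pd.DataFrame({"col1": [2, 1, 2, 1], "col2": [4, 3, 1, 2]})

# arrange(df, col1, col2)
by_both = df.sort_values(["col1", "col2"])

# arrange(df, desc(col1))
by_col1_desc = df.sort_values("col1", ascending=False)

# distinct(select(df, col1)): stays a DataFrame
distinct_frame = df[["col1"]].drop_duplicates()

# The previous entry: returns a NumPy array, not a DataFrame
distinct_array = df.col1.unique()

print(type(distinct_frame).__name__, type(distinct_array).__name__)
```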
 doc/source/comparison_with_r.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index edcf92ce510f4..159b63d6c5551 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -55,7 +55,7 @@ R                                            pandas
 ``select(df, col1, col2)``                   ``df[['col1', 'col2']]``
 ``select(df, col1:col3)``                    No one-line equivalent, but see [#select_range]_
 ``select(df, -(col1:col3))``                 ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
-``distinct(select(df, col1))``               ``df.col1.unique()``
+``distinct(select(df, col1))``               ``df[['col1']].drop_duplicates()``
 ``distinct(select(df, col1, col2))``         ``df[['col1', 'col2']].drop_duplicates()``
 ``sample_n(df, 10)``                         ``df.sample(n=10)``
 ``sample_frac(df, 0.01)``                    ``df.sample(frac=0.01)``
@@ -75,8 +75,8 @@ Sorting
 ===========================================  ===========================================
 R                                            pandas
 ===========================================  ===========================================
-``arrange(df, col1, col2)``                  ``df.sort(['col1', 'col2'])``
-``arrange(df, desc(col1))``                  ``df.sort('col1', ascending=False)``
+``arrange(df, col1, col2)``                  ``df.sort_values(['col1', 'col2'])``
+``arrange(df, desc(col1))``                  ``df.sort_values('col1', ascending=False)``
 ===========================================  ===========================================
 
 Transforming

From 808eba1554d07fdfb1c194c29af4aef5a278fd7a Mon Sep 17 00:00:00 2001
From: Leif Walsh <leif.walsh@gmail.com>
Date: Mon, 18 Apr 2016 11:55:25 -0400
Subject: [PATCH 4/4] added select range of columns

---
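Reviewer note (not part of the commit): a minimal sketch of the label
slice used in the new entry, on a toy frame whose column names are made up
for illustration. Label slices with `.loc` include both endpoints, which
matches R's `col1:col3`. The `drop` line is only one possible way to spell
the negated form `select(df, -(col1:col3))`, not necessarily the one the
docs should recommend.

```python
import pandas as pd

# Toy frame, invented for this note only.
df = pd.DataFrame(
    {"col0": [0], "col1": [1], "col2": [2], "col3": [3], "col4": [4]}
)

# select(df, col1:col3): both endpoints are included
selected = df.loc[:, "col1":"col3"]

# select(df, -(col1:col3)): drop the same range by name
dropped = df.drop(selected.columns, axis=1)

print(list(selected.columns), list(dropped.columns))
```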
 doc/source/comparison_with_r.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index 159b63d6c5551..fad3d034c8d17 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -53,7 +53,7 @@ R                                            pandas
 ``filter(df, col1 == 1, col2 == 1)``         ``df.query('col1 == 1 & col2 == 1')``
 ``df[df$col1 == 1 & df$col2 == 1,]``         ``df[(df.col1 == 1) & (df.col2 == 1)]``
 ``select(df, col1, col2)``                   ``df[['col1', 'col2']]``
-``select(df, col1:col3)``                    No one-line equivalent, but see [#select_range]_
+``select(df, col1:col3)``                    ``df.loc[:, 'col1':'col3']``
 ``select(df, -(col1:col3))``                 ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
 ``distinct(select(df, col1))``               ``df[['col1']].drop_duplicates()``
 ``distinct(select(df, col1, col2))``         ``df[['col1', 'col2']].drop_duplicates()``