DOC: expanding comparison with R section #12472

Closed
wants to merge 4 commits into from
Changes from 2 commits
73 changes: 73 additions & 0 deletions doc/source/comparison_with_r.rst
@@ -31,6 +31,79 @@
For transfer of ``DataFrame`` objects from ``pandas`` to R, one option is to
use HDF5 files, see :ref:`io.external_compatibility` for an
example.


Quick Reference
---------------

We'll start off with a quick reference guide pairing some common R
operations using `dplyr
<http://cran.r-project.org/web/packages/dplyr/index.html>`__ with
pandas equivalents.


Querying, Filtering, Sampling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

=========================================== ===========================================
R                                           pandas
=========================================== ===========================================
``dim(df)``                                 ``df.shape``
``head(df)``                                ``df.head()``
``slice(df, 1:10)``                         ``df.iloc[:10]``
``filter(df, col1 == 1, col2 == 1)``        ``df.query('col1 == 1 & col2 == 1')``
``df[df$col1 == 1 & df$col2 == 1,]``        ``df[(df.col1 == 1) & (df.col2 == 1)]``
``select(df, col1, col2)``                  ``df[['col1', 'col2']]``
``select(df, col1:col3)``                   No one-line equivalent, but see [#select_range]_
Member

Isn't this equivalent to df.loc[:, 'col1':'col3'] ?

Contributor Author

@jorisvandenbossche based on my understanding of python, 'col1':'col3' would have to parse correctly as a range, and I don't think it does. But I'd be happy to be wrong.

Contributor

It does work in this case, I've updated that notebook here. I can never remember the rules on slicing unsorted indexes, so I prefer to be explicit. For the comparison though I think it's fine to use 'col1':'col3'

Member

If the labels are actual column names, this works perfectly as expected (just from the one label to the other, regardless of the order). It's only when you use labels that are not included, that the index needs to be sorted

Member

Can you update this as well?

Contributor Author

there you go

``select(df, -(col1:col3))``                ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
``distinct(select(df, col1))``              ``df.col1.unique()``
Member

Because distinct(...) returns data.frame, I think pandas equivalent is df[['col1']].drop_duplicates().

Contributor

+1

Contributor Author

done

``distinct(select(df, col1, col2))``        ``df[['col1', 'col2']].drop_duplicates()``
``sample_n(df, 10)``                        ``df.sample(n=10)``
``sample_frac(df, 0.01)``                   ``df.sample(frac=0.01)``
=========================================== ===========================================
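
As a rough illustration of the rows in this table, the sketch below runs several of the pandas spellings on a small made-up frame. The data and variable names are assumptions for the example only; ``random_state`` is added so the sampling is repeatable. It also shows the difference raised in review: ``unique()`` returns an array, while ``drop_duplicates()`` keeps a ``DataFrame``, which is closer to what ``distinct`` returns in dplyr.

```python
import pandas as pd

# Hypothetical data; the column names mirror the placeholders in the table.
df = pd.DataFrame({
    "col1": [1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1],
    "col2": [1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1],
})

# slice(df, 1:10): R is 1-based and inclusive, so this is the first ten rows.
first_ten = df.iloc[:10]

# filter(df, col1 == 1, col2 == 1): two equivalent spellings.
by_query = df.query("col1 == 1 & col2 == 1")
by_mask = df[(df.col1 == 1) & (df.col2 == 1)]

# distinct(select(df, col1)): unique() returns an array,
# drop_duplicates() keeps a one-column DataFrame.
as_array = df.col1.unique()
as_frame = df[["col1"]].drop_duplicates()

# sample_n(df, 10) and sample_frac(df, 0.5); random_state makes it repeatable.
ten_rows = df.sample(n=10, random_state=0)
half = df.sample(frac=0.5, random_state=0)
```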

.. [#select_range] R's shorthand for a subrange of columns
   (``select(df, col1:col3)``) can be approached
   cleanly in pandas if you have the list of columns,
   for example ``df[cols[1:3]]`` or
   ``df.drop(cols[1:3], axis=1)``, but doing this by column
   name is a bit messy.
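
The review thread on ``select(df, col1:col3)`` suggests label-based slicing with ``.loc`` as a by-name alternative. A minimal sketch of that suggestion follows, assuming a made-up frame whose column labels are unique, present in the frame, and deliberately not in sorted order. Label slices include both endpoints, unlike positional slices.

```python
import pandas as pd

# Hypothetical frame whose columns are deliberately not in sorted order.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns=["id", "col1", "zeta", "col3", "alpha"],
)

# select(df, col1:col3): label slices include both endpoints.
selected = df.loc[:, "col1":"col3"]

# select(df, -(col1:col3)): drop the same range of columns.
remaining = df.drop(columns=selected.columns)
```

As noted in the thread, the index only needs to be sorted when the slice uses labels that are not actually present among the columns.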


Sorting
~~~~~~~

=========================================== ===========================================
R                                           pandas
=========================================== ===========================================
``arrange(df, col1, col2)``                 ``df.sort(['col1', 'col2'])``
Member

sort is deprecated, pls change it to sort_values.

Contributor Author

done

``arrange(df, desc(col1))``                 ``df.sort('col1', ascending=False)``
=========================================== ===========================================
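
Following the review comment that ``sort`` is deprecated, the sorting rows can be sketched with ``sort_values`` instead. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col1": [2, 1, 2, 1], "col2": [4, 3, 1, 2]})

# arrange(df, col1, col2): sort by col1, breaking ties with col2.
ascending = df.sort_values(["col1", "col2"])

# arrange(df, desc(col1)): sort by col1 in descending order.
descending = df.sort_values("col1", ascending=False)
```

Both calls return a new frame and leave ``df`` unchanged.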

Transforming
~~~~~~~~~~~~

=========================================== ===========================================
R                                           pandas
=========================================== ===========================================
``select(df, col_one = col1)``              ``df.rename(columns={'col1': 'col_one'})['col_one']``
``rename(df, col_one = col1)``              ``df.rename(columns={'col1': 'col_one'})``
``mutate(df, c=a-b)``                       ``df.assign(c=df.a-df.b)``
=========================================== ===========================================
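
A small sketch of the transforming rows, using made-up data. One detail worth hedging: indexing with a single label, as in the first row of the table, returns a ``Series``; the double-bracket form used here is one way to keep a one-column ``DataFrame``, which is arguably closer to what dplyr's ``select`` returns.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col1": [1, 2], "a": [10, 20], "b": [3, 5]})

# rename(df, col_one = col1): returns a new frame, df itself is unchanged.
renamed = df.rename(columns={"col1": "col_one"})

# select(df, col_one = col1): keep only the renamed column as a DataFrame.
only_renamed = renamed[["col_one"]]

# mutate(df, c = a - b): add a derived column without modifying df.
with_c = df.assign(c=df.a - df.b)
```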


Grouping and Summarizing
~~~~~~~~~~~~~~~~~~~~~~~~

============================================== ===========================================
R                                              pandas
============================================== ===========================================
``summary(df)``                                ``df.describe()``
``gdf <- group_by(df, col1)``                  ``gdf = df.groupby('col1')``
``summarise(gdf, avg=mean(col1, na.rm=TRUE))`` ``df.groupby('col1').agg({'col1': 'mean'})``
``summarise(gdf, total=sum(col1))``            ``df.groupby('col1').sum()``
============================================== ===========================================
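
The grouping rows can be sketched as below. This is an illustration under assumptions, not the table's exact calls: the table aggregates ``col1``, the grouping column itself, whereas the example aggregates a separate, hypothetical ``col2`` so the result is easier to read. It also shows that pandas skips missing values by default in both ``mean`` and ``sum``.

```python
import pandas as pd

# Hypothetical data; col2 is an assumed value column, not part of the table.
df = pd.DataFrame({
    "col1": ["x", "x", "y", "y"],
    "col2": [1.0, None, 3.0, 5.0],
})

# gdf <- group_by(df, col1)
gdf = df.groupby("col1")

# summarise(gdf, avg = mean(col2, na.rm = TRUE)): pandas skips NaN by default.
avg = gdf["col2"].mean()

# summarise(gdf, total = sum(col2)): pandas also skips NaN here,
# whereas R's sum() without na.rm = TRUE would return NA for group "x".
total = gdf["col2"].sum()
```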


Base R
------
