Add SparseSeries.to_coo method, a single test and one example. #9076
Conversation
df.iloc[3:-2,] = np.nan
df.iloc[:3,2:] = np.nan
df.iloc[-2:,:2] = np.nan
df.columns = MultiIndex.from_tuples([(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b'), (2, 2, 'c')]).T
`.T` on a MultiIndex doesn't do anything (it's only there for numpy compat)
I simplified the test (and removed the .T). Simplifying the test made me realize the implementation was not correct, so I have made some changes and added another test for sorted labels and symmetry.
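The reviewer's point about `.T` being a numpy-compat no-op can be checked directly. A minimal sketch, using the same tuples as the diff above:

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b'), (2, 2, 'c')])

# .T exists on Index objects only for the numpy interface; it returns
# the index unchanged, so appending it here has no effect
assert mi.T.equals(mi)
```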
@@ -0,0 +1,24 @@
from pandas import *
from numpy import nan, array
something like this should just be in the sparse docs (with a release note section as well).
@cottrell need a more robust way to convert. I'll have a look and see if I can help you along here.
I am pushing micro commits to this branch. Please let me know if this is not good GitHub gitiquette and I can squash or save up for a bigger commit. I will hopefully get a few solid blocks of time after Friday to make scipy_sparse.py less of a hack.
@cottrell not a problem -- we'll ask you to squash at the end before merging anyways.
at the very end you can rebase/squash
I think it is probably ready for some more feedback. I've tried to do the following:
I am not yet able to run vbench (maybe I should try on Python 2.7?).
(2, 1, 'b', 0),
(2, 1, 'b', 1)])

ss = s.to_sparse()  # SparseSeries
Show `s` here (just put it on a line by itself).
I have incorporated your comments and have reorganized the tests. I have also added a from_coo method. I can take the from_coo out if that is just complicating things. The vbench stuff is still untested unless it runs in Travis CI somehow.
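For readers following along today: `SparseSeries` was removed in pandas 1.0, so the `s.to_sparse()` call in the diff above no longer runs as written. A sketch of the same example in the modern spelling (sparse dtype plus the `.sparse` accessor, which is where this PR's `to_coo` ultimately landed; requires scipy):

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples(
    [(1, 2, 'a', 0), (1, 2, 'a', 1), (1, 1, 'b', 0),
     (1, 1, 'b', 1), (2, 1, 'b', 0), (2, 1, 'b', 1)])

ss = s.astype("Sparse")  # modern replacement for s.to_sparse()

# rows come from index levels 0-1, columns from levels 2-3
A, rows, cols = ss.sparse.to_coo(row_levels=[0, 1],
                                 column_levels=[2, 3],
                                 sort_labels=True)
# A is a scipy.sparse.coo_matrix with one entry per non-null value
```

Here `A` has shape (3, 4) -- three distinct (level 0, level 1) row labels and four distinct (level 2, level 3) column labels -- and three stored values, one per non-null entry of `s`.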
il
jl

``pandas.sparse.series.from_coo``
If you do this: :meth:`~pandas.SparseSeries.from_coo` then these will show up as links to the API docs.
@cottrell looking pretty good. pls squash when you have a chance as well.
# to keep things simple, only rely on integer indexing (not labels)
ilevels = [ss.index._get_level_number(x) for x in ilevels]
jlevels = [ss.index._get_level_number(x) for x in jlevels]
ss = ss.copy()
do you really want to copy the entire sparse series here? That could be expensive.
I have a strong suspicion you can optimize many of these internal steps by using a MultiIndex. That would speed things up a lot (and simplify much of the logic). Otherwise, the API looks reasonable, though I would use variables with meaningful names like
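The `_get_level_number` calls in the diff above lean on a private MultiIndex helper. A hypothetical public-API spelling of the same name-or-position resolution (the `row_levels` input and variable names here are illustrative, not from the PR):

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([(1, 'a', 0), (2, 'b', 1)],
                               names=["row", "col", "extra"])

# hypothetical user input mixing a level name and an integer position
row_levels = ["row", 1]

# resolve each entry to an integer level number using only public attributes:
# mi.names is a list subclass, so .index() finds a name's position
level_numbers = [mi.names.index(x) if isinstance(x, str) else x
                 for x in row_levels]
assert level_numbers == [0, 1]
```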
blocs = ss._data.values.sp_index.blocs
blength = ss._data.values.sp_index.blengths
nonnull_labels = list(
    itertools.chain(*[ss.index.values[i:(i + j)] for i, j in zip(blocs, blength)]))
If this is the slow step (which seems likely), you could write a little routine to do this in a loop in Cython.
Obviously, performance is not strictly necessary here for this first draft, but it's something to think about.
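The pattern in the diff above expands block (start, length) pairs into individual positions. A self-contained sketch of just that expansion step, with hypothetical block data standing in for `sp_index.blocs` / `sp_index.blengths`:

```python
import itertools

# hypothetical block starts and lengths, in the style of a sparse BlockIndex:
# one block of 2 values starting at position 0, one block of 1 value at position 3
blocs, blengths = [0, 3], [2, 1]

# expand each (start, length) pair into the positions it covers,
# then chain the per-block ranges into one flat list
positions = list(itertools.chain(
    *[range(i, i + j) for i, j in zip(blocs, blengths)]))
# positions is now [0, 1, 3]
```

This is the pure-Python loop the reviewer suggests could be moved into Cython if it turns out to be the slow step.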
Just to clarify, you don't need to vectorize everything here according to my suggestions. For a first draft, slow is better than not at all.
I will hopefully vectorize soon, as I think it will make the code more readable. I sort of feel this whole thing, in its cleanest form, should really be just a trivial application of a groupby on the levels, but I am hitting a problem that I think I hit before, which would push me down another path. Does anyone know if the following is a bug? I searched quickly but did not find anything. The problem seems to occur only for groupby on index (with levels):
@cottrell can you show how you created
I have incorporated some of the comments from @shoyer, which has shortened the code quite a bit. This included the patch at #9444. There are some TODOs to review ... worried a bit about relying on the behaviour of some pandas operations for certain steps now (for example, dropna on sparse series). Also, have added examples to the to/from_coo docstrings. On the downside, the performance is now much, much worse. It looks to me like this is due to the slowness of groupby on MultiIndex levels: (new version) http://nbviewer.ipython.org/github/cottrell/notebooks/blob/master/pandas%20scipy.sparse%20%28with%20groupby%29%20timiing.ipynb I reduced the size of the test matrices in the new version as it was running so slowly.
Compare 21451f0 to 608f48a
Performance was too bad using the groupby method, so I have reverted to the older python method and left the groupby method commented out (lines 56-57 of scipy_sparse.py). Some timing results here suggest it is some overhead in the groupby methods that is the difference. Please let me know if you have any ideas about MultiIndex groupby performance. I have not tried yet, but a diff of pstats output might be useful.
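One common alternative to a groupby here, offered as a hedged suggestion rather than what the PR ended up doing: `pd.factorize` maps a level's values to dense integer codes directly, which is exactly what coo row/column coordinates need, without any group-aggregation machinery.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (2, 'a'), (1, 'a')])

# integer coordinates for level 0, with uniques sorted for stable ordering
codes, uniques = pd.factorize(mi.get_level_values(0), sort=True)
# codes   -> [0, 0, 1, 0]
# uniques -> [1, 2]
```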
# from the SparseSeries: get the labels and data for non-null entries
values = ss._data.values._valid_sp_values

# TODO: is replacing this code with dropna below safe?
Yes, I think this is safe.
Thanks, I've dropped that chunk of code and the TODO.
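The `dropna`-instead-of-`_valid_sp_values` question above can be illustrated with the modern sparse dtype (a sketch; `SparseSeries` itself no longer exists in current pandas):

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, np.nan]).astype("Sparse")

# dropna keeps exactly the non-null stored values and their index labels,
# which is what _valid_sp_values was being used for
valid = s.dropna()
# valid values -> [3.0, 1.0], at index positions [0, 2]
```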
@cottrell There is an error in the docs:
Yes, this is hopefully addressed here. I was building docs only with Python 3, so I didn't catch this on my setup.
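For completeness, the inverse direction discussed in this thread also landed and survives in modern pandas as `pd.Series.sparse.from_coo`. A minimal sketch (requires scipy; the matrix here is hypothetical example data):

```python
import pandas as pd
from scipy import sparse

# a 3x4 coo matrix with three stored values
A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
                      shape=(3, 4))

# build a sparse Series whose MultiIndex is the (row, col) coordinates
ss = pd.Series.sparse.from_coo(A)
# ss has 3 entries; ss.loc[(1, 0)] is 3.0
```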
xref #4343
closes #8048
This passes nosetests run locally but is failing on Travis. I think the latest Travis CI changes might have caused the failures. I was able to get master to pass on Travis last week, but now it is failing.