Add SparseSeries.to_coo method, a single test and one example. #9076


Closed
wants to merge 1 commit

Conversation

cottrell
Contributor

xref #4343
closes #8048

This passes nosetests when run locally but is failing on Travis. I think the latest Travis CI changes might have caused the failures: I was able to get master to pass on Travis last week, but now it is failing.

df.iloc[3:-2,] = np.nan
df.iloc[:3,2:] = np.nan
df.iloc[-2:,:2] = np.nan
df.columns = MultiIndex.from_tuples([(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b'), (2, 2, 'c')]).T
Member


.T on a MultiIndex doesn't do anything (it's only there for numpy compat)
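For illustration (a minimal sketch, not part of the PR), the no-op behavior can be seen directly:

```python
import pandas as pd

# .T on an Index/MultiIndex exists only for numpy compatibility: unlike
# ndarray.T, it does not transpose anything and returns the index unchanged.
idx = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (2, 'a')])
print(idx.T.equals(idx))  # True
```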

@cottrell
Contributor Author

I simplified the test (and removed the .T). Simplifying the test made me realize the implementation was not correct, so I have made some changes and added another test for sorted labels and symmetry.

@@ -0,0 +1,24 @@
from pandas import *
from numpy import nan, array
Contributor


something like this should just be in the sparse docs (with a release note section as well).

@jreback
Contributor

jreback commented Dec 16, 2014

@cottrell need a more robust way to convert. I'll have a look and see if I can help you along here.

@cottrell
Contributor Author

I am pushing micro commits to this branch. Please let me know if this is not good GitHub gitiquette and I can squash or save up for a bigger commit. I will hopefully get a few solid blocks of time after Friday to make scipy_sparse.py less of a hack.

@shoyer
Member

shoyer commented Dec 17, 2014

@cottrell not a problem -- we'll ask you to squash at the end before merging anyways.

@jreback
Contributor

jreback commented Dec 17, 2014

at the very end you can rebase/squash
until then you can do what u want

@cottrell
Contributor Author

I think it is probably ready for some more feedback. I've tried to do the following:

  1. cleanup scipy_sparse.py.
  2. Add SparseSeries.to_coo to api doc.
  3. Add subsection about scipy.sparse in Sparse documentation.
  4. Write vbench test.

I am not yet able to run vbench (maybe I should try on python 2.7?).

@jreback jreback added this to the 0.16.0 milestone Jan 2, 2015
(2, 1, 'b', 0),
(2, 1, 'b', 1)])

ss = s.to_sparse() # SparseSeries
Contributor


show s here (just put it on a line by itself)

@cottrell
Contributor Author

cottrell commented Jan 5, 2015

I have incorporated your comments and have reorganized the tests. I have also added a from_coo method. I can take the from_coo out if that is just complicating things. The vbench stuff is still untested unless it runs in Travis CI somehow.
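For context, the essence of a from_coo-style conversion (a hedged sketch using plain pandas and scipy, not the PR's SparseSeries wrapper) is to index the COO nonzeros by a (row, col) MultiIndex:

```python
import pandas as pd
from scipy import sparse

# Build a small COO matrix: values with explicit (row, col) coordinates.
A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([0, 0, 1], [0, 2, 1])), shape=(3, 4))

# from_coo boils down to: one Series entry per nonzero, keyed by (row, col).
s = pd.Series(A.data, index=pd.MultiIndex.from_arrays([A.row, A.col],
                                                      names=['row', 'col']))
print(s.loc[(0, 2)])  # 1.0
```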

il
jl

``pandas.sparse.series.from_coo``
Contributor


If you do this: :meth:`~pandas.SparseSeries.from_coo` then these will show up as links to the API docs

@jreback
Contributor

jreback commented Jan 6, 2015

@cottrell looking pretty good. pls squash when you have a chance as well.

# to keep things simple, only rely on integer indexing (not labels)
ilevels = [ss.index._get_level_number(x) for x in ilevels]
jlevels = [ss.index._get_level_number(x) for x in jlevels]
ss = ss.copy()
Member


do you really want to copy the entire sparse series here? That could be expensive.

@shoyer
Member

shoyer commented Jan 30, 2015

I have a strong suspicion you can optimize many of these internal steps by using a MultiIndex. That would speed things up a lot (and simplify much of the logic).

Otherwise, the API looks reasonable, though I would use variables with meaningful names like rows/columns instead of i/j and return MultiIndexes.
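A minimal sketch of the coordinate extraction using the rows/columns naming suggested here (hypothetical illustration, not the PR's implementation; assumes scipy is available):

```python
import pandas as pd
from scipy import sparse

s = pd.Series([3.0, 1.0, 2.0],
              index=pd.MultiIndex.from_tuples(
                  [('a', 'x'), ('a', 'y'), ('b', 'x')], names=['row', 'col']))

# factorize maps each chosen level to integer coordinates plus its uniques,
# which is exactly the (i, j, labels) information a COO matrix needs.
i, rows = pd.factorize(s.index.get_level_values('row'))
j, columns = pd.factorize(s.index.get_level_values('col'))
A = sparse.coo_matrix((s.to_numpy(), (i, j)),
                      shape=(len(rows), len(columns)))
print(A.toarray())
```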

blocs = ss._data.values.sp_index.blocs
blength = ss._data.values.sp_index.blengths
nonnull_labels = list(
    itertools.chain(*[ss.index.values[i:(i + j)] for i, j in zip(blocs, blength)]))
Member


If this is the slow step (which seems likely), you could write a little routine to do this in a loop in Cython.
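Alternatively, the block expansion itself can be vectorized in NumPy without dropping to Cython (a sketch with made-up block data, not the PR's code):

```python
import numpy as np

# A block sparse index stores runs of non-null values as (start, length) pairs.
blocs = np.array([0, 5])      # hypothetical block starts
blengths = np.array([3, 2])   # hypothetical block lengths

# Expand the runs into flat positions with one arange per block instead of
# a Python-level loop per element.
positions = np.concatenate(
    [np.arange(b, b + n) for b, n in zip(blocs, blengths)])
print(positions)  # [0 1 2 5 6]
```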

@shoyer
Member

shoyer commented Jan 30, 2015

Obviously, performance is not strictly necessary here for this first draft but it's something to think about.

@shoyer
Member

shoyer commented Feb 3, 2015

Just to clarify, you don't need to vectorize everything here according to my suggestions. For a first draft, slow is better than not at all.

@cottrell
Contributor Author

cottrell commented Feb 4, 2015

I will hopefully vectorize soon, as I think it will make the code more readable. In its cleanest form, this whole thing should really be a trivial application of groupby on the levels, but I am hitting a problem that I think I have hit before, which would push me down another path. Does anyone know if the following is a bug? I searched quickly but did not find anything. The problem seems to occur only for groupby on index levels:

In [40]: i = pandas.MultiIndex.from_tuples([(1, 2, 'a', 0),
                                   (1, 2, 'a', 1),
                                   (1, 1, 'b', 0),
                                   (1, 1, 'b', 1),
                                   (2, 1, 'b', 0),
                                   (2, 1, 'b', 1)], names=['a', 'b', 'c', 'd'])

In [41]: a = pandas.Series([0, 1, 2, 3, 4, 5], index=i)

In [42]: a
Out[42]:
a  b  c  d
1  2  a  0    0
         1    1
   1  b  0    2
         1    3
2  1  b  0    4
         1    5
dtype: int64

In [43]: a.groupby(level=['a', 'b'], sort=False).first()
Out[43]:
a  b
1  1    2
   2    0
2  1    4
dtype: int64

In [44]: a.groupby(level=['a', 'b'], sort=True).first()
Out[44]:
a  b
1  1    2
   2    0
2  1    4
dtype: int64

In [45]: a.reset_index().groupby(['a', 'b'], sort=False)[0].first()
Out[45]:
a  b
1  2    0
   1    2
2  1    4
Name: 0, dtype: int64
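For reference, in current pandas (where this sort issue is fixed), level-based groupby with sort=False preserves the order of first appearance, matching the reset_index version above; a minimal sketch:

```python
import pandas as pd

i = pd.MultiIndex.from_tuples(
    [(1, 2, 'a', 0), (1, 2, 'a', 1), (1, 1, 'b', 0),
     (1, 1, 'b', 1), (2, 1, 'b', 0), (2, 1, 'b', 1)],
    names=['a', 'b', 'c', 'd'])
s = pd.Series([0, 1, 2, 3, 4, 5], index=i)

# sort=False: groups come out in order of first appearance, not sorted order.
out = s.groupby(level=['a', 'b'], sort=False).first()
print(list(out.index))  # [(1, 2), (1, 1), (2, 1)]
```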

@shoyer
Member

shoyer commented Feb 5, 2015

@cottrell can you show how you created `a` in this example? Also, `a.index.levels` and `a.index.labels` would likely demystify this.

@cottrell
Contributor Author

cottrell commented Feb 5, 2015

@shoyer sorry forgot to include the first bit. Have updated.

Actually, I've posted the sort problem and what seems to fix it here #9444

I think the problem might occur in a few other places that were not hitting me.

@cottrell
Contributor Author

cottrell commented Feb 9, 2015

I have incorporated some of the comments from @shoyer, which has shortened the code quite a bit. This includes the patch at #9444. There are some TODOs to review; I am a bit worried about relying on the behaviour of some pandas operations for certain steps now (for example, dropna on a sparse series).

Also, have added examples to the to/from_coo docstrings.

On the downside, the performance is now much, much worse. It looks to me like this is due to the slowness of groupby on MultiIndex levels:

(new version) http://nbviewer.ipython.org/github/cottrell/notebooks/blob/master/pandas%20scipy.sparse%20%28with%20groupby%29%20timiing.ipynb
(old version)
http://nbviewer.ipython.org/github/cottrell/notebooks/blob/master/Performance%20scipy.sparse%20examples.ipynb

I reduced the size of the test matrices in the new version as it was running so slowly.

@cottrell
Contributor Author

Performance was too bad using the groupby method, so I have reverted to the older Python method and left the groupby method commented out (lines 56-57 of scipy_sparse.py).

Some timing results here suggest the difference is overhead in the groupby methods:
http://nbviewer.ipython.org/github/cottrell/notebooks/blob/master/pandas%20scipy.sparse%20performance.ipynb

Please let me know if you have any ideas about MultiIndex groupby performance. I have not tried yet, but a diff of pstats output might be useful.
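One idea (a sketch, not the PR's code): sidestep per-group tuple hashing by collapsing the MultiIndex's per-level integer codes into a single group id. Note the attribute was called `labels` in 2015-era pandas and is `codes` in modern pandas; this assumes no missing values (no -1 codes):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 'a'), (1, 'a'), (2, 'b'), (1, 'a')], names=['x', 'y'])

# Combine the per-level integer codes into one flat group id per row,
# treating the codes as coordinates into the grid of level values.
shape = [len(level) for level in idx.levels]
group_ids = np.ravel_multi_index(idx.codes, shape)
print(group_ids)  # rows 0, 1 and 3 share a group id; row 2 differs
```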

# from the SparseSeries: get the labels and data for non-null entries
values = ss._data.values._valid_sp_values

# TODO: is replacing this code with dropna below safe?
Member


Yes, I think this is safe.

Contributor Author


Thanks, I've dropped that chunk of code and the TODO.

@jreback
Contributor

jreback commented Mar 3, 2015

merged via 9ac01a7

@cottrell thank you very much for this!

pls review built docs (give about an hour for travis to build). Pls submit a follow up pr if needed (or for any changes that you think are necessary; or open an issue).

@jreback jreback closed this Mar 3, 2015
@jorisvandenbossche
Member

@cottrell There is an error in the docs:

>>>-------------------------------------------------------------------------
Exception in /tmp/doc/source/sparse.rst at block ending on line 195
Specify :okexcept: as an option in the ipython:: block to suppress this message
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-a2b3f3f12d26> in <module>()
      1 A, rows, columns = ss.to_coo(row_levels=['A', 'B', 'C'],
      2                              column_levels=['D'],
----> 3                              sort_labels=False)

/home/travis/build/pydata/pandas/pandas/sparse/series.pyc in to_coo(self, row_levels, column_levels, sort_labels)
    709         """
    710         A, rows, columns = _sparse_series_to_coo(
--> 711             self, row_levels, column_levels, sort_labels=sort_labels)
    712         return A, rows, columns
    713

/home/travis/build/pydata/pandas/pandas/sparse/scipy_sparse.pyc in _sparse_series_to_coo(ss, row_levels, column_levels, sort_labels)
    113
    114     v, i, j, rows, columns = _to_ijv(
--> 115         ss, row_levels=row_levels, column_levels=column_levels, sort_labels=sort_labels)
    116     sparse_matrix = scipy.sparse.coo_matrix(
    117         (v, (i, j)), shape=(len(rows), len(columns)))

/home/travis/build/pydata/pandas/pandas/sparse/scipy_sparse.pyc in _to_ijv(ss, row_levels, column_levels, sort_labels)
     90
     91     i_coord, i_labels = get_indexers(row_levels)
---> 92     j_coord, j_labels = get_indexers(column_levels)
     93
     94     return values, i_coord, j_coord, i_labels, j_labels

/home/travis/build/pydata/pandas/pandas/sparse/scipy_sparse.pyc in get_indexers(levels)
     80
     81     labels_to_i = _get_index_subset_to_coord_dict(
---> 82         ss.index, levels, sort_labels=sort_labels)
     83     #######################################################################
     84     #######################################################################

/home/travis/build/pydata/pandas/pandas/sparse/scipy_sparse.pyc in _get_index_subset_to_coord_dict(index, subset, sort_labels)
     74         ilabels, sort_labels=sort_labels)
     75     labels_to_i = Series(labels_to_i)
---> 76     labels_to_i.index = MultiIndex.from_tuples(labels_to_i.index)
     77     labels_to_i.index.names = [index.names[i] for i in subset]
     78     labels_to_i.name = 'value'

/home/travis/build/pydata/pandas/pandas/core/index.pyc in from_tuples(cls, tuples, sortorder, names)
   3604             tuples = tuples.values
   3605
-> 3606         arrays = list(lib.tuples_to_object_array(tuples).T)
   3607         elif isinstance(tuples, list):
   3608             arrays = list(lib.to_object_array_tuples(tuples).T)

/home/travis/build/pydata/pandas/pandas/lib.so in pandas.lib.tuples_to_object_array (pandas/lib.c:54298)()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
<<<-------------------------------------------------------------------------

@cottrell
Contributor Author

cottrell commented Mar 4, 2015

@jorisvandenbossche

Yes, this is hopefully addressed here:

#9583

I was building docs only with Python 3, so I didn't catch this on my setup.

Labels: API Design, Sparse, Sparse Data Type

Successfully merging this pull request may close these issues.

pandas MultiIndex series "unstack" to scipy sparse array functionality
4 participants