Add SparseSeries.to_coo method, a single test and one example. #9076


Closed
wants to merge 1 commit

Conversation

cottrell
Contributor

xref #4343
closes #8048

This passes nosetests when run locally but is failing on Travis. I think the latest Travis CI changes might have caused the failures: I was able to get master to pass on Travis last week, but now it is failing.

df.iloc[3:-2,] = np.nan
df.iloc[:3,2:] = np.nan
df.iloc[-2:,:2] = np.nan
df.columns = MultiIndex.from_tuples([(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b'), (2, 2, 'c')]).T
Member


.T on a MultiIndex doesn't do anything (it's only there for numpy compat)
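For illustration (a minimal sketch, not part of the PR), the no-op behavior can be seen directly:

```python
import pandas as pd

# .T on an Index/MultiIndex exists only for numpy compatibility: unlike
# ndarray.T, it does not transpose anything and returns the index unchanged.
idx = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (2, 'a')])
print(idx.T.equals(idx))  # True
```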

@cottrell
Contributor Author

I simplified the test (and removed the .T). Simplifying the test made me realize the implementation was not correct, so I have made some changes and added another test for sorted labels and symmetry.

@@ -0,0 +1,24 @@
from pandas import *
from numpy import nan, array
Contributor


something like this should just be in the sparse docs (with a release note section as well).

@jreback
Contributor

jreback commented Dec 16, 2014

@cottrell need a more robust way to convert. I'll have a look and see if I can help you along here.

@cottrell
Contributor Author

I am pushing micro commits to this branch. Please let me know if this is not good GitHub gitiquette and I can squash or save up for a bigger commit. I will hopefully get a few solid blocks of time after Friday to make scipy_sparse.py less of a hack.

@shoyer
Member

shoyer commented Dec 17, 2014

@cottrell not a problem -- we'll ask you to squash at the end before merging anyways.

@jreback
Contributor

jreback commented Dec 17, 2014

at the very end you can rebase/squash
until then you can do what u want

@cottrell
Contributor Author

I think it is probably ready for some more feedback. I've tried to do the following:

  1. cleanup scipy_sparse.py.
  2. Add SparseSeries.to_coo to api doc.
  3. Add subsection about scipy.sparse in Sparse documentation.
  4. Write vbench test.

I am not yet able to run vbench (maybe I should try on python 2.7?).

@jreback jreback added this to the 0.16.0 milestone Jan 2, 2015
(2, 1, 'b', 0),
(2, 1, 'b', 1)])

ss = s.to_sparse() # SparseSeries
Contributor


show s here (just put it on a line by itself)

@cottrell
Contributor Author

cottrell commented Jan 5, 2015

I have incorporated your comments and have reorganized the tests. I have also added a from_coo method. I can take the from_coo out if that is just complicating things. The vbench stuff is still untested unless it runs in Travis CI somehow.
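For context, the essence of a from_coo-style conversion (a hedged sketch using plain pandas and scipy, not the PR's SparseSeries wrapper) is to index the COO nonzeros by a (row, col) MultiIndex:

```python
import pandas as pd
from scipy import sparse

# Build a small COO matrix: values with explicit (row, col) coordinates.
A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([0, 0, 1], [0, 2, 1])), shape=(3, 4))

# from_coo boils down to: one Series entry per nonzero, keyed by (row, col).
s = pd.Series(A.data, index=pd.MultiIndex.from_arrays([A.row, A.col],
                                                      names=['row', 'col']))
print(s.loc[(0, 2)])  # 1.0
```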

il
jl

``pandas.sparse.series.from_coo``
Contributor


If you do this: :meth:`~pandas.SparseSeries.from_coo` then these will show up as links to the API docs

@jreback
Contributor

jreback commented Jan 6, 2015

@cottrell looking pretty good. pls squash when you have a chance as well.

# to keep things simple, only rely on integer indexing (not labels)
ilevels = [ss.index._get_level_number(x) for x in ilevels]
jlevels = [ss.index._get_level_number(x) for x in jlevels]
ss = ss.copy()
Member


do you really want to copy the entire sparse series here? That could be expensive.

@shoyer
Member

shoyer commented Jan 30, 2015

I have a strong suspicion you can optimize many of these internal steps by using a MultiIndex. That would speed things up a lot (and simplify much of the logic).

Otherwise, the API looks reasonable, though I would use variables with meaningful names like rows/columns instead of i/j and return MultiIndexes.
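A minimal sketch of the coordinate extraction using the rows/columns naming suggested here (hypothetical illustration, not the PR's implementation; assumes scipy is available):

```python
import pandas as pd
from scipy import sparse

s = pd.Series([3.0, 1.0, 2.0],
              index=pd.MultiIndex.from_tuples(
                  [('a', 'x'), ('a', 'y'), ('b', 'x')], names=['row', 'col']))

# factorize maps each chosen level to integer coordinates plus its uniques,
# which is exactly the (i, j, labels) information a COO matrix needs.
i, rows = pd.factorize(s.index.get_level_values('row'))
j, columns = pd.factorize(s.index.get_level_values('col'))
A = sparse.coo_matrix((s.to_numpy(), (i, j)),
                      shape=(len(rows), len(columns)))
print(A.toarray())
```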

blocs = ss._data.values.sp_index.blocs
blength = ss._data.values.sp_index.blengths
nonnull_labels = list(
    itertools.chain(*[ss.index.values[i:(i + j)] for i, j in zip(blocs, blength)]))
Member


If this is the slow step (which seems likely), you could write a little routine to do this in a loop in Cython.
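Alternatively, the block expansion itself can be vectorized in NumPy without dropping to Cython (a sketch with made-up block data, not the PR's code):

```python
import numpy as np

# A block sparse index stores runs of non-null values as (start, length) pairs.
blocs = np.array([0, 5])      # hypothetical block starts
blengths = np.array([3, 2])   # hypothetical block lengths

# Expand the runs into flat positions with one arange per block instead of
# a Python-level loop per element.
positions = np.concatenate(
    [np.arange(b, b + n) for b, n in zip(blocs, blengths)])
print(positions)  # [0 1 2 5 6]
```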

@shoyer
Member

shoyer commented Jan 30, 2015

Obviously, performance is not strictly necessary here for this first draft but it's something to think about.

@shoyer
Member

shoyer commented Feb 3, 2015

Just to clarify, you don't need to vectorize everything here according to my suggestions. For a first draft, slow is better than not at all.

@cottrell
Contributor Author

cottrell commented Feb 4, 2015

I will hopefully vectorize soon, as I think it will make the code more readable. In its cleanest form, this whole thing should really be a trivial application of groupby on the levels, but I am hitting a problem that I think I have hit before, which would push me down another path. Does anyone know if the following is a bug? I searched quickly but did not find anything. The problem seems to occur only for groupby on index levels:

In [40]: i = pandas.MultiIndex.from_tuples([(1, 2, 'a', 0),
                                   (1, 2, 'a', 1),
                                   (1, 1, 'b', 0),
                                   (1, 1, 'b', 1),
                                   (2, 1, 'b', 0),
                                   (2, 1, 'b', 1)], names=['a', 'b', 'c', 'd'])

In [41]: a = pandas.Series([0, 1, 2, 3, 4, 5], index=i)

In [42]: a
Out[42]:
a  b  c  d
1  2  a  0    0
         1    1
   1  b  0    2
         1    3
2  1  b  0    4
         1    5
dtype: int64

In [43]: a.groupby(level=['a', 'b'], sort=False).first()
Out[43]:
a  b
1  1    2
   2    0
2  1    4
dtype: int64

In [44]: a.groupby(level=['a', 'b'], sort=True).first()
Out[44]:
a  b
1  1    2
   2    0
2  1    4
dtype: int64

In [45]: a.reset_index().groupby(['a', 'b'], sort=False)[0].first()
Out[45]:
a  b
1  2    0
   1    2
2  1    4
Name: 0, dtype: int64
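For reference, in current pandas (where this sort issue is fixed), level-based groupby with sort=False preserves the order of first appearance, matching the reset_index version above; a minimal sketch:

```python
import pandas as pd

i = pd.MultiIndex.from_tuples(
    [(1, 2, 'a', 0), (1, 2, 'a', 1), (1, 1, 'b', 0),
     (1, 1, 'b', 1), (2, 1, 'b', 0), (2, 1, 'b', 1)],
    names=['a', 'b', 'c', 'd'])
s = pd.Series([0, 1, 2, 3, 4, 5], index=i)

# sort=False: groups come out in order of first appearance, not sorted order.
out = s.groupby(level=['a', 'b'], sort=False).first()
print(list(out.index))  # [(1, 2), (1, 1), (2, 1)]
```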

@shoyer
Member

shoyer commented Feb 5, 2015

@cottrell can you show how you created `a` in this example? Also, `a.index.levels` and `a.index.labels` would likely demystify this.

@cottrell
Contributor Author

cottrell commented Feb 5, 2015

@shoyer sorry forgot to include the first bit. Have updated.

Actually, I've posted the sort problem and what seems to fix it here #9444

I think the problem might occur in a few other places that were not hitting me.

@cottrell
Contributor Author

cottrell commented Feb 9, 2015

I have incorporated some of the comments from @shoyer, which has shortened the code quite a bit. This includes the patch at #9444. There are some TODOs to review; I am a bit worried about relying on the behaviour of some pandas operations for certain steps now (for example, dropna on a sparse series).

Also, have added examples to the to/from_coo docstrings.

On the downside, the performance is now much, much worse. It looks to me like this is due to the slowness of groupby on MultiIndex levels:

(new version) http://nbviewer.ipython.org/github/cottrell/notebooks/blob/master/pandas%20scipy.sparse%20%28with%20groupby%29%20timiing.ipynb
(old version)
http://nbviewer.ipython.org/github/cottrell/notebooks/blob/master/Performance%20scipy.sparse%20examples.ipynb

I reduced the size of the test matrices in the new version as it was running so slowly.

@cottrell
Contributor Author

Performance was too bad using the groupby method, so I have reverted to the older Python method and left the groupby method commented out (lines 56-57 of scipy_sparse.py).

Some timing results here suggest the difference is overhead in the groupby methods:
http://nbviewer.ipython.org/github/cottrell/notebooks/blob/master/pandas%20scipy.sparse%20performance.ipynb

Please let me know if you have any ideas about MultiIndex groupby performance. I have not tried yet, but a diff of pstats output might be useful.
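One idea (a sketch, not the PR's code): sidestep per-group tuple hashing by collapsing the MultiIndex's per-level integer codes into a single group id. Note the attribute was called `labels` in 2015-era pandas and is `codes` in modern pandas; this assumes no missing values (no -1 codes):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 'a'), (1, 'a'), (2, 'b'), (1, 'a')], names=['x', 'y'])

# Combine the per-level integer codes into one flat group id per row,
# treating the codes as coordinates into the grid of level values.
shape = [len(level) for level in idx.levels]
group_ids = np.ravel_multi_index(idx.codes, shape)
print(group_ids)  # rows 0, 1 and 3 share a group id; row 2 differs
```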

# from the SparseSeries: get the labels and data for non-null entries
values = ss._data.values._valid_sp_values

# TODO: is replacing this code with dropna below safe?
Member


Yes, I think this is safe.

Contributor Author


Thanks, I've dropped that chunk of code and the TODO.

@jreback
Contributor

jreback commented Mar 3, 2015

merged via 9ac01a7

@cottrell thank you very much for this!

pls review built docs (give about an hour for travis to build). Pls submit a follow up pr if needed (or for any changes that you think are necessary; or open an issue).

@jreback jreback closed this Mar 3, 2015
@jorisvandenbossche
Member

@cottrell There is an error in the docs:

>>>-------------------------------------------------------------------------
Exception in /tmp/doc/source/sparse.rst at block ending on line 195
Specify :okexcept: as an option in the ipython:: block to suppress this message
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-a2b3f3f12d26> in <module>()
      1 A, rows, columns = ss.to_coo(row_levels=['A', 'B', 'C'],
      2                              column_levels=['D'],
----> 3                              sort_labels=False)

/home/travis/build/pydata/pandas/pandas/sparse/series.pyc in to_coo(self, row_levels, column_levels, sort_labels)
    709         """
    710         A, rows, columns = _sparse_series_to_coo(
--> 711             self, row_levels, column_levels, sort_labels=sort_labels)
    712         return A, rows, columns
    713

/home/travis/build/pydata/pandas/pandas/sparse/scipy_sparse.pyc in _sparse_series_to_coo(ss, row_levels, column_levels, sort_labels)
    113
    114     v, i, j, rows, columns = _to_ijv(
--> 115         ss, row_levels=row_levels, column_levels=column_levels, sort_labels=sort_labels)
    116     sparse_matrix = scipy.sparse.coo_matrix(
    117         (v, (i, j)), shape=(len(rows), len(columns)))

/home/travis/build/pydata/pandas/pandas/sparse/scipy_sparse.pyc in _to_ijv(ss, row_levels, column_levels, sort_labels)
     90
     91     i_coord, i_labels = get_indexers(row_levels)
---> 92     j_coord, j_labels = get_indexers(column_levels)
     93
     94     return values, i_coord, j_coord, i_labels, j_labels

/home/travis/build/pydata/pandas/pandas/sparse/scipy_sparse.pyc in get_indexers(levels)
     80
     81     labels_to_i = _get_index_subset_to_coord_dict(
---> 82         ss.index, levels, sort_labels=sort_labels)
     83     #######################################################################
     84     #######################################################################

/home/travis/build/pydata/pandas/pandas/sparse/scipy_sparse.pyc in _get_index_subset_to_coord_dict(index, subset, sort_labels)
     74         ilabels, sort_labels=sort_labels)
     75     labels_to_i = Series(labels_to_i)
---> 76     labels_to_i.index = MultiIndex.from_tuples(labels_to_i.index)
     77     labels_to_i.index.names = [index.names[i] for i in subset]
     78     labels_to_i.name = 'value'

/home/travis/build/pydata/pandas/pandas/core/index.pyc in from_tuples(cls, tuples, sortorder, names)
   3604             tuples = tuples.values
   3605
-> 3606         arrays = list(lib.tuples_to_object_array(tuples).T)
   3607         elif isinstance(tuples, list):
   3608             arrays = list(lib.to_object_array_tuples(tuples).T)

/home/travis/build/pydata/pandas/pandas/lib.so in pandas.lib.tuples_to_object_array (pandas/lib.c:54298)()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
<<<-------------------------------------------------------------------------

@cottrell
Contributor Author

cottrell commented Mar 4, 2015

@jorisvandenbossche

Yes, this is hopefully addressed here:

#9583

I was building docs only with Python 3, so I didn't catch this on my setup.

Labels: API Design, Sparse, Sparse Data Type

Successfully merging this pull request may close these issues.

pandas MultiIndex series "unstack" to scipy sparse array functionality
4 participants