
ENH: Native conversion from/to scipy.sparse matrix to SparseDataFrame #15497

Closed
wants to merge 2 commits into from

Conversation

@kernc commented Feb 24, 2017

kernc commented Feb 24, 2017

The tests require scipy; however, scipy is not a pandas dependency. What do?

jreback commented Feb 24, 2017

scipy is already included in many of the test runs e.g. https://github.com/pandas-dev/pandas/blob/master/ci/requirements-2.7.run

you need to use tm._skip_if_no_scipy in your tests (assuming you don't need a specific version)

in code it's a bit more tricky - I will comment on review in a bit

jreback commented Feb 24, 2017

see for example .interpolate where we dispatch to scipy (or raise an error if it's not installed) for some methods

@jreback added the labels Dtype Conversions, Enhancement, Sparse on Feb 24, 2017
jreback commented Feb 24, 2017

so this API is a bit at odds with what was done in SparseSeries.from_coo, meaning that we have a specific 'type' factory function. We don't actually follow this pattern anywhere else (e.g. even DataFrame.from_dict is just a small wrapper on the DataFrame constructor).

So for API consistency, I think it makes sense to change SparseSeries.from_coo in a similar way.

@wesm IIRC you had some thoughts w.r.t. this.

@jreback left a comment:

looks generally good, but some pandas style requirements (on the code) and methodologies


SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating SparseDataFrames directly from ``scipy.sparse.spmatrix`` instances. E.g.
Contributor:

add the issue number after instances. Remove the 'E.g.'.

sdf = pd.DataFrame(sp_arr)
sdf

To convert a SparseDataFrame back to scipy sparse matrix in COO format, you can use:
Contributor:
add this example to the sparse docs as well.

from pandas.util.decorators import Appender
import pandas.core.ops as ops

try:
Contributor:
so we don't want to do an import here, which will affect all of pandas.

  • create a function pandas.types.common.is_scipy_sparse

then can either:

  • do the import (and thus check isinstance), you may need to jump some hoops to avoid repeatedly trying this import (you make a _try_import_scipy_sparse function), then assign the results to a module level variable on first use. see in pandas.io.pytables how _tables() works.

  • OR, and maybe better, if scipy sparse matrices are duck-type inferable, IOW, if it's possible to check whether they have a couple of key attributes, then I would just do this.
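The lazy, memoized import check described in the bullets above can be sketched roughly as follows; the helper name and the `lambda` fallback are illustrative, not pandas' actual implementation:

```python
# Sketch of a lazily-imported, module-level-cached scipy check.
# If scipy is missing, a stub predicate that always returns False is
# cached instead, so the import is attempted at most once.
_is_scipy_sparse = None


def is_scipy_sparse(obj):
    """Return True if obj is a scipy.sparse matrix (False if scipy is absent)."""
    global _is_scipy_sparse
    if _is_scipy_sparse is None:
        try:
            # the import statement binds the *global* name declared above
            from scipy.sparse import issparse as _is_scipy_sparse
        except ImportError:
            _is_scipy_sparse = lambda x: False
    return _is_scipy_sparse(obj)


print(is_scipy_sparse([1, 2, 3]))  # False: a plain list is not sparse
```

Either way the check resolves on the first call and is a plain function call afterwards, so importing pandas itself never pays the scipy import cost.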

elif isinstance(data, BlockManager):
mgr = self._init_mgr(data, axes=dict(index=index, columns=columns),
dtype=dtype, copy=copy)
elif isinstance(data, spmatrix):
Contributor:
so this will change to is_scipy_sparse_matrix

@@ -175,6 +176,33 @@ def _init_dict(self, data, index, columns, dtype=None):

def _init_matrix(self, data, index, columns, dtype=None):
data = _prep_ndarray(data, copy=False)
Contributor:
can you add a 1-liner doc (what this function is doing)


data = dict([(idx, data[:, i]) for i, idx in enumerate(columns)])
return self._init_dict(data, index, columns, dtype)
def as_matrix(self, columns=None, sparse=False):
Contributor:
I understand why you are changing / adding this. BUT this is actually a major API change. This is what .values calls and as a result the return types are: dense numpy arrays / pandas extension types (categorical).

Not against this (in fact I think it's a good idea), but split this out into a separate issue. Then the impl would actually be done in the block manager.

def to_coo(self):
"""
Convert the frame to its SciPy sparse COO matrix representation.

Contributor:
add a versionadded

def test_from_to_scipy(self):
# GH 4343
try:
from scipy.sparse import csr_matrix
Contributor:
use tm._skip_if_no_scipy

for index, columns in ((list('abc'), list('def')),
(None, None)):
sdf = pd.SparseDataFrame(spm, index=index, columns=columns)

Contributor:
I would rather you actually construct the SDF directly and simply use tm.assert_sp_frame_equal which already does all of this.

Further check for various fill types as well.

I would write this using pytest.parametrize (as our new way of doing things)

return self._init_dict(data, index, columns, dtype)

def _init_spmatrix(self, data, index, columns, dtype=None):
index, columns = self._prep_index(data, index, columns)
Contributor:
1-liner here as well

wesm commented Feb 27, 2017

I don't have too many opinions on this feature

arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sp_arr
sdf = pd.DataFrame(sp_arr)
Member:
Should this be SparseDataFrame(sp_arr) instead of DataFrame?

@jorisvandenbossche (Member) commented:
Never use sparse so also not much opinion, but a question: there already exists a SparseSeries.from_coo method. Would it therefore make sense to have a SparseDataFrame.from_csr method instead of overloading the constructor?

jreback commented Feb 27, 2017

well @jorisvandenbossche that's exactly the question. just because we have .from_coo doesn't mean we should have specific constructors. In fact we don't for anything else. So IMHO we should just fold these into the constructor (IOW, leave this PR alone and deprecate and change .from_coo).

codecov-io commented Mar 1, 2017

Codecov Report

Merging #15497 into master will decrease coverage by 0.02%.
The diff coverage is 88.33%.

@@            Coverage Diff             @@
##           master   #15497      +/-   ##
==========================================
- Coverage   91.02%   90.99%   -0.03%     
==========================================
  Files         143      143              
  Lines       49324    49369      +45     
==========================================
+ Hits        44897    44925      +28     
- Misses       4427     4444      +17
Impacted Files Coverage Δ
pandas/sparse/array.py 91.46% <100%> (+0.04%)
pandas/util/testing.py 80.92% <40%> (-0.19%)
pandas/types/common.py 93.93% <75%> (-0.53%)
pandas/sparse/frame.py 96.68% <95.45%> (-0.01%)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/tools/merge.py 91.76% <0%> (-0.35%)
pandas/types/cast.py 85.45% <0%> (-0.21%)
pandas/core/frame.py 97.87% <0%> (-0.1%)
pandas/core/common.py 91.3% <0%> (+0.33%)


def is_scipy_sparse(array):
""" return if we are a scipy.sparse.spmatrix """
global _spmatrix
if _spmatrix is None:
Contributor Author:
Avoided importing in a separate function as this seemed concise enough.

I'm not aware of any features that would unambiguously discern scipy sparse matrices. We'd have to check for what we use later on, namely .shape and .tocoo(copy: bool) -> scipy.sparse.coo_matrix. Checking the type directly looks cleaner.
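For contrast, a duck-typed check along the lines discussed here, testing for the `.shape` and `.tocoo` attributes actually used later, might look like this; the function name and stub class are purely illustrative:

```python
def looks_like_scipy_sparse(obj):
    # Duck-type test: every scipy.sparse matrix exposes a .shape tuple
    # and a .tocoo() method, which is all the conversion code relies on.
    return hasattr(obj, "shape") and callable(getattr(obj, "tocoo", None))


class FakeSparse:
    """Stand-in for a scipy.sparse matrix, for demonstration only."""
    shape = (2, 3)

    def tocoo(self, copy=False):
        return self


print(looks_like_scipy_sparse(FakeSparse()))  # True
print(looks_like_scipy_sparse([1, 2, 3]))     # False: lists have no .tocoo
```

The downside, as noted, is that any object with those two attributes would pass, which is why checking the type directly was preferred.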

Contributor:
yeah ok that's fine

@@ -1118,6 +1120,28 @@ def test_isnotnull(self):
'B': [True, False, True, True, False]})
tm.assert_frame_equal(res.to_dense(), exp)

@pytest.mark.importorskip('scipy.sparse')
Contributor Author:
Used tm._skip_if_no_scipy() first, but then opted for this. Is that ok?

Contributor:
yes that's fine if on that test

Contributor:
actually on 2nd thought, use the tm._skip.... as we are consistently using that. (does the same thing)


A :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
Pandas supports creating sparse dataframes directly from ``scipy.sparse`` matrices.
Contributor Author:
Normally, versionadded parts are appended at the end, but I thought this prominent, hopefully non-experimental feature should be exposed at the top ...

Contributor:
no that's fine, just mention, starting in 0.20.0 (or can use a versionadded tag)

kernc commented Mar 1, 2017

So for API consistency, I think it makes sense to change SparseSeries.from_coo in a similar way.

What way is that? SparseSeries accepting 1×N sparse matrices in the constructor?

Would it make sense to have a SparseDataFrame.from_csr method instead of overloading the constructor?

There are other types of sparse matrices and while the user could be asked to convert to a specific format, why should they be forced to. Conversion from one sparse format to another is cheap (relative to any operation that makes anything dense). Several spmatrix methods do the conversion to the optimal type for the operation internally as needed. And spmatrix objects are still de facto sparse structures in this stack, so accepting them in the constructor would not be an unexpected functionality.

jreback commented Mar 1, 2017

There are other types of sparse matrices and while the user could be asked to convert to a specific format, why should they be forced to. Conversion from one sparse format to another is cheap (relative to any operation that makes anything dense). Several spmatrix methods do the conversion to the optimal type for the operation internally as needed. And spmatrix objects are still de facto sparse structures in this stack, so accepting them in the constructor would not be an unexpected functionality.

so the question is: do we have several .from_type_of_sparse_conversion routines, OR attempt to figure this out in the constructor?

I think it should be consistent 1 way or the other. (e.g. NO implicit construction in the constructor / maybe raise an error actually, with .from_* methods), or ALL constructions in the constructor.

kernc commented Mar 1, 2017

There are no relevant from_* methods on SparseDataFrame currently, only one on SparseSeries, which I don't yet have an opinion on, except that it seems a bit cumbersome and it's not a frame (ndim != 2) whereas a spmatrix is (ndim==2).

As is, we already support all scipy.sparse.spmatrix subclasses through the constructor, so individual per-type from_* methods are unnecessary. But I'm happy to add a thin factory method SparseDataFrame.from_scipy (or equivalent) that calls the constructor if the consensus is this importantly aids readability. (I'm actually 👎. SciPy sparse matrices are so ubiquitous in this stack that accepting them in the SparseDataFrame constructor is as natural as accepting ndarrays in DataFrame's.)

jreback commented Mar 1, 2017

As is, we already support all scipy.sparse.spmatrix subclasses through the constructor, so individual per-type from_* methods are unnecessary. But I'm happy to add a thin factory method SparseDataFrame.from_scipy (or equivalent) that calls the constructor if the consensus is this importantly aids readability. (I'm actually 👎. SciPy sparse matrices are so ubiquitous in this stack that accepting them in the SparseDataFrame constructor is as natural as accepting ndarrays in DataFrame's.)

@kernc so I think you answered my question. If we can unambiguously accept scipy sparse matrices w/o having to have a .from_* then I am all for it. So would then deprecate SparseSeries.from_coo, and move it inside the constructor instead.

kernc commented Mar 1, 2017

But deprecate SparseSeries.from_coo in favor of what? Accepting a 1×N scipy matrix in the constructor and making it a Series, and accepting a N×M scipy matrix and making it a Series indexed by a MultiIndex?

Basically, whatever but just accepting stuff in the constructor?

jreback commented Mar 1, 2017

But deprecate SparseSeries.from_coo in favor of what? Accepting a 1×N scipy matrix in the constructor and making it a Series, and accepting a N×M scipy matrix and making it a Series indexed by a MultiIndex?

Simply moving the functionality into the constructor itself (almost exactly as it is). This is for SparseSeries (and NOT SparseDataFrame)

IOW, something like

class SparseSeries(Series):

     ........

      if is_scipy_matrix(..):
           if is_coo_matrix(....):
                 # do what .from_coo does now
           else:
                raise NotImplementedError('on the todo list')

jreback commented Mar 1, 2017

@kernc of course you may still want these routines if you need to control the conversion process in some way; though we maybe could add those as optional params to the constructor themselves.


.. versionadded:: 0.16.0

Additionally, an experimental :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
Contributor:
you can drop experimental

@jreback left a comment:

looks pretty good. few comments.

.. _whatsnew_0200.enhancements.scipy_sparse:

SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Contributor:
add a ref to the new docs you created

SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances (:issue:`4343`).

Contributor:
mention if this doesn't work with particular scipy types (and in docs too)

def _init_spmatrix(self, data, index, columns, dtype=None):
""" Init self from scipy.sparse matrix """
index, columns = self._prep_index(data, index, columns)
data = data.tocoo(copy=False)
Contributor:
ahh I see, tricky! does this have untoward side effects? (e.g. for certain matrices it is bad, meaning it doesn't work or forces a copy?)

Contributor Author:
Works for any scipy.sparse.spmatrix. As per docs, it makes a copy if it needs to, but this is a copy of the sparse data, which is relatively cheap. Sparse matrix methods in SciPy do the conversion to the optimal type for the operation internally as needed, e.g. see coo_matrix.maximum, lil_matrix.getcol, dok_matrix.__mul__ via spmatrix._mul_sparse_matrix ...

I am noting the transparent conversion (and potential copy) in the docs.
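The point above, that `.tocoo()` only touches the stored sparse entries rather than densifying anything, can be checked with a small snippet (assuming scipy is installed; the matrix contents are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Build a CSR matrix with two stored values; converting it to COO format
# copies only those stored entries, never a dense 2x2 array.
spm = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
coo = spm.tocoo(copy=False)
print(coo.format, coo.shape, coo.nnz)  # coo (2, 2) 2
```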

Contributor:
ok that is fine

data = data.tocoo(copy=False)
N = len(index)
bindex = np.arange(N, dtype=np.int32)

Contributor:
add a 1-liner on what you are doing

return self._init_dict(data, index, columns, dtype)
def to_coo(self):
"""
Convert the frame to its SciPy sparse COO matrix representation.
Contributor:
does this require any special index? mention either way (and what happens to the index anyhow)? what about column names?

Contributor Author:
There is no index (either top or side). A simple 2-D spmatrix (ndim==2) matrix is returned, laid out as the frame was. It's like .values, but for retaining sparsity.

What do I write?

Contributor:
no, my point is that the index (of the DF) is discarded (well, maybe the column names are kept?). It's fine, just need to point that out.

Contributor Author:
SciPy matrices are indexed (rows and columns) sequentially 0..(N-1), 0..(M-1), like ndarrays. Obviously, none of the DF indices are retained on the coo_matrix, only the raw values in the same shape. I don't know what to write. 😊

Perhaps I should change "Convert the frame ..." to "Return a SciPy sparse matrix ..." as nothing happens in-place.

Contributor:
How does this compare to SparseSeries.to_coo then?

In [6]: >>> from numpy import nan
   ...: >>> s = Series([3.0, nan, 1.0, 3.0, nan, nan])
   ...: >>> s.index = MultiIndex.from_tuples([(1, 2, 'a', 0),
   ...:                                       (1, 2, 'a', 1),
   ...:                                       (1, 1, 'b', 0),
   ...:                                       (1, 1, 'b', 1),
   ...:                                       (2, 1, 'b', 0),
   ...:                                       (2, 1, 'b', 1)],
   ...:                                       names=['A', 'B', 'C', 'D'])
   ...: >>> ss = s.to_sparse()
   ...: 

In [7]: ss
Out[7]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)

In [8]: ss.to_coo(row_levels=['A', 'B'],column_levels=['C', 'D'],sort_labels=True)
Out[8]: 
(<3x4 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>,
 [(1, 1), (1, 2), (2, 1)],
 [('a', 0), ('a', 1), ('b', 0), ('b', 1)])

@kernc (Contributor Author) commented Mar 2, 2017:
Oh, at first glance, it's much simpler!

In your case:

>>> ss.to_coo()
---------------------------------------------------------------------
ValueError: Is not a partition because union is not the whole.

>>> # ???

With proposed (shown only for 2d case — as higher-d cases don't work):

>>> from scipy.sparse import coo_matrix
>>> sdf = pd.SparseDataFrame(coo_matrix([[0, 1, 0],
...                                      [1, 0, 0]]))

>>> sdf  # it is a dataframe
     0    1   2
0  NaN  1.0 NaN
1  1.0  NaN NaN

>>> sdf[0]  # behaves like a dataframe
0    NaN
1    1.0
Name: 0, dtype: float64
BlockIndex
Block locations: array([1], dtype=int32)
Block lengths: array([1], dtype=int32)

>>> sdf.to_coo()  # no complex arguments
<2x3 sparse matrix of type '<class 'numpy.float64'>'
	with 2 stored elements in COOrdinate format>

>>> # gives back the kind of 2d matrix it was given
>>> # no COOrdinate (multi)index computing by hand
>>> sdf[[0, 1]].to_coo().toarray()  # just works
array([[0, 1],
       [1, 0]])

That was the idea, at least. 😃

Contributor:
ok. btw. you don't necessarily need to touch SparseSeries.to_coo (or, as I mentioned before, .from_coo) in this PR (of course it's somewhat better / more consistent). More that we have a plan on what to do.


Notes
-----
The dtype will be the lowest-common-denominator type (implicit
Contributor:
mention that this is efficient (is it? does it copy? and if so what), mainly nice to have docs about this (and could refer main docs here as well).

Contributor Author:
As efficient as ndarray.astype() and np.concatenate() are. Iterated on the number of columns as that was the easiest for me to understand. Doesn't make an unnecessary copy since copy=False was added just now. Thanks.

What do you mean by referring main docs?

Contributor:
you can put a link in the doc-string itself to something in the main docs (a url). If you want to point to something more involved (this is an option).

Contributor Author:
Dunno. I just copied these Notes from NDFrame.values, adapting a slight bit.

@@ -1118,6 +1120,28 @@ def test_isnotnull(self):
'B': [True, False, True, True, False]})
tm.assert_frame_equal(res.to_dense(), exp)

@pytest.mark.parameterize('index', [None, list('abc')])
Contributor:
nice!

@kernc (Contributor Author) commented Mar 2, 2017:
Not nice at all! Doesn't work! Without arguments' default values (...=None), all I'm getting is:

TypeError: test_from_to_scipy() missing 3 required positional arguments: 'index', 'columns', and 'fill_value'

(With default values (...=None), those are the only values ever tested with, i.e. no param matrix.)

Short of reading the whole of the pytest docs, any clue as to what I'm doing wrong?

Contributor:
no, what you need to do is create this as a function (and not put it in the class). Parametrization doesn't work if the test lives in a class Foo(tm.TestCase), which is how we start all of the classes. So easiest to make a stand-alone function. You can change it to class Foo(object), but then all of the self.assert* methods have to be changed to pytest style.

I have done that in a few cases.
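A minimal sketch of the standalone-function style suggested here (function body and parameter values are illustrative): `pytest.mark.parametrize` attaches to plain functions, whereas `unittest.TestCase` methods are collected differently and ignore the parameters.

```python
import pytest


@pytest.mark.parametrize("index", [None, list("abc")])
@pytest.mark.parametrize("fill_value", [None, 0])
def test_roundtrip(index, fill_value):
    # a real test would construct the frame under test here; this stub just
    # shows that every parameter combination reaches the function body
    assert index in (None, list("abc"))
    assert fill_value in (None, 0)


# The decorator leaves the underlying function callable as-is:
test_roundtrip(index=None, fill_value=0)
```

Under pytest collection this single function expands into one test per combination of the stacked parameter lists (2 x 2 = 4 tests here).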

Contributor Author:
Right, got it. I guess it can't be a method on a unittest.TestCase instance.

def test_from_to_scipy(self, index=None, columns=None, fill_value=None):
# GH 4343
tm._skip_if_no_scipy()
from scipy.sparse import csr_matrix
Contributor:
can you also make a test that exercises all of the scipy input matrix types? (you may want to make these as fixtures for convenience)

@@ -59,6 +59,21 @@ def is_sparse(array):
return isinstance(array, (ABCSparseArray, ABCSparseSeries))


# oh the troubles to reduce import time
Contributor:
put at the top (the _spmatrix / comment)

if _spmatrix is None:
try:
from scipy.sparse import spmatrix as _spmatrix
except ImportError:
Contributor:
can you add a test for this (pandas.types.tests.inference) I think.

jreback commented Mar 1, 2017

you'll want to rebase on master as travis was having some issues

@kernc kernc force-pushed the scipy-sparse branch 2 times, most recently from 166b3bb to 2ddc7d6 Compare March 6, 2017 14:59
kernc commented Mar 6, 2017

Please give it another look. Besides addressing all comments, I made two major changes since the last iteration, namely:

  • checking (testing) that dtypes are preserved when possible,
  • the default_fill_value, when unspecified, is set to 0 when initialized with a spmatrix, as this is the implicit missing value in scipy sparse matrices.

mask = np.not_equal(arr, fill_value)
# In NumPy 1.12.0, not implemented at least for arr.dtype=str
if mask is NotImplemented:
mask = np.not_equal(arr.astype(object), fill_value)
@kernc (Contributor Author) commented Mar 6, 2017:
I needed to change this bit in order to support adding a str (object) column to a SparseDataFrame inited with default_fill_value=0. See here: https://github.com/pandas-dev/pandas/pull/15497/files#diff-b30cded911d05ab8f4b2ab86c7dcab58R1169
Otherwise it would crash with some uninformative and unexpected error.

Contributor:
don't use ``np.not_equal``. What are you trying to do here?

Contributor:
FYI #15587 is going to validate fill values (but of course let's fix here for what you need).

isinstance(fill_value, compat.string_types) is idiom for checking string-likes (if its a scalar)

Contributor Author:
Using np.not_equal() as an alternative to !=. Operator != returns False comparing arr when dtype=str and 0, issuing a FutureWarning that eventually, comparison will be element-wise. np.not_equal() doesn't issue such a warning but returns NotImplemented. One works with what one has. 😃

So if I just want element-wise comparison every time, regardless of dtype, I'd better do:

if pd.types.common.is_string_dtype(arr):
   arr = arr.astype(object)

mask = arr != fill_value
...

?

Contributor:
no, we don't use not_equal at all; it in fact doesn't work on object arrays (well, maybe in 1.12 it does, but it's not compat).

use pandas.types.missing.array_equivalent if you really need it, though isnull itself already calls this.

again, what are you trying to do: compare an array with a scalar?

Contributor Author:
Well, the mask = arr != fill_value test was there before, since 2011. And this test fails when arr.dtype.kind == 'U' and type(fill_value) != str (i.e. comparing a vector of fixed-length strings to a non-string). Or at least mask gets set to scalar False instead of a vector of boolean values, which then fails a few lines later on. It breaks on this test line, which, in my opinion, should generally pass.

Contributor:
yeah, the setting to scalar False is the problem here. you'd expect an array, but I think numpy is broken for this (we had a similar issue in groupby.filter). in any event just do what you need to pass and I'll take a look.

Contributor Author:
# element-wise but just returns False if fill_value is not str,
# so cast to object comparison to be safe
if is_string_dtype(arr):
arr = arr.astype(object)
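The behavior being worked around can be exercised with numpy alone; the array contents here are illustrative:

```python
import numpy as np

arr = np.array(["a", "b", "c"])  # fixed-width unicode dtype ('<U1')
fill_value = 0                   # non-string fill value

# Comparing the unicode array to an int directly may yield a scalar
# False / NotImplemented instead of a mask; casting to object forces
# element-wise Python comparisons, which always produce a boolean array.
mask = arr.astype(object) != fill_value
print(mask.tolist())  # [True, True, True]
```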
Contributor Author:
Perhaps the conversion should happen sooner, but certainly no later. Given the dtype-preserving behavior in SparseArray constructor, I doubt any sooner either.

@@ -59,6 +62,17 @@ def is_sparse(array):
return isinstance(array, (ABCSparseArray, ABCSparseSeries))


def is_scipy_sparse(array):
""" return if we are a scipy.sparse.spmatrix """
@jreback (Contributor) commented Mar 6, 2017:
this will keep trying to import scipy each time that it is called if scipy can not be imported. You have to set _is_scipy_sparse to something (rather than None) if its None to start, so that it is always set here.

easiest would be to do

_is_scipy_sparse = lambda x: False so this will just work. (in the except)

@jreback (Contributor) commented Mar 6, 2017:
The other thing is, shouldn't this be something like the following (otherwise it hides things, I think):

if _is_scipy_sparse is None:

    try:
        from scipy.sparse import issparse
    except ImportError:
        issparse = lambda x: False
    _is_scipy_sparse = issparse

return _is_scipy_sparse(array)

Contributor Author:

Found it here: https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Import_Statement_Overhead Was worried about the same thing, but have tested in Py2 and Py3 that the global is indeed set correctly.

Contributor Author:

Try:

>>> def f():
...     global aaa
...     import sys as aaa

>>> f()
>>> aaa
<module 'sys' ...>

Contributor:

Didn't know that. It makes sense, but it's non-obvious.

@@ -77,29 +79,26 @@ def __init__(self, data=None, index=None, columns=None, default_kind=None,
data = {columns[0]: data}

if default_fill_value is None:
default_fill_value = np.nan
default_fill_value = 0 if is_scipy_sparse(data) else np.nan
Contributor:

why is this?

@kernc (Contributor Author), Mar 8, 2017:

Well, SciPy sparse structures imply missing values are zeros. It felt like a sane default, but really it's better to keep the default nan. At least that way the user can decide what fill values they want. (And I don't know how to "unfill" a sparse frame that has been fillna-d.)

def spmatrix(request):
tm._skip_if_no_scipy()
from scipy import sparse
yield getattr(sparse, request.param + '_matrix')
Contributor:

You don't use `yield` here; just return the result. If this were a yield fixture you would use `@pytest.yield_fixture`, but there's no reason to use one, as you don't need any cleanup actions.

@pytest.mark.parametrize('index', [None, list('abc')])
@pytest.mark.parametrize('columns', [None, list('def')])
@pytest.mark.parametrize('fill_value', [None, 0, np.nan])
def test_from_to_scipy(spmatrix, index, columns, fill_value):
Contributor:

I would parametrize over dtypes too (all ints, uints, supported floats, object, bool).

Contributor Author:

For the record, this makes a total of 1176 tests, running some 20s on a very fast machine.

Contributor:

Woosh! OK if you want to cut it down a bit (also OK to leave this). Our test suite takes something like 400s to run (on 2 processors) anyhow.

Contributor Author:

Can trim to [object, bool, int, float, np.uint8, np.float16] (504 tests) and assume the rest work.

Contributor Author:

On the other hand, it was quite a pain to account for everything! But I think it's the generic way I'm testing now that avoids the problems I was encountering at first.

Contributor:

So you can leave out float16; we don't test that anywhere anyhow (and barely support it).

The rest look fine, though your comparisons may be slightly tricky, as `int` is platform-dependent.

if isinstance(fill_value, float):
arr = arr.astype(float)
arr[arr == 0] = np.nan
# If fill_value is None, set it to 0 for ndarray as value 0 is also the
Contributor:

blank line

tm.assert_frame_equal(sdf.to_dense(), res.to_dense())
tm.assert_numpy_array_equal(res.values, arr)

# Assert spmatrices equal (operator!= not supported in scipy 0.8.0)
Contributor:

What does this comment mean? That's a really old version.

Contributor Author:

Yet that version complains in CircleCI-run tests.

Contributor:

what? we don't install anything lower than 0.11

_ensure_int32,
_ensure_categorical)
from pandas.types.missing import isnull
from pandas.util import testing as tm

from pandas.tests.sparse.test_frame import spmatrix # noqa: F401
Contributor:

maybe just put spmatrix in pandas/tests/sparse/common.py (I would use conftest.py but not sure if these are picked up correctly)

Contributor Author:

Right, even with this change, lint complains spmatrix is imported but unused.

@@ -946,6 +949,12 @@ def test_nan_to_nat_conversions():
assert (s[8].value == np.datetime64('NaT').astype(np.int64))


def test_is_scipy_sparse(spmatrix): # noqa: F811
Contributor:

why do you need the noqa on these?

Contributor Author:

Here, lint complains: `F811 redefinition of unused 'spmatrix' from line 39`.

Contributor:

Hmm, yeah, I guess it is "removed" via the pytest mangling steps. Odd. OK then.

Contributor Author:

That's the thing with DSLs. Everyone needs to learn them, including linters. 😃


.. versionadded:: 0.16.0

Additionally, a :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
Contributor:

can you show an example of SparseSeries.to_coo() (maybe just slice sdf)

Contributor Author:

These lines are not an addition, just slightly reworded from before. Usage examples abound below. I'd rather not make my own here, as `series.to_coo()` currently doesn't work without arguments, which, to be honest, I'm having some trouble grasping.


SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances (:issue:`4343`). See the :ref:`documentation <sparse.scipysparse>` for more information.
Contributor:

can you put the issue number at the end of this paragraph (after information)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances (:issue:`4343`). See the :ref:`documentation <sparse.scipysparse>` for more information.

All sparse formats are supported, but matrices that aren't in *COOrdinate* format will be converted to it, copying the data as needed.
Contributor:

can you add a link directly to the sparse formats where COOrdinate is? (make that the text)

Contributor Author:

Linked to the sparse module page, where `coo_matrix` is about the fifth word vertically and COOrdinate is approximately centered on the screen.
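For readers landing here without a scipy background: COO (COOrdinate) format stores three parallel sequences of row indices, column indices, and values for the explicitly stored entries. A minimal pure-Python illustration of the round trip (a sketch of the idea, not scipy's implementation):

```python
def dense_to_coo(dense, fill_value=0):
    """Collect (rows, cols, data) triplet arrays for entries != fill_value."""
    rows, cols, data = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != fill_value:
                rows.append(i)
                cols.append(j)
                data.append(value)
    return rows, cols, data


def coo_to_dense(shape, rows, cols, data, fill_value=0):
    """Rebuild a dense nested list from COO triplet arrays."""
    out = [[fill_value] * shape[1] for _ in range(shape[0])]
    for i, j, value in zip(rows, cols, data):
        out[i][j] = value
    return out
```

For example, `dense_to_coo([[0, 1], [2, 0]])` yields `([0, 1], [1, 0], [1, 2])`, and `coo_to_dense` inverts it.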

sdf = pd.SparseDataFrame(sp_arr)
sdf

All sparse formats are supported, but matrices that aren't in *COOrdinate* format will be converted to it, copying the data as needed. To convert a ``SparseDataFrame`` back to sparse SciPy matrix in COO format, you can use :meth:`SparseDataFrame.to_coo` method:
Contributor:

same link (see below) here.

@@ -25,6 +25,7 @@
create_block_manager_from_arrays)
import pandas.core.generic as generic
from pandas.sparse.series import SparseSeries, SparseArray
from pandas._sparse import BlockIndex, get_blocks
Contributor:

from pandas.sparse.libsparse import BlockIndex, get_blocks (just changed)

dtype)
tm.assert_contains_all(sdf.dtypes, {np.dtype(res_dtype)})
tm.assert_equal(sdf.to_coo().dtype, res_dtype)

Contributor:

k this seems reasonable

was_upcast = ((fill_value is None or is_float(fill_value)) and
not is_object_dtype(dtype) and
not is_float_dtype(dtype))
res_dtype = (bool if pd.types.common.is_bool_dtype(dtype) else
Contributor:

you can import this up top

jreback (Contributor) commented Mar 8, 2017:

Just some minor comments, actually. LGTM. Ping when all green and I think we can merge. Will need a follow-up to fix `SparseSeries` anyhow (to accept this in the constructor), so we can catch any fixes there.

kernc (Contributor Author) commented Mar 10, 2017:

So this is green and kinda works. Feel free to edit at will.

jreback (Contributor) commented Mar 10, 2017:

@kernc will have a look tomorrow and merge.

@jreback jreback added this to the 0.20.0 milestone Mar 10, 2017
jreback (Contributor) commented Mar 10, 2017:

@sinhrks comments?

@jreback jreback closed this in 1be66ac Mar 10, 2017
jreback (Contributor) commented Mar 10, 2017:

thanks @kernc great PR!

love for you to take #15634 !

jreback (Contributor) commented Mar 10, 2017:

@kernc not a big deal but https://travis-ci.org/pandas-dev/pandas/jobs/209914478

.../home/travis/miniconda3/envs/pandas/lib/python3.5/site-packages/scipy/sparse/sputils.py:114: UserWarning: object dtype is not supported by sparse matrices
  warnings.warn("object dtype is not supported by sparse matrices")

jreback (Contributor) commented Mar 10, 2017:

should prob skip that test then (or catch warning)

kernc (Contributor Author) commented Mar 11, 2017:

What seem not to be supported are ufuncs and binops on them. Right, will skip it in #15634 if it can wait till next week.
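For the "catch warning" option mentioned above: a small stdlib-only sketch of suppressing scipy's `UserWarning` around a call (the helper name here is made up for illustration):

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Invoke `func` with UserWarnings suppressed, e.g. scipy's
    "object dtype is not supported by sparse matrices" warning."""
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', UserWarning)
        return func(*args, **kwargs)
```

In a pytest suite, decorating the test with `@pytest.mark.filterwarnings('ignore::UserWarning')` would achieve the same thing without a helper.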

jreback (Contributor) commented Mar 11, 2017:

np thanks

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#4343

Author: Kernc <[email protected]>

Closes pandas-dev#15497 from kernc/scipy-sparse and squashes the following commits:

a0f2208 [Kernc] DOC: Fix some whatsnew/v0.20.0.txt sphinx warnings
e72e594 [Kernc] ENH: Native conversion from/to scipy.sparse matrix to SparseDataFrame
jorisvandenbossche (Member) commented Oct 23, 2018:

@kernc didn't go through the full reviewing discussion here, but was it ever discussed what the default fill value should be when converting a scipy sparse matrix to SparseDataFrame?

Because now, it actually "changes" your data due to the default fill value of NaN:

In [27]: a = scipy.sparse.coo_matrix([[0., 1.], [2., 0.]])

In [28]: a.toarray()
Out[28]: 
array([[ 0.,  1.],
       [ 2.,  0.]])

In [29]: df = pd.SparseDataFrame(a)

In [30]: df
Out[30]: 
     0    1
0  NaN  1.0
1  2.0  NaN

In [31]: np.array(df)
Out[31]: 
array([[ nan,   1.],
       [  2.,  nan]])

I would expect the SparseDataFrame constructor to respect the implicit fill value of a scipy sparse matrix, which is 0.
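The point generalizes beyond pandas: densifying with a NaN fill value rewrites the zeros that scipy's formats leave implicit. A pure-Python sketch (`densify` is a hypothetical helper for illustration, not pandas API):

```python
import math


def densify(shape, triplets, fill_value):
    """Expand COO-style (row, col, value) triplets into a nested list,
    writing `fill_value` wherever no entry is explicitly stored."""
    out = [[fill_value] * shape[1] for _ in range(shape[0])]
    for i, j, value in triplets:
        out[i][j] = value
    return out


triplets = [(0, 1, 1.0), (1, 0, 2.0)]  # all that the sparse format stores
faithful = densify((2, 2), triplets, fill_value=0.0)    # [[0.0, 1.0], [2.0, 0.0]]
lossy = densify((2, 2), triplets, fill_value=math.nan)  # NaN where the data had 0.0
```

With `fill_value=0.0` the original matrix is recovered exactly; with `fill_value=math.nan` the implicit zeros come back as NaN, which is the "changed data" shown in the session above.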

Labels: Dtype Conversions (unexpected or buggy dtype conversions), Enhancement, Sparse (sparse data type)

Successfully merging this pull request may close these issues:

ENH: native sparse conversion from csc/csr scipy sparse matrix to DataFrame