
ENH: Native conversion from/to scipy.sparse matrix to SparseDataFrame #15497

Closed
wants to merge 2 commits into from

Conversation

@kernc commented Feb 24, 2017

kernc commented Feb 24, 2017

The tests require scipy; however, scipy is not a pandas dependency. What do?

jreback commented Feb 24, 2017

scipy is already included in many of the test runs e.g. https://github.com/pandas-dev/pandas/blob/master/ci/requirements-2.7.run

you need to use tm._skip_if_no_scipy in your tests (assuming you don't need a specific version)

in code it's a bit more tricky - I will comment on review in a bit

jreback commented Feb 24, 2017

see for example .interpolate where we dispatch to scipy (or raise an error if it's not installed) for some methods

@jreback added the labels Dtype Conversions, Enhancement, Sparse on Feb 24, 2017
jreback commented Feb 24, 2017

so this API is a bit at odds with what was done in SparseSeries.from_coo, meaning that we have a specific 'type' factory function. We don't actually follow this pattern anywhere else (e.g. even DataFrame.from_dict is just a small wrapper on the DataFrame constructor).

So for API consistency, I think it makes sense to change SparseSeries.from_coo in a similar way.

@wesm IIRC you had some thoughts w.r.t. this.

@jreback left a comment:

looks generally good, but some pandas style requirements (on the code) and methodologies


SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating SparseDataFrames directly from ``scipy.sparse.spmatrix`` instances. E.g.
Contributor:

add the issue number after instances. Remove the 'E.g.'.

sdf = pd.DataFrame(sp_arr)
sdf

To convert a SparseDataFrame back to scipy sparse matrix in COO format, you can use:
Contributor:
add this example to the sparse docs as well.

from pandas.util.decorators import Appender
import pandas.core.ops as ops

try:
Contributor:
so we don't want to do an import here, which will affect all of pandas.

  • create a function pandas.types.common.is_scipy_sparse

then can either:

  • do the import (and thus check isinstance), you may need to jump some hoops to avoid repeatedly trying this import (you make a _try_import_scipy_sparse function), then assign the results to a module level variable on first use. see in pandas.io.pytables how _tables() works.

  • OR, and maybe better, if scipy sparse matrices are duck-type inferable, IOW, if it's possible to check whether they have a couple of key attributes, then I would just do this.
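The lazy, memoized import check described in the bullets above can be sketched roughly as follows; the helper name and the `lambda` fallback are illustrative, not pandas' actual implementation:

```python
# Sketch of a lazily-imported, module-level-cached scipy check.
# If scipy is missing, a stub predicate that always returns False is
# cached instead, so the import is attempted at most once.
_is_scipy_sparse = None


def is_scipy_sparse(obj):
    """Return True if obj is a scipy.sparse matrix (False if scipy is absent)."""
    global _is_scipy_sparse
    if _is_scipy_sparse is None:
        try:
            # the import statement binds the *global* name declared above
            from scipy.sparse import issparse as _is_scipy_sparse
        except ImportError:
            _is_scipy_sparse = lambda x: False
    return _is_scipy_sparse(obj)


print(is_scipy_sparse([1, 2, 3]))  # False: a plain list is not sparse
```

Either way the check resolves on the first call and is a plain function call afterwards, so importing pandas itself never pays the scipy import cost.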

elif isinstance(data, BlockManager):
mgr = self._init_mgr(data, axes=dict(index=index, columns=columns),
dtype=dtype, copy=copy)
elif isinstance(data, spmatrix):
Contributor:
so this will change to is_scipy_sparse_matrix

@@ -175,6 +176,33 @@ def _init_dict(self, data, index, columns, dtype=None):

def _init_matrix(self, data, index, columns, dtype=None):
data = _prep_ndarray(data, copy=False)
Contributor:
can you add a 1-liner doc (what this function is doing)


data = dict([(idx, data[:, i]) for i, idx in enumerate(columns)])
return self._init_dict(data, index, columns, dtype)
def as_matrix(self, columns=None, sparse=False):
Contributor:
I understand why you are changing / adding this. BUT this is actually a major API change. This is what .values calls and as a result the return types are: dense numpy arrays / pandas extension types (categorical).

Not against this (in fact I think it's a good idea), but split this out into a separate issue. Then the impl would actually be done in the block manager.

def to_coo(self):
"""
Convert the frame to its SciPy sparse COO matrix representation.

Contributor:
add a versionadded

def test_from_to_scipy(self):
# GH 4343
try:
from scipy.sparse import csr_matrix
Contributor:
use tm._skip_if_no_scipy

for index, columns in ((list('abc'), list('def')),
(None, None)):
sdf = pd.SparseDataFrame(spm, index=index, columns=columns)

Contributor:
I would rather you actually construct the SDF directly and simply use tm.assert_sp_frame_equal which already does all of this.

Further check for various fill types as well.

I would write this using pytest.parametrize (as our new way of doing things)

return self._init_dict(data, index, columns, dtype)

def _init_spmatrix(self, data, index, columns, dtype=None):
index, columns = self._prep_index(data, index, columns)
Contributor:
1-liner here as well

wesm commented Feb 27, 2017

I don't have too many opinions on this feature

arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sp_arr
sdf = pd.DataFrame(sp_arr)
Member:
Should this be SparseDataFrame(sp_arr) instead of DataFrame?

@jorisvandenbossche (Member) commented:
Never use sparse so also not much opinion, but a question: there already exists a SparseSeries.from_coo method. Would it therefore make sense to have a SparseDataFrame.from_csr method instead of overloading the constructor?

jreback commented Feb 27, 2017

well @jorisvandenbossche that's exactly the question. just because we have .from_coo doesn't mean we should have specific constructors. In fact we don't for anything else. So IMHO we should just fold these into the constructor (IOW, leave this PR alone and deprecate and change .from_coo).

codecov-io commented Mar 1, 2017

Codecov Report

Merging #15497 into master will decrease coverage by 0.02%.
The diff coverage is 88.33%.

@@            Coverage Diff             @@
##           master   #15497      +/-   ##
==========================================
- Coverage   91.02%   90.99%   -0.03%     
==========================================
  Files         143      143              
  Lines       49324    49369      +45     
==========================================
+ Hits        44897    44925      +28     
- Misses       4427     4444      +17
Impacted Files Coverage Δ
pandas/sparse/array.py 91.46% <100%> (+0.04%)
pandas/util/testing.py 80.92% <40%> (-0.19%)
pandas/types/common.py 93.93% <75%> (-0.53%)
pandas/sparse/frame.py 96.68% <95.45%> (-0.01%)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/tools/merge.py 91.76% <0%> (-0.35%)
pandas/types/cast.py 85.45% <0%> (-0.21%)
pandas/core/frame.py 97.87% <0%> (-0.1%)
pandas/core/common.py 91.3% <0%> (+0.33%)


def is_scipy_sparse(array):
""" return if we are a scipy.sparse.spmatrix """
global _spmatrix
if _spmatrix is None:
Contributor Author:
Avoided importing in a separate function as this seemed concise enough.

I'm not aware of any features that would unambiguously discern scipy sparse matrices. We'd have to check for what we use later on, namely .shape and .tocoo(copy: bool) -> scipy.sparse.coo_matrix. Checking the type directly looks cleaner.
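For contrast, a duck-typed check along the lines discussed here, testing for the `.shape` and `.tocoo` attributes actually used later, might look like this; the function name and stub class are purely illustrative:

```python
def looks_like_scipy_sparse(obj):
    # Duck-type test: every scipy.sparse matrix exposes a .shape tuple
    # and a .tocoo() method, which is all the conversion code relies on.
    return hasattr(obj, "shape") and callable(getattr(obj, "tocoo", None))


class FakeSparse:
    """Stand-in for a scipy.sparse matrix, for demonstration only."""
    shape = (2, 3)

    def tocoo(self, copy=False):
        return self


print(looks_like_scipy_sparse(FakeSparse()))  # True
print(looks_like_scipy_sparse([1, 2, 3]))     # False: lists have no .tocoo
```

The downside, as noted, is that any object with those two attributes would pass, which is why checking the type directly was preferred.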

Contributor:
yeah ok that's fine

@@ -1118,6 +1120,28 @@ def test_isnotnull(self):
'B': [True, False, True, True, False]})
tm.assert_frame_equal(res.to_dense(), exp)

@pytest.mark.importorskip('scipy.sparse')
Contributor Author:
Used tm._skip_if_no_scipy() first, but then opted for this. Is that ok?

Contributor:
yes that's fine if on that test

Contributor:
actually on 2nd thought, use the tm._skip.... as we are consistently using that. (does the same thing)


A :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
Pandas supports creating sparse dataframes directly from ``scipy.sparse`` matrices.
Contributor Author:
Normally, versionadded parts are appended at the end, but I thought this prominent, hopefully non-experimental feature should be exposed at the top ...

Contributor:
no that's fine, just mention, starting in 0.20.0 (or can use a versionadded tag)

kernc commented Mar 1, 2017

So for API consistency, I think it makes sense to change SparseSeries.from_coo in a similar way.

What way is that? SparseSeries accepting 1×N sparse matrices in the constructor?

Would it make sense to have a SparseDataFrame.from_csr method instead of overloading the constructor?

There are other types of sparse matrices and while the user could be asked to convert to a specific format, why should they be forced to. Conversion from one sparse format to another is cheap (relative to any operation that makes anything dense). Several spmatrix methods do the conversion to the optimal type for the operation internally as needed. And spmatrix objects are still de facto sparse structures in this stack, so accepting them in the constructor would not be an unexpected functionality.

jreback commented Mar 1, 2017

There are other types of sparse matrices and while the user could be asked to convert to a specific format, why should they be forced to. Conversion from one sparse format to another is cheap (relative to any operation that makes anything dense). Several spmatrix methods do the conversion to the optimal type for the operation internally as needed. And spmatrix objects are still de facto sparse structures in this stack, so accepting them in the constructor would not be an unexpected functionality.

so the question is: do we have several .from_type_of_sparse_conversion routines, OR attempt to figure this out in the constructor?

I think it should be consistent 1 way or the other. (e.g. NO implicit construction in the constructor / maybe raise an error actually, with .from_* methods), or ALL constructions in the constructor.

kernc commented Mar 1, 2017

There are no relevant from_* methods on SparseDataFrame currently, only one on SparseSeries, which I don't yet have an opinion on, except that it seems a bit cumbersome and it's not a frame (ndim != 2) whereas a spmatrix is (ndim==2).

As is, we already support all scipy.sparse.spmatrix subclasses through the constructor, so individual per-type from_* methods are unnecessary. But I'm happy to add a thin factory method SparseDataFrame.from_scipy (or equivalent) that calls the constructor if the consensus is this importantly aids readability. (I'm actually 👎. SciPy sparse matrices are so ubiquitous in this stack that accepting them in the SparseDataFrame constructor is as natural as accepting ndarrays in DataFrame's.)

jreback commented Mar 1, 2017

As is, we already support all scipy.sparse.spmatrix subclasses through the constructor, so individual per-type from_* methods are unnecessary. But I'm happy to add a thin factory method SparseDataFrame.from_scipy (or equivalent) that calls the constructor if the consensus is this importantly aids readability. (I'm actually 👎. SciPy sparse matrices are so ubiquitous in this stack that accepting them in the SparseDataFrame constructor is as natural as accepting ndarrays in DataFrame's.)

@kernc so I think you answered my question. If we can unambiguously accept scipy sparse matrices w/o having to have a .from_* then I am all for it. So would then deprecate SparseSeries.from_coo, and move it inside the constructor instead.

kernc commented Mar 1, 2017

But deprecate SparseSeries.from_coo in favor of what? Accepting a 1×N scipy matrix in the constructor and making it a Series, and accepting a N×M scipy matrix and making it a Series indexed by a MultiIndex?

Basically, whatever but just accepting stuff in the constructor?

jreback commented Mar 1, 2017

But deprecate SparseSeries.from_coo in favor of what? Accepting a 1×N scipy matrix in the constructor and making it a Series, and accepting a N×M scipy matrix and making it a Series indexed by a MultiIndex?

Simply moving the functionality into the constructor itself (almost exactly as it is). This is for SparseSeries (and NOT SparseDataFrame)

IOW, something like

class SparseSeries(Series):

     ........

      if is_scipy_matrix(..):
           if is_coo_matrix(....):
                 # do what .from_coo does now
           else:
                raise NotImplementedError('on the todo list')

jreback commented Mar 1, 2017

@kernc of course you may still want these routines if you need to control the conversion process in some way; though we maybe could add those as optional params to the constructor themselves.


.. versionadded:: 0.16.0

Additionally, an experimental :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
Contributor:
you can drop experimental

@jreback left a comment:

looks pretty good. few comments.

.. _whatsnew_0200.enhancements.scipy_sparse:

SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Contributor:
add a ref to the new docs you created

SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances (:issue:`4343`).

Contributor:
mention if this doesn't work with particular scipy types (and in docs too)

def _init_spmatrix(self, data, index, columns, dtype=None):
""" Init self from scipy.sparse matrix """
index, columns = self._prep_index(data, index, columns)
data = data.tocoo(copy=False)
Contributor:
ahh I see, tricky! does this have untoward side effects? (e.g. for certain matrices it is bad, meaning it doesn't work or forces a copy?)

Contributor Author:
Works for any scipy.sparse.spmatrix. As per docs, it makes a copy if it needs to, but this is a copy of the sparse data, which is relatively cheap. Sparse matrix methods in SciPy do the conversion to the optimal type for the operation internally as needed, e.g. see coo_matrix.maximum, lil_matrix.getcol, dok_matrix.__mul__ via spmatrix._mul_sparse_matrix ...

I am noting the transparent conversion (and potential copy) in the docs.
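The point above, that `.tocoo()` only touches the stored sparse entries rather than densifying anything, can be checked with a small snippet (assuming scipy is installed; the matrix contents are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Build a CSR matrix with two stored values; converting it to COO format
# copies only those stored entries, never a dense 2x2 array.
spm = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
coo = spm.tocoo(copy=False)
print(coo.format, coo.shape, coo.nnz)  # coo (2, 2) 2
```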

Contributor:
ok that is fine

data = data.tocoo(copy=False)
N = len(index)
bindex = np.arange(N, dtype=np.int32)

Contributor:
add a 1-liner on what you are doing

return self._init_dict(data, index, columns, dtype)
def to_coo(self):
"""
Convert the frame to its SciPy sparse COO matrix representation.
Contributor:
does this require any special index? mention either way (and what happens to the index anyhow)? what about column names?

Contributor Author:
There is no index (either top or side). A simple 2-D spmatrix (ndim==2) matrix is returned, laid out as the frame was. It's like .values, but for retaining sparsity.

What do I write?

Contributor:
no, my point is that the index (of the DF) is discarded (well, maybe the column names are kept?). It's fine, just need to point that out.

Contributor Author:
SciPy matrices are indexed (rows and columns) sequentially 0..(N-1), 0..(M-1), like ndarrays. Obviously, none of the DF indices are retained on the coo_matrix, only the raw values in the same shape. I don't know what to write. 😊

Perhaps I should change "Convert the frame ..." to "Return a SciPy sparse matrix ..." as nothing happens in-place.

Contributor:
How does this compare to SparseSeries.to_coo then?

In [6]: >>> from numpy import nan
   ...: >>> s = Series([3.0, nan, 1.0, 3.0, nan, nan])
   ...: >>> s.index = MultiIndex.from_tuples([(1, 2, 'a', 0),
   ...:                                       (1, 2, 'a', 1),
   ...:                                       (1, 1, 'b', 0),
   ...:                                       (1, 1, 'b', 1),
   ...:                                       (2, 1, 'b', 0),
   ...:                                       (2, 1, 'b', 1)],
   ...:                                       names=['A', 'B', 'C', 'D'])
   ...: >>> ss = s.to_sparse()
   ...: 

In [7]: ss
Out[7]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)

In [8]: ss.to_coo(row_levels=['A', 'B'],column_levels=['C', 'D'],sort_labels=True)
Out[8]: 
(<3x4 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>,
 [(1, 1), (1, 2), (2, 1)],
 [('a', 0), ('a', 1), ('b', 0), ('b', 1)])

@kernc (Contributor Author) commented Mar 2, 2017:
Oh, at first glance, it's much simpler!

In your case:

>>> ss.to_coo()
---------------------------------------------------------------------
ValueError: Is not a partition because union is not the whole.

>>> # ???

With proposed (shown only for 2d case — as higher-d cases don't work):

>>> from scipy.sparse import coo_matrix
>>> sdf = pd.SparseDataFrame(coo_matrix([[0, 1, 0],
...                                      [1, 0, 0]]))

>>> sdf  # it is a dataframe
     0    1   2
0  NaN  1.0 NaN
1  1.0  NaN NaN

>>> sdf[0]  # behaves like a dataframe
0    NaN
1    1.0
Name: 0, dtype: float64
BlockIndex
Block locations: array([1], dtype=int32)
Block lengths: array([1], dtype=int32)

>>> sdf.to_coo()  # no complex arguments
<2x3 sparse matrix of type '<class 'numpy.float64'>'
	with 2 stored elements in COOrdinate format>

>>> # gives back the kind of 2d matrix it was given
>>> # no COOrdinate (multi)index computing by hand
>>> sdf[[0, 1]].to_coo().toarray()  # just works
array([[0, 1],
       [1, 0]])

That was the idea, at least. 😃

Contributor:
ok. btw. you don't necessarily need to touch SparseSeries.to_coo (or, as I mentioned before, .from_coo) in this PR (of course it's somewhat better / more consistent). More that we have a plan on what to do.


Notes
-----
The dtype will be the lowest-common-denominator type (implicit
Contributor:
mention that this is efficient (is it? does it copy? and if so what), mainly nice to have docs about this (and could refer main docs here as well).

Contributor Author:
As efficient as ndarray.astype() and np.concatenate() are. Iterated on the number of columns as that was the easiest for me to understand. Doesn't make an unnecessary copy since copy=False was added just now. Thanks.

What do you mean by referring main docs?

Contributor:
you can put a link in the doc-string itself to something in the main docs (a url). If you want to point to something more involved (this is an option).

Contributor Author:
Dunno. I just copied these Notes from NDFrame.values, adapting a slight bit.

@@ -1118,6 +1120,28 @@ def test_isnotnull(self):
'B': [True, False, True, True, False]})
tm.assert_frame_equal(res.to_dense(), exp)

@pytest.mark.parameterize('index', [None, list('abc')])
Contributor:
nice!

@kernc (Contributor Author) commented Mar 2, 2017:
Not nice at all! Doesn't work! Without arguments' default values (...=None), all I'm getting is:

TypeError: test_from_to_scipy() missing 3 required positional arguments: 'index', 'columns', and 'fill_value'

(With default values (...=None), those are the only values ever tested with, i.e. no param matrix.)

Short of reading the whole of the pytest docs, any clue as to what I'm doing wrong?

Contributor:
no, what you need to do is create this as a function (and not put it in the class). Parametrization doesn't work if the test lives in a class Foo(tm.TestCase), which is how we start all of the classes. So easiest to make a stand-alone function. You can change it to class Foo(object), but then all of the self.assert* methods have to be changed to pytest style.

I have done that in a few cases.
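A minimal sketch of the standalone-function style suggested here (function body and parameter values are illustrative): `pytest.mark.parametrize` attaches to plain functions, whereas `unittest.TestCase` methods are collected differently and ignore the parameters.

```python
import pytest


@pytest.mark.parametrize("index", [None, list("abc")])
@pytest.mark.parametrize("fill_value", [None, 0])
def test_roundtrip(index, fill_value):
    # a real test would construct the frame under test here; this stub just
    # shows that every parameter combination reaches the function body
    assert index in (None, list("abc"))
    assert fill_value in (None, 0)


# The decorator leaves the underlying function callable as-is:
test_roundtrip(index=None, fill_value=0)
```

Under pytest collection this single function expands into one test per combination of the stacked parameter lists (2 x 2 = 4 tests here).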

Contributor Author:
Right, got it. I guess it can't be a method on a unittest.TestCase instance.

def test_from_to_scipy(self, index=None, columns=None, fill_value=None):
# GH 4343
tm._skip_if_no_scipy()
from scipy.sparse import csr_matrix
Contributor:
can you also make a test that exercises all of the scipy input matrix types? (you may want to make these as fixtures for convenience)

@@ -59,6 +59,21 @@ def is_sparse(array):
return isinstance(array, (ABCSparseArray, ABCSparseSeries))


# oh the troubles to reduce import time
Contributor:
put at the top (the _spmatrix / comment)

if _spmatrix is None:
try:
from scipy.sparse import spmatrix as _spmatrix
except ImportError:
Contributor:
can you add a test for this (pandas.types.tests.inference) I think.

jreback commented Mar 1, 2017

you'll want to rebase on master as travis was having some issues

@kernc kernc force-pushed the scipy-sparse branch 2 times, most recently from 166b3bb to 2ddc7d6 Compare March 6, 2017 14:59
kernc commented Mar 6, 2017

Please give it another look. Besides addressing all comments, I made two major changes since the last iteration, namely:

  • checking (testing) that dtypes are preserved when possible,
  • the default_fill_value, when unspecified, is set to 0 when initialized with a spmatrix, as this is the implicit missing value in scipy sparse matrices.

mask = np.not_equal(arr, fill_value)
# In NumPy 1.12.0, not implemented at least for arr.dtype=str
if mask is NotImplemented:
mask = np.not_equal(arr.astype(object), fill_value)
@kernc (Contributor Author) commented Mar 6, 2017:
I needed to change this bit in order to support adding a str (object) column to a SparseDataFrame inited with default_fill_value=0. See here: https://github.com/pandas-dev/pandas/pull/15497/files#diff-b30cded911d05ab8f4b2ab86c7dcab58R1169
Otherwise it would crash with some uninformative and unexpected error.

Contributor:
don't use ``np.not_equal``. What are you trying to do here?

Contributor:
FYI #15587 is going to validate fill values (but of course let's fix here for what you need).

isinstance(fill_value, compat.string_types) is idiom for checking string-likes (if its a scalar)

Contributor Author:
Using np.not_equal() as an alternative to !=. Operator != returns False comparing arr when dtype=str and 0, issuing a FutureWarning that eventually, comparison will be element-wise. np.not_equal() doesn't issue such a warning but returns NotImplemented. One works with what one has. 😃

So if I just want element-wise comparison every time, regardless of dtype, I'd better do:

if pd.types.common.is_string_dtype(arr):
   arr = arr.astype(object)

mask = arr != fill_value
...

?

Contributor:
no, we don't use not_equal at all; it in fact doesn't work on object arrays (well, maybe in 1.12 it does, but it's not compat).

use pandas.types.missing.array_equivalent if you really need it, though isnull itself already calls this.

again, what are you trying to do: compare an array with a scalar?

Contributor Author:
Well, the mask = arr != fill_value test was there before, since 2011. And this test fails when arr.dtype.kind == 'U' and type(fill_value) != str (i.e. comparing a vector of fixed-length strings to a non-string). Or at least mask gets set to scalar False instead of a vector of boolean values, which then fails a few lines later on. It breaks on this test line, which, in my opinion, should generally pass.

Contributor:
yeah, the setting to scalar False is the problem here. you'd expect an array, but I think numpy is broken for this (we had a similar issue in groupby.filter). in any event just do what you need to pass and I'll take a look.

Contributor Author:
# element-wise but just returns False if fill_value is not str,
# so cast to object comparison to be safe
if is_string_dtype(arr):
arr = arr.astype(object)
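The behavior being worked around can be exercised with numpy alone; the array contents here are illustrative:

```python
import numpy as np

arr = np.array(["a", "b", "c"])  # fixed-width unicode dtype ('<U1')
fill_value = 0                   # non-string fill value

# Comparing the unicode array to an int directly may yield a scalar
# False / NotImplemented instead of a mask; casting to object forces
# element-wise Python comparisons, which always produce a boolean array.
mask = arr.astype(object) != fill_value
print(mask.tolist())  # [True, True, True]
```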
Contributor Author:
Perhaps the conversion should happen sooner, but certainly no later. Given the dtype-preserving behavior in SparseArray constructor, I doubt any sooner either.

@@ -59,6 +62,17 @@ def is_sparse(array):
return isinstance(array, (ABCSparseArray, ABCSparseSeries))


def is_scipy_sparse(array):
""" return if we are a scipy.sparse.spmatrix """
@jreback (Contributor) commented Mar 6, 2017:
this will keep trying to import scipy each time that it is called if scipy can not be imported. You have to set _is_scipy_sparse to something (rather than None) if its None to start, so that it is always set here.

easiest would be to do

_is_scipy_sparse = lambda x: False so this will just work. (in the except)

@jreback (Contributor) commented Mar 6, 2017:
The other thing is, shouldn't this be something like the following (otherwise it hides things, I think):

if _is_scipy_sparse is None:

    try:
        from scipy.sparse import issparse
    except ImportError:
        issparse = lambda x: False
    _is_scipy_sparse = issparse

return _is_scipy_sparse(array)

Contributor Author:

Found it here: https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Import_Statement_Overhead Was worried about the same thing, but have tested in Py2 and Py3 that the global is indeed set correctly.

Contributor Author:

Try:

>>> def f():
...     global aaa
...     import sys as aaa

>>> f()
>>> aaa
<module 'sys' ...>

Contributor:

Didn't know that. It makes sense, but it's non-obvious.

@@ -77,29 +79,26 @@ def __init__(self, data=None, index=None, columns=None, default_kind=None,
data = {columns[0]: data}

if default_fill_value is None:
default_fill_value = np.nan
default_fill_value = 0 if is_scipy_sparse(data) else np.nan
Contributor:

why is this?

@kernc (Contributor Author), Mar 8, 2017:

Well, SciPy sparse structures imply missing values are zeros. It felt like a sane default, but really it's better to keep the default nan. At least that way the user can decide what fill values they want. (And I don't know how to "unfill" a sparse frame that has been fillna-d.)

def spmatrix(request):
tm._skip_if_no_scipy()
from scipy import sparse
yield getattr(sparse, request.param + '_matrix')
Contributor:

You don't use `yield` here; just return the result. If this were a yield fixture you would use `@pytest.yield_fixture`, but there's no reason to use one, as you don't need any cleanup actions.

@pytest.mark.parametrize('index', [None, list('abc')])
@pytest.mark.parametrize('columns', [None, list('def')])
@pytest.mark.parametrize('fill_value', [None, 0, np.nan])
def test_from_to_scipy(spmatrix, index, columns, fill_value):
Contributor:

I would parametrize over dtypes too (all ints, uints, supported floats, object, bool).

Contributor Author:

For the record, this makes a total of 1176 tests, running some 20s on a very fast machine.

Contributor:

Woosh! OK if you want to cut it down a bit (also OK to leave this). Our test suite takes something like 400s to run (on 2 processors) anyhow.

Contributor Author:

Can trim to [object, bool, int, float, np.uint8, np.float16] (504 tests) and assume the rest work.

Contributor Author:

On the other hand, it was quite a pain to account for everything! But I think it's the generic way I'm testing now that avoids the problems I was encountering at first.

Contributor:

So you can leave out float16; we don't test that anywhere anyhow (and barely support it).

The rest look fine, though your comparisons may be slightly tricky, as `int` is platform-dependent.

if isinstance(fill_value, float):
arr = arr.astype(float)
arr[arr == 0] = np.nan
# If fill_value is None, set it to 0 for ndarray as value 0 is also the
Contributor:

blank line

tm.assert_frame_equal(sdf.to_dense(), res.to_dense())
tm.assert_numpy_array_equal(res.values, arr)

# Assert spmatrices equal (operator!= not supported in scipy 0.8.0)
Contributor:

What does this comment mean? That's a really old version.

Contributor Author:

Yet that version complains in CircleCI-run tests.

Contributor:

what? we don't install anything lower than 0.11

_ensure_int32,
_ensure_categorical)
from pandas.types.missing import isnull
from pandas.util import testing as tm

from pandas.tests.sparse.test_frame import spmatrix # noqa: F401
Contributor:

maybe just put spmatrix in pandas/tests/sparse/common.py (I would use conftest.py but not sure if these are picked up correctly)

Contributor Author:

Right, even with this change, lint complains spmatrix is imported but unused.

@@ -946,6 +949,12 @@ def test_nan_to_nat_conversions():
assert (s[8].value == np.datetime64('NaT').astype(np.int64))


def test_is_scipy_sparse(spmatrix): # noqa: F811
Contributor:

why do you need the noqa on these?

Contributor Author:

Here, lint complains: `F811 redefinition of unused 'spmatrix' from line 39`.

Contributor:

Hmm, yeah, I guess it is "removed" via the pytest mangling steps. Odd. OK then.

Contributor Author:

That's the thing with DSLs. Everyone needs to learn them, including linters. 😃


.. versionadded:: 0.16.0

Additionally, a :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
Contributor:

can you show an example of SparseSeries.to_coo() (maybe just slice sdf)

Contributor Author:

These lines are not an addition, just slightly reworded from before. Usage examples abound below. I'd rather not make my own here, as `series.to_coo()` currently doesn't work without arguments, which, to be honest, I'm having some trouble grasping.


SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances (:issue:`4343`). See the :ref:`documentation <sparse.scipysparse>` for more information.
Contributor:

can you put the issue number at the end of this paragraph (after information)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances (:issue:`4343`). See the :ref:`documentation <sparse.scipysparse>` for more information.

All sparse formats are supported, but matrices that aren't in *COOrdinate* format will be converted to it, copying the data as needed.
Contributor:

can you add a link directly to the sparse formats where COOrdinate is? (make that the text)

Contributor Author:

Linked to the sparse module page, where `coo_matrix` is about the fifth word vertically and COOrdinate is approximately centered on the screen.
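For readers landing here without a scipy background: COO (COOrdinate) format stores three parallel sequences of row indices, column indices, and values for the explicitly stored entries. A minimal pure-Python illustration of the round trip (a sketch of the idea, not scipy's implementation):

```python
def dense_to_coo(dense, fill_value=0):
    """Collect (rows, cols, data) triplet arrays for entries != fill_value."""
    rows, cols, data = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != fill_value:
                rows.append(i)
                cols.append(j)
                data.append(value)
    return rows, cols, data


def coo_to_dense(shape, rows, cols, data, fill_value=0):
    """Rebuild a dense nested list from COO triplet arrays."""
    out = [[fill_value] * shape[1] for _ in range(shape[0])]
    for i, j, value in zip(rows, cols, data):
        out[i][j] = value
    return out
```

For example, `dense_to_coo([[0, 1], [2, 0]])` yields `([0, 1], [1, 0], [1, 2])`, and `coo_to_dense` inverts it.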

sdf = pd.SparseDataFrame(sp_arr)
sdf

All sparse formats are supported, but matrices that aren't in *COOrdinate* format will be converted to it, copying the data as needed. To convert a ``SparseDataFrame`` back to sparse SciPy matrix in COO format, you can use :meth:`SparseDataFrame.to_coo` method:
Contributor:

same link (see below) here.

@@ -25,6 +25,7 @@
create_block_manager_from_arrays)
import pandas.core.generic as generic
from pandas.sparse.series import SparseSeries, SparseArray
from pandas._sparse import BlockIndex, get_blocks
Contributor:

from pandas.sparse.libsparse import BlockIndex, get_blocks (just changed)

dtype)
tm.assert_contains_all(sdf.dtypes, {np.dtype(res_dtype)})
tm.assert_equal(sdf.to_coo().dtype, res_dtype)

Contributor:

k this seems reasonable

was_upcast = ((fill_value is None or is_float(fill_value)) and
not is_object_dtype(dtype) and
not is_float_dtype(dtype))
res_dtype = (bool if pd.types.common.is_bool_dtype(dtype) else
Contributor:

you can import this up top

jreback (Contributor) commented Mar 8, 2017:

Just some minor comments, actually. LGTM. Ping when all green and I think we can merge. Will need a follow-up to fix `SparseSeries` anyhow (to accept this in the constructor), so we can catch any fixes there.

kernc (Contributor Author) commented Mar 10, 2017:

So this is green and kinda works. Feel free to edit at will.

jreback (Contributor) commented Mar 10, 2017:

@kernc will have a look tomorrow and merge.

@jreback jreback added this to the 0.20.0 milestone Mar 10, 2017
jreback (Contributor) commented Mar 10, 2017:

@sinhrks comments?

@jreback jreback closed this in 1be66ac Mar 10, 2017
jreback (Contributor) commented Mar 10, 2017:

thanks @kernc great PR!

love for you to take #15634 !

jreback (Contributor) commented Mar 10, 2017:

@kernc not a big deal but https://travis-ci.org/pandas-dev/pandas/jobs/209914478

.../home/travis/miniconda3/envs/pandas/lib/python3.5/site-packages/scipy/sparse/sputils.py:114: UserWarning: object dtype is not supported by sparse matrices
  warnings.warn("object dtype is not supported by sparse matrices")

jreback (Contributor) commented Mar 10, 2017:

should prob skip that test then (or catch warning)

kernc (Contributor Author) commented Mar 11, 2017:

What seem not to be supported are ufuncs and binops on them. Right, will skip it in #15634 if it can wait till next week.
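For the "catch warning" option mentioned above: a small stdlib-only sketch of suppressing scipy's `UserWarning` around a call (the helper name here is made up for illustration):

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Invoke `func` with UserWarnings suppressed, e.g. scipy's
    "object dtype is not supported by sparse matrices" warning."""
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', UserWarning)
        return func(*args, **kwargs)
```

In a pytest suite, decorating the test with `@pytest.mark.filterwarnings('ignore::UserWarning')` would achieve the same thing without a helper.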

jreback (Contributor) commented Mar 11, 2017:

np thanks

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#4343

Author: Kernc <[email protected]>

Closes pandas-dev#15497 from kernc/scipy-sparse and squashes the following commits:

a0f2208 [Kernc] DOC: Fix some whatsnew/v0.20.0.txt sphinx warnings
e72e594 [Kernc] ENH: Native conversion from/to scipy.sparse matrix to SparseDataFrame
jorisvandenbossche (Member) commented Oct 23, 2018:

@kernc didn't go through the full reviewing discussion here, but was it ever discussed what the default fill value should be when converting a scipy sparse matrix to SparseDataFrame?

Because now, it actually "changes" your data due to the default fill value of NaN:

In [27]: a = scipy.sparse.coo_matrix([[0., 1.], [2., 0.]])

In [28]: a.toarray()
Out[28]: 
array([[ 0.,  1.],
       [ 2.,  0.]])

In [29]: df = pd.SparseDataFrame(a)

In [30]: df
Out[30]: 
     0    1
0  NaN  1.0
1  2.0  NaN

In [31]: np.array(df)
Out[31]: 
array([[ nan,   1.],
       [  2.,  nan]])

I would expect the SparseDataFrame constructor to respect the implicit fill value of a scipy sparse matrix, which is 0.
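The point generalizes beyond pandas: densifying with a NaN fill value rewrites the zeros that scipy's formats leave implicit. A pure-Python sketch (`densify` is a hypothetical helper for illustration, not pandas API):

```python
import math


def densify(shape, triplets, fill_value):
    """Expand COO-style (row, col, value) triplets into a nested list,
    writing `fill_value` wherever no entry is explicitly stored."""
    out = [[fill_value] * shape[1] for _ in range(shape[0])]
    for i, j, value in triplets:
        out[i][j] = value
    return out


triplets = [(0, 1, 1.0), (1, 0, 2.0)]  # all that the sparse format stores
faithful = densify((2, 2), triplets, fill_value=0.0)    # [[0.0, 1.0], [2.0, 0.0]]
lossy = densify((2, 2), triplets, fill_value=math.nan)  # NaN where the data had 0.0
```

With `fill_value=0.0` the original matrix is recovered exactly; with `fill_value=math.nan` the implicit zeros come back as NaN, which is the "changed data" shown in the session above.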

Labels: Dtype Conversions (unexpected or buggy dtype conversions), Enhancement, Sparse (sparse data type)

Successfully merging this pull request may close these issues:

ENH: native sparse conversion from csc/csr scipy sparse matrix to DataFrame