ENH: Native conversion from/to scipy.sparse matrix to SparseDataFrame #15497
Conversation
The tests require scipy; however, scipy is not a pandas dependency. What should we do?
scipy is already included in many of the test runs, e.g. https://github.com/pandas-dev/pandas/blob/master/ci/requirements-2.7.run. You need to use ``tm._skip_if_no_scipy`` in your tests (assuming you don't need a specific version). In code it's a bit more tricky; I will comment on review in a bit.
see for example ``.interpolate``, where we dispatch to scipy (or raise an error if it's not installed) for some methods
so this API is a bit at odds with what was done earlier. So for API consistency, I think it makes sense to change it. @wesm IIRC you had some thoughts w.r.t. this.
looks generally good, but some pandas style requirements (on the code) and methodologies
doc/source/whatsnew/v0.20.0.txt
Outdated
SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating SparseDataFrames directly from ``scipy.sparse.spmatrix`` instances. E.g.
add the issue number after "instances". Remove the 'E.g'.
doc/source/whatsnew/v0.20.0.txt
Outdated
sdf = pd.DataFrame(sp_arr)
sdf

To convert a SparseDataFrame back to a scipy sparse matrix in COO format, you can use:
add this example to the sparse docs as well.
pandas/sparse/frame.py
Outdated
from pandas.util.decorators import Appender
import pandas.core.ops as ops

try:
so we don't want to do an import here, which will affect all of pandas.
- create a function ``pandas.types.common.is_scipy_sparse``, then you can either:
- do the import (and thus check ``isinstance``); you may need to jump through some hoops to avoid repeatedly trying this import (you make a ``_try_import_scipy_sparse`` function), then assign the result to a module-level variable on first use. See in ``pandas.io.pytables`` how ``_tables()`` works.
- OR, and maybe better, if scipy sparse matrices are duck-type inferable, IOW, if it's possible to check whether they have a couple of key attributes, then I would just do this.
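A minimal sketch of the lazy-import helper described above (the helper name and caching strategy are illustrative, not the final pandas code):

```python
_spmatrix = None  # cache for the scipy.sparse.spmatrix class, set on first call


def is_scipy_sparse(array):
    """Return True if `array` is a scipy.sparse.spmatrix instance."""
    global _spmatrix
    if _spmatrix is None:
        try:
            from scipy.sparse import spmatrix as _spmatrix
        except ImportError:
            # scipy missing: cache a private class that nothing can be an
            # instance of, so subsequent calls never retry the import
            class _spmatrix(object):
                pass
    return isinstance(array, _spmatrix)
```

This way scipy is only imported on first use, and the failed-import case is cached too, so the function stays cheap for all of pandas.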
pandas/sparse/frame.py
Outdated
elif isinstance(data, BlockManager):
    mgr = self._init_mgr(data, axes=dict(index=index, columns=columns),
                         dtype=dtype, copy=copy)
elif isinstance(data, spmatrix):
so this will change to is_scipy_sparse_matrix
pandas/sparse/frame.py
Outdated
@@ -175,6 +176,33 @@ def _init_dict(self, data, index, columns, dtype=None):

def _init_matrix(self, data, index, columns, dtype=None):
    data = _prep_ndarray(data, copy=False)
can you add a 1-liner doc (what this function is doing)
pandas/sparse/frame.py
Outdated
data = dict([(idx, data[:, i]) for i, idx in enumerate(columns)])
return self._init_dict(data, index, columns, dtype)

def as_matrix(self, columns=None, sparse=False):
I understand why you are changing / adding this. BUT this is actually a major API change. This is what ``.values`` calls, and as a result the return types are: dense numpy arrays / pandas extension types (categorical).
Not against this (in fact I think it's a good idea), but split this out into a separate issue. Then the impl would actually be done in the block manager.
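A small sketch of the point above: ``.values`` always returns a dense numpy ndarray (or a pandas extension array for e.g. categoricals), never a sparse structure, which is why routing a sparse return through it is an API change:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0], 'b': [2.0, 0.0, 0.0]})
vals = df.values            # densifies; mixed int/float upcasts to float64
assert isinstance(vals, np.ndarray)
assert vals.dtype == np.float64
```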
pandas/sparse/frame.py
Outdated
def to_coo(self):
    """
    Convert the frame to its SciPy sparse COO matrix representation.
add a versionadded
pandas/tests/sparse/test_frame.py
Outdated
def test_from_to_scipy(self):
    # GH 4343
    try:
        from scipy.sparse import csr_matrix
use tm._skip_if_no_scipy
pandas/tests/sparse/test_frame.py
Outdated
for index, columns in ((list('abc'), list('def')),
                       (None, None)):
    sdf = pd.SparseDataFrame(spm, index=index, columns=columns)
I would rather you actually construct the SDF directly and simply use ``tm.assert_sp_frame_equal``, which already does all of this.
Further, check for various fill types as well.
I would write this using ``pytest.parametrize`` (as our new way of doing things).
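A minimal sketch of the suggested ``pytest.parametrize`` style; the test body, names, and parameter lists here are illustrative, not the PR's final test:

```python
import numpy as np
import pytest


@pytest.mark.parametrize('index', [None, list('abc')])
@pytest.mark.parametrize('columns', [None, list('def')])
def test_roundtrip_sketch(index, columns):
    # hypothetical stand-in for the real SparseDataFrame round-trip test:
    # pytest generates one test per (index, columns) combination
    sparse = pytest.importorskip('scipy.sparse')
    spm = sparse.csr_matrix(np.eye(3))
    # round-trip through COO and compare dense results
    np.testing.assert_array_equal(spm.tocoo().toarray(), np.eye(3))
```

Stacked ``parametrize`` decorators give the full cross-product of parameters, so adding fill values or dtypes is one extra decorator line.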
pandas/sparse/frame.py
Outdated
return self._init_dict(data, index, columns, dtype)

def _init_spmatrix(self, data, index, columns, dtype=None):
    index, columns = self._prep_index(data, index, columns)
1-liner here as well
I don't have too many opinions on this feature.
doc/source/whatsnew/v0.20.0.txt
Outdated
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sp_arr
sdf = pd.DataFrame(sp_arr)
Should this be ``SparseDataFrame(sp_arr)`` instead of ``DataFrame``?
I never use sparse, so also not much of an opinion, but a question: there already exists a …
well @jorisvandenbossche that's exactly the question. just because we have …
Codecov Report
@@ Coverage Diff @@
## master #15497 +/- ##
==========================================
- Coverage 91.02% 90.99% -0.03%
==========================================
Files 143 143
Lines 49324 49369 +45
==========================================
+ Hits 44897 44925 +28
- Misses 4427 4444 +17
Continue to review full report at Codecov.
pandas/types/common.py
Outdated
def is_scipy_sparse(array):
    """ return if we are a scipy.sparse.spmatrix """
    global _spmatrix
    if _spmatrix is None:
Avoided importing in a separate function as this seemed concise enough.
I am not aware of any features that would unambiguously discern scipy sparse matrices. We'd have to check for what we use later on, namely ``.shape`` and ``.tocoo(copy: bool) -> scipy.sparse.coo_matrix``. Checking the type directly looks cleaner.
yeah ok that's fine
pandas/tests/sparse/test_frame.py
Outdated
@@ -1118,6 +1120,28 @@ def test_isnotnull(self):
    'B': [True, False, True, True, False]})
    tm.assert_frame_equal(res.to_dense(), exp)

@pytest.mark.importorskip('scipy.sparse')
Used ``tm._skip_if_no_scipy()`` first, but then opted for this. Is that ok?
yes that's fine if on that test
actually on 2nd thought, use the ``tm._skip...`` as we are consistently using that. (does the same thing)
A :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
Pandas supports creating sparse dataframes directly from ``scipy.sparse`` matrices.
Normally, versionadded parts are appended at the end, but I thought this prominent, hopefully non-experimental feature should be exposed at the top ...
no that's fine, just mention, starting in 0.20.0 (or can use a versionadded tag)
What way is that? SparseSeries accepting 1×N sparse matrices in the constructor?
There are other types of sparse matrices, and while the user could be asked to convert to a specific format, why should they be forced to? Conversion from one sparse format to another is cheap (relative to any operation that makes anything dense). Several spmatrix methods do the conversion to the optimal type for the operation internally as needed. And spmatrix objects are still de facto sparse structures in this stack, so accepting them in the constructor would not be unexpected functionality.
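Illustrating the claim above with plain scipy: converting between sparse formats copies only the stored (nnz) entries, never a dense N x M array.

```python
import numpy as np
from scipy.sparse import csr_matrix

csr = csr_matrix(np.array([[0, 1, 0], [1, 0, 0]]))
coo = csr.tocoo()   # CSR -> COO: re-layout of the 2 stored elements
lil = coo.tolil()   # COO -> LIL likewise
assert coo.nnz == lil.nnz == 2
assert np.array_equal(lil.toarray(), csr.toarray())
```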
so the question is: do we have several …? I think it should be consistent one way or the other (e.g. NO implicit construction in the constructor / maybe raise an error actually, with …
There are no relevant … As is, we already support all ``scipy.sparse.spmatrix`` subclasses through the constructor, so individual per-type …
@kernc so I think you answered my question. If we can unambiguously accept scipy sparse matrices w/o having to have a …
But deprecate …? Basically, whatever, but just accepting stuff in the constructor?
Simply moving the functionality into the constructor itself (almost exactly as it is). IOW, something like
@kernc of course you may still want these routines if you need to control the conversion process in some way; though we maybe could add those as optional params to the constructor themselves. |
doc/source/sparse.rst
Outdated
.. versionadded:: 0.16.0

Additionally, an experimental :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
you can drop experimental
looks pretty good. few comments.
.. _whatsnew_0200.enhancements.scipy_sparse:

SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
add a ref to the new docs you created
doc/source/whatsnew/v0.20.0.txt
Outdated
SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances (:issue:`4343`).
mention if this doesn't work with particular scipy types (and in docs too)
pandas/sparse/frame.py
Outdated
def _init_spmatrix(self, data, index, columns, dtype=None):
    """ Init self from scipy.sparse matrix """
    index, columns = self._prep_index(data, index, columns)
    data = data.tocoo(copy=False)
ahh I see, tricky! does this have untoward side effects? (e.g. for certain matrices it is bad, meaning it doesn't work or forces a copy?)
Works for any ``scipy.sparse.spmatrix``. As per the docs, it makes a copy if it needs to, but this is a copy of the sparse data, which is relatively cheap. Sparse matrix methods in SciPy do the conversion to the optimal type for the operation internally as needed, e.g. see ``coo_matrix.maximum``, ``lil_matrix.getcol``, ``dok_matrix.__mul__`` via ``spmatrix._mul_sparse_matrix`` ...
I am noting the transparent conversion (and potential copy) in the docs.
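A small check of the behavior described above, using only standard scipy API: ``tocoo()`` re-lays-out the sparse index structure, and with ``copy=False`` the stored values can be shared with the source matrix rather than copied.

```python
import numpy as np
from scipy.sparse import csr_matrix, coo_matrix

csr = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
coo = csr.tocoo(copy=False)   # cheap: only the nnz entries are re-laid-out
assert isinstance(coo, coo_matrix)
assert np.array_equal(coo.toarray(), csr.toarray())
```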
ok that is fine
pandas/sparse/frame.py
Outdated
data = data.tocoo(copy=False)
N = len(index)
bindex = np.arange(N, dtype=np.int32)
add a 1-liner on what you are doing
pandas/sparse/frame.py
Outdated
return self._init_dict(data, index, columns, dtype)

def to_coo(self):
    """
    Convert the frame to its SciPy sparse COO matrix representation.
does this require any special index? mention either way (and what happens to the index anyhow)? what about column names?
There is no index (either top or side). A simple 2-D spmatrix (``ndim == 2``) is returned, laid out as the frame was. It's like ``.values``, but retaining sparsity.
What do I write?
no, my point is that the index (of the DF) is discarded (well, maybe the column names are kept?). It's fine, just need to point that out.
SciPy matrices are indexed (rows and columns) sequentially 0..(N-1), 0..(M-1), like ndarrays. Obviously, none of the DF indices are retained on the coo_matrix, only the raw values in the same shape. I don't know what to write. 😊
Perhaps I should change "Convert the frame ..." to "Return a SciPy sparse matrix ..." as nothing happens in-place.
How does this compare to ``SparseSeries.to_coo`` then?
In [6]: >>> from numpy import nan
...: >>> s = Series([3.0, nan, 1.0, 3.0, nan, nan])
...: >>> s.index = MultiIndex.from_tuples([(1, 2, 'a', 0),
...: (1, 2, 'a', 1),
...: (1, 1, 'b', 0),
...: (1, 1, 'b', 1),
...: (2, 1, 'b', 0),
...: (2, 1, 'b', 1)],
...: names=['A', 'B', 'C', 'D'])
...: >>> ss = s.to_sparse()
...:
In [7]: ss
Out[7]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)
In [8]: ss.to_coo(row_levels=['A', 'B'],column_levels=['C', 'D'],sort_labels=True)
Out[8]:
(<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>,
[(1, 1), (1, 2), (2, 1)],
[('a', 0), ('a', 1), ('b', 0), ('b', 1)])
Oh, at first glance, it's much simpler!
In your case:
>>> ss.to_coo()
---------------------------------------------------------------------
ValueError: Is not a partition because union is not the whole.
>>> # ???
With proposed (shown only for 2d case — as higher-d cases don't work):
>>> from scipy.sparse import coo_matrix
>>> sdf = pd.SparseDataFrame(coo_matrix([[0, 1, 0],
... [1, 0, 0]]))
>>> sdf # shit is a dataframe
0 1 2
0 NaN 1.0 NaN
1 1.0 NaN NaN
>>> sdf[0] # behaves like a dataframe
0 NaN
1 1.0
Name: 0, dtype: float64
BlockIndex
Block locations: array([1], dtype=int32)
Block lengths: array([1], dtype=int32)
>>> sdf.to_coo() # no complex arguments
<2x3 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
>>> # gives back the kind of 2d matrix it was given
>>> # no COOrdinate (multi)index computing by hand
>>> sdf[[0, 1]].to_coo().toarray() # just works
array([[0, 1],
[1, 0]])
That was the idea, at least. 😃
ok. btw. don't necessarily need to touch ``SparseSeries.to_coo`` (or, as I mentioned before, ``.from_coo``) in this PR (of course it's somewhat better / more consistent). More that we have a plan on what to do.
Notes
-----
The dtype will be the lowest-common-denominator type (implicit
mention that this is efficient (is it? does it copy? and if so what), mainly nice to have docs about this (and could refer main docs here as well).
As efficient as ``ndarray.astype()`` and ``np.concatenate()`` are. Iterated over the number of columns as that was the easiest for me to understand. Doesn't make an unnecessary copy, since ``copy=False`` was added just now. Thanks.
What do you mean by referring to the main docs?
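A sketch of the efficiency point above: ``ndarray.astype(copy=False)`` is a no-op when the dtype already matches, so no unnecessary copy is made; only an actual dtype change allocates.

```python
import numpy as np

a = np.arange(4, dtype=np.int64)
same = a.astype(np.int64, copy=False)  # dtype matches: returns `a` itself
cast = a.astype(np.float64)            # dtype change: allocates a new array
assert same is a
assert cast.dtype == np.float64 and cast is not a
```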
you can put a link in the doc-string itself to something in the main docs (a url). If you want to point to something more involved (this is an option).
Dunno. I just copied these Notes from NDFrame.values, adapting a slight bit.
pandas/tests/sparse/test_frame.py
Outdated
@@ -1118,6 +1120,28 @@ def test_isnotnull(self):
    'B': [True, False, True, True, False]})
    tm.assert_frame_equal(res.to_dense(), exp)

@pytest.mark.parameterize('index', [None, list('abc')])
nice!
Not nice at all! Doesn't work! Without the arguments' default values (``...=None``), all I'm getting is:
TypeError: test_from_to_scipy() missing 3 required positional arguments: 'index', 'columns', and 'fill_value'
(With default values (``...=None``), those are the only values ever tested with, i.e. no param matrix.)
Short of reading the whole of the pytest docs, any clue as to what I'm doing wrong?
no, what you need to do is create this as a function (and not put it in the class). parametrization doesn't work if:
class Foo(tm.TestCase)
which is how we start all of the classes. So easiest is to make a standalone function. You can change it to ``class Foo(object)``, but then all of the ``self.assert*`` methods have to be changed to ``pytest`` style.
I have done that in a few cases.
Right, got it. I guess it can't be a method on a ``unittest.TestCase`` instance.
pandas/tests/sparse/test_frame.py
Outdated
def test_from_to_scipy(self, index=None, columns=None, fill_value=None):
    # GH 4343
    tm._skip_if_no_scipy()
    from scipy.sparse import csr_matrix
can you also make a test which exercises all of the scipy input matrix types? (you may want to make these fixtures for convenience)
pandas/types/common.py
Outdated
@@ -59,6 +59,21 @@ def is_sparse(array):
    return isinstance(array, (ABCSparseArray, ABCSparseSeries))


# oh the troubles to reduce import time
put at the top (the _spmatrix / comment)
pandas/types/common.py
Outdated
if _spmatrix is None:
    try:
        from scipy.sparse import spmatrix as _spmatrix
    except ImportError:
can you add a test for this (``pandas.types.tests.inference``, I think).
you'll want to rebase on master as travis was having some issues
Force-pushed the branch from 166b3bb to 2ddc7d6.
Please give it another look. Besides addressing all comments, I made two major changes since the last iteration, namely:
pandas/sparse/array.py
Outdated
mask = np.not_equal(arr, fill_value)
# In NumPy 1.12.0, not implemented at least for arr.dtype=str
if mask is NotImplemented:
    mask = np.not_equal(arr.astype(object), fill_value)
I needed to change this bit in order to support adding a ``str`` (``object``) column to a ``SparseDataFrame`` inited with ``default_fill_value=0``. See here: https://github.com/pandas-dev/pandas/pull/15497/files#diff-b30cded911d05ab8f4b2ab86c7dcab58R1169
Otherwise it would crash with an uninformative and unexpected error.
don't use ``np.not_equal``. what are you trying to do here?
FYI #15587 is going to validate fill values (but of course let's fix here for what you need).
``isinstance(fill_value, compat.string_types)`` is the idiom for checking string-likes (if it's a scalar).
Using ``np.not_equal()`` as an alternative to ``!=``. The ``!=`` operator returns ``False`` when comparing ``arr`` with ``dtype=str`` and ``0``, issuing a FutureWarning that eventually the comparison will be element-wise. ``np.not_equal()`` doesn't issue such a warning but returns ``NotImplemented``. One works with what one has. 😃
So if I just want element-wise comparison every time, regardless of dtype, I'd better do:
if pd.types.common.is_string_dtype(arr):
    arr = arr.astype(object)
mask = arr != fill_value
...
?
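The cast-to-object workaround being discussed can be demonstrated on its own (this snippet is a sketch of the idea, not the PR's exact code): an object array always compares elementwise against any scalar, sidestepping the string-vs-number quirk.

```python
import numpy as np

arr = np.array(['a', 'b', 'c'], dtype=str)
mask = arr.astype(object) != 0   # elementwise boolean ndarray, not scalar False
assert isinstance(mask, np.ndarray)
assert mask.all()                # every string differs from 0
```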
no, we don't use ``not_equal`` at all; it in fact doesn't work on object arrays (well, maybe in 1.12 it does, but it's not compat).
use ``pandas.types.missing.array_equivalent`` if you really need it, though ``isnull`` itself already calls this.
again, what are you trying to do, compare an array w/ a scalar?
Well, the ``mask = arr != fill_value`` test has been there since 2011. And this test fails when ``arr.dtype.kind == 'U' and type(fill_value) != str`` (i.e. comparing a vector of fixed-length strings to a non-string). Or at least ``mask`` gets set to scalar ``False`` instead of a vector of boolean values, which then fails a few lines later. It breaks on this test line, which, in my opinion, should generally pass.
yeah, the setting to scalar ``False`` is the problem here. you'd expect an array, but I think numpy is broken for this (we had a similar issue in groupby.filter). in any event, just do what you need to pass and I'll take a look.
Right, continued in https://github.com/pandas-dev/pandas/pull/15497/files#r104474889.
# element-wise but just returns False if fill_value is not str,
# so cast to object comparison to be safe
if is_string_dtype(arr):
    arr = arr.astype(object)
Perhaps the conversion should happen sooner, but certainly no later. Given the dtype-preserving behavior in SparseArray constructor, I doubt any sooner either.
@@ -59,6 +62,17 @@ def is_sparse(array):
    return isinstance(array, (ABCSparseArray, ABCSparseSeries))


def is_scipy_sparse(array):
    """ return if we are a scipy.sparse.spmatrix """
this will keep trying to import scipy each time it is called if scipy cannot be imported. You have to set ``_is_scipy_sparse`` to something (rather than ``None``) if it's ``None`` to start, so that it is always set here.
easiest would be to do ``_is_scipy_sparse = lambda x: False`` (in the except), so this will just work.
The other thing is, shouldn't this be something like the following (otherwise it hides it, I think):
if _is_scipy_sparse is None:
try:
from scipy.sparse import issparse
except ImportError:
issparse = lambda x: False
_is_scipy_sparse = issparse
return _is_scipy_sparse(array)
Found it here: https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Import_Statement_Overhead Was worried about the same thing, but have tested in Py2 and Py3 that the global is indeed set correctly.
Try:
>>> def f():
... global aaa
... import sys as aaa
>>> f()
>>> aaa
<module 'sys' ...>
didn't know that. it makes sense but is non-obvious.
pandas/sparse/frame.py
Outdated
@@ -77,29 +79,26 @@ def __init__(self, data=None, index=None, columns=None, default_kind=None,
    data = {columns[0]: data}

if default_fill_value is None:
    default_fill_value = np.nan
default_fill_value = 0 if is_scipy_sparse(data) else np.nan
why is this?
Well, SciPy sparse structures imply missing values are zeros. It felt like a sane default, but really it's better to keep the default nan. At least that way the user can decide what fill values they want. (And I don't know how to "unfill" a sparse frame that has been ``fillna``-d.)
pandas/tests/sparse/test_frame.py
Outdated
def spmatrix(request):
    tm._skip_if_no_scipy()
    from scipy import sparse
    yield getattr(sparse, request.param + '_matrix')
you don't use ``yield`` here, just return the result. If this were a yield fixture, then you would use ``@pytest.yield_fixture``, but there's no reason to use one as you don't need any cleanup actions.
pandas/tests/sparse/test_frame.py
Outdated
@pytest.mark.parametrize('index', [None, list('abc')])
@pytest.mark.parametrize('columns', [None, list('def')])
@pytest.mark.parametrize('fill_value', [None, 0, np.nan])
def test_from_to_scipy(spmatrix, index, columns, fill_value):
I would parametrize over dtypes too (all ints, uints, supported floats, object, bool).
For the record, this makes a total of 1176 tests, running some 20s on a very fast machine.
woosh! ok if you want to cut it down a bit ok (also ok to leave this). our test suite is something like 400s to run (on 2 processors) anyhow.
Can trim to ``[object, bool, int, float, np.uint8, np.float16]`` (504 tests) and assume the rest work.
On the other hand, it was quite a pain to account for everything! But I think it's the generic way I'm testing in now that avoids the problems I was encountering at first.
so you can leave out float16
we don't test that anywhere anyhow (and barely support it).
yeah the rest look fine (though your comparisons may be slightly tricky as int is platform dependent).
pandas/tests/sparse/test_frame.py
if isinstance(fill_value, float):
    arr = arr.astype(float)
    arr[arr == 0] = np.nan
# If fill_value is None, set it to 0 for ndarray as value 0 is also the
blank line
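To illustrate the branch in the diff above (a sketch with made-up data, not the test's actual arrays): a float fill value forces integer data to be upcast, since NaN is not representable in integer dtypes.

```python
import numpy as np
import scipy.sparse as sp

# Start from integer data in a scipy COO matrix.
arr = sp.coo_matrix(np.array([[0, 1], [2, 0]], dtype=np.int64)).toarray()

# Mirror the test logic: with a float fill value, the expected dense
# result must be upcast so that the unstored zeros can become NaN.
expected = arr.astype(float)
expected[expected == 0] = np.nan
```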
pandas/tests/sparse/test_frame.py
tm.assert_frame_equal(sdf.to_dense(), res.to_dense())
tm.assert_numpy_array_equal(res.values, arr)

# Assert spmatrices equal (operator!= not supported in scipy 0.8.0)
what does this comment mean? that's a really old version
Yet that version complains in CircleCI-run tests.
what? we don't install anything lower than 0.11
_ensure_int32,
_ensure_categorical)
from pandas.types.missing import isnull
from pandas.util import testing as tm

from pandas.tests.sparse.test_frame import spmatrix  # noqa: F401
maybe just put spmatrix in pandas/tests/sparse/common.py (I would use conftest.py but not sure if these are picked up correctly)
Right, even with this change, lint complains spmatrix is imported but unused.
@@ -946,6 +949,12 @@ def test_nan_to_nat_conversions():
assert (s[8].value == np.datetime64('NaT').astype(np.int64))


def test_is_scipy_sparse(spmatrix):  # noqa: F811
why do you need the noqa on these?
Here, lint complains: F811 redefinition of unused 'spmatrix' from line 39
hmm yeah I guess it is 'removed' via the pytest mangling steps, odd. ok then.
That's the thing with DSLs. Everyone needs to learn them, including linters. 😃
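The check under test can be sketched as follows (a simplified stand-in, not pandas' private implementation):

```python
import scipy.sparse


def is_scipy_sparse(obj):
    # All scipy sparse matrix formats (coo/csr/csc/lil/dok/...) derive
    # from spmatrix, so a single isinstance check covers them all.
    return isinstance(obj, scipy.sparse.spmatrix)
```

Dense arrays, lists, and other containers fall through to False.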
.. versionadded:: 0.16.0

Additionally, a :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
can you show an example of SparseSeries.to_coo() (maybe just slice sdf)
These lines are not an addition, just slightly reworded from before. Usage examples abound below. I'd rather not make my own here, as series.to_coo() currently doesn't work without arguments, which, to be honest, I'm having some trouble grasping.
doc/source/whatsnew/v0.20.0.txt
SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances (:issue:`4343`). See the :ref:`documentation <sparse.scipysparse>` for more information.
can you put the issue number at the end of this paragraph (after information)
doc/source/whatsnew/v0.20.0.txt
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances (:issue:`4343`). See the :ref:`documentation <sparse.scipysparse>` for more information.

All sparse formats are supported, but matrices that aren't in *COOrdinate* format will be converted to it, copying the data as needed.
can you add a link directly to the sparse formats where COOrdinate is? (make that the text)
Linked to the sparse module, where coo_matrix is about the fifth word vertically and COOrdinate is approximately centered on the screen.
doc/source/sparse.rst
sdf = pd.SparseDataFrame(sp_arr)
sdf

All sparse formats are supported, but matrices that aren't in *COOrdinate* format will be converted to it, copying the data as needed. To convert a ``SparseDataFrame`` back to sparse SciPy matrix in COO format, you can use :meth:`SparseDataFrame.to_coo` method:
same link (see below) here.
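A runnable sketch of the documented roundtrip. Note that SparseDataFrame itself was removed in pandas 1.0, so this uses the modern accessor equivalents of the PR's API:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Random sparse matrix in COO format.
sp_arr = sp.random(5, 3, density=0.4, format='coo', random_state=0)

# The PR's SparseDataFrame(sp_arr); the modern equivalent is:
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

# And back to a scipy COO matrix, as in SparseDataFrame.to_coo():
coo = sdf.sparse.to_coo()
```

The roundtrip preserves shape and values; non-COO input formats are converted internally.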
pandas/sparse/frame.py
@@ -25,6 +25,7 @@
    create_block_manager_from_arrays)
import pandas.core.generic as generic
from pandas.sparse.series import SparseSeries, SparseArray
from pandas._sparse import BlockIndex, get_blocks
from pandas.sparse.libsparse import BlockIndex, get_blocks (just changed)
dtype)
tm.assert_contains_all(sdf.dtypes, {np.dtype(res_dtype)})
tm.assert_equal(sdf.to_coo().dtype, res_dtype)
k this seems reasonable
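The assertions above check that the element dtype survives the roundtrip; the scipy side of that contract can be sketched as:

```python
import numpy as np
import scipy.sparse as sp

# Format conversions within scipy preserve the element dtype,
# which the pandas roundtrip assertions build on.
spm = sp.csr_matrix(np.eye(3, dtype=np.uint8))
assert spm.tocoo().dtype == np.uint8
```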
pandas/tests/sparse/test_frame.py
was_upcast = ((fill_value is None or is_float(fill_value)) and
              not is_object_dtype(dtype) and
              not is_float_dtype(dtype))
res_dtype = (bool if pd.types.common.is_bool_dtype(dtype) else
you can import this up top
just some minor comments actually. lgtm. ping when all green and I think can merge. Will need to follow up to fix SparseSeries anyhow (to accept in the constructor) so can catch any fixes there. |
0af8935 to 99f826a
So this is green and kinda works. Feel free to edit at will. |
@kernc will have a look tomorrow and merge. |
@sinhrks comments? |
@kernc not a big deal but https://travis-ci.org/pandas-dev/pandas/jobs/209914478
|
should prob skip that test then (or catch warning) |
What seem not to be supported are ufuncs and binops on them. Right, will skip it in #15634 if it can wait till next week. |
np thanks |
closes pandas-dev#4343 Author: Kernc <[email protected]> Closes pandas-dev#15497 from kernc/scipy-sparse and squashes the following commits: a0f2208 [Kernc] DOC: Fix some whatsnew/v0.20.0.txt sphinx warnings e72e594 [Kernc] ENH: Native conversion from/to scipy.sparse matrix to SparseDataFrame
@kernc didn't go through the full reviewing discussion here, but was it ever discussed what the default fill value should be when converting a scipy sparse matrix to SparseDataFrame? Because now, it actually "changes" your data due to the default fill value of NaN:
I would expect the SparseDataFrame constructor to respect the implicit fill value of a scipy sparse matrix, which is 0. |
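A sketch of the distinction being raised (using the modern accessor API, since SparseDataFrame itself was later removed): scipy's implicit fill value of 0 can be preserved, whereas a NaN fill value would effectively turn the unstored zeros into missing values.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

spm = sp.coo_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

# from_spmatrix keeps scipy's implicit fill value of 0 --
# the behaviour this comment argues for.
sdf = pd.DataFrame.sparse.from_spmatrix(spm)
assert sdf.iloc[0, 0] == 0.0  # unstored entries stay 0, not NaN
```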
git diff upstream/master | flake8 --diff