ENH: Native conversion from/to scipy.sparse matrix to SparseDataFrame #15497
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -186,9 +186,32 @@ the correct dense result. | |
Interaction with scipy.sparse | ||
----------------------------- | ||
|
||
Experimental api to transform between sparse pandas and scipy.sparse structures. | ||
.. versionadded:: 0.20.0 | ||
|
||
A :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``. | ||
Pandas supports creating sparse dataframes directly from ``scipy.sparse`` matrices. | ||
|
||
.. ipython:: python | ||
|
||
from scipy.sparse import csr_matrix | ||
|
||
arr = np.random.random(size=(1000, 5)) | ||
arr[arr < .9] = 0 | ||
|
||
sp_arr = csr_matrix(arr) | ||
sp_arr | ||
|
||
sdf = pd.SparseDataFrame(sp_arr) | ||
sdf | ||
|
||
All sparse formats are supported, but matrices that are not in :mod:`COOrdinate <scipy.sparse>` format will be converted to it, copying the data as needed. To convert a ``SparseDataFrame`` back to a sparse SciPy matrix in COO format, use the :meth:`SparseDataFrame.to_coo` method: | ||
|
||
.. ipython:: python | ||
|
||
sdf.to_coo() | ||
|
||
.. versionadded:: 0.16.0 | ||
|
||
Additionally, a :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``. | ||
can you show an example of
These lines are not an addition, just slightly reworded from before. Usage examples abound below. I'd rather not make my own here as
||
|
||
The method requires a ``MultiIndex`` with two or more levels. | ||
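To make the COOrdinate layout mentioned in this section concrete, here is a minimal pure-Python sketch of the three parallel sequences (values, row positions, column positions) that a COO matrix stores, and how they map back to a dense table. It deliberately avoids SciPy and pandas; the helper names `dense_to_coo` and `coo_to_dense` are hypothetical and exist only for illustration.

```python
def dense_to_coo(dense):
    """Return (data, row, col) lists for the non-zero entries of a 2-D list."""
    data, row, col = [], [], []
    for i, line in enumerate(dense):
        for j, value in enumerate(line):
            if value != 0:
                data.append(value)
                row.append(i)
                col.append(j)
    return data, row, col


def coo_to_dense(data, row, col, shape, fill_value=0):
    """Rebuild a dense 2-D list from COO triplets, filling gaps with fill_value."""
    n_rows, n_cols = shape
    dense = [[fill_value] * n_cols for _ in range(n_rows)]
    for value, i, j in zip(data, row, col):
        dense[i][j] = value
    return dense


dense = [[0, 0, 3.5],
         [1.0, 0, 0],
         [0, 0, 0]]
data, row, col = dense_to_coo(dense)
roundtrip = coo_to_dense(data, row, col, shape=(3, 3))
```

Only the two non-zero cells are stored, which is why converting other sparse formats to COO requires copying their data into this triplet form.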
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -150,7 +150,7 @@ New Behavior: | |
|
||
df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum() | ||
|
||
.. _whatsnew_0200.enhancements.table_schema | ||
.. _whatsnew_0200.enhancements.table_schema: | ||
|
||
Table Schema Output | ||
^^^^^^^^^^^^^^^^^^^ | ||
|
@@ -184,6 +184,30 @@ You must enable this by setting the ``display.html.table_schema`` option to True | |
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/ | ||
.. _nteract: http://nteract.io/ | ||
|
||
.. _whatsnew_0200.enhancements.scipy_sparse: | ||
|
||
SciPy sparse matrix from/to SparseDataFrame | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
add a ref to the new docs you created
||
Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances. See the :ref:`documentation <sparse.scipysparse>` for more information. (:issue:`4343`) | ||
|
||
All sparse formats are supported, but matrices that aren't in :mod:`COOrdinate <scipy.sparse>` format will be converted to it, copying the data as needed. | ||
|
||
.. ipython:: python | ||
|
||
from scipy.sparse import csr_matrix | ||
arr = np.random.random(size=(1000, 5)) | ||
arr[arr < .9] = 0 | ||
sp_arr = csr_matrix(arr) | ||
sp_arr | ||
sdf = pd.SparseDataFrame(sp_arr) | ||
sdf | ||
|
||
To convert a ``SparseDataFrame`` back to a sparse SciPy matrix in COO format, you can use: | ||
|
||
.. ipython:: python | ||
|
||
sdf.to_coo() | ||
|
||
.. _whatsnew_0200.enhancements.other: | ||
|
||
Other enhancements | ||
|
@@ -284,7 +308,7 @@ Using ``.iloc``. Here we will get the location of the 'A' column, then use *posi | |
df.iloc[[0, 2], df.columns.get_loc('A')] | ||
|
||
|
||
.. _whatsnew.api_breaking.io_compat | ||
.. _whatsnew.api_breaking.io_compat: | ||
|
||
Possible incompat for HDF5 formats for pandas < 0.13.0 | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
@@ -705,7 +729,7 @@ Bug Fixes | |
|
||
- Bug in the display of ``.info()`` where a qualifier (+) would always be displayed with a ``MultiIndex`` that contains only non-strings (:issue:`15245`) | ||
|
||
- Bug in ``.asfreq()``, where frequency was not set for empty ``Series` (:issue:`14320`) | ||
- Bug in ``.asfreq()``, where frequency was not set for empty ``Series`` (:issue:`14320`) | ||
|
||
- Bug in ``pd.read_msgpack()`` in which ``Series`` categoricals were being improperly processed (:issue:`14901`) | ||
- Bug in ``Series.ffill()`` with mixed dtypes containing tz-aware datetimes. (:issue:`14956`) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,6 +20,7 @@ | |
is_integer_dtype, | ||
is_bool_dtype, | ||
is_list_like, | ||
is_string_dtype, | ||
is_scalar, is_dtype_equal) | ||
from pandas.types.cast import (_possibly_convert_platform, _maybe_promote, | ||
_astype_nansafe, _find_common_type) | ||
|
@@ -769,14 +770,20 @@ def make_sparse(arr, kind='block', fill_value=None): | |
if isnull(fill_value): | ||
mask = notnull(arr) | ||
else: | ||
# For str arrays in NumPy 1.12.0, operator!= below isn't | ||
# element-wise but just returns False if fill_value is not str, | ||
# so cast to object comparison to be safe | ||
if is_string_dtype(arr): | ||
arr = arr.astype(object) | ||
Perhaps the conversion should happen sooner, but certainly no later. Given the dtype-preserving behavior in SparseArray constructor, I doubt any sooner either.
||
|
||
mask = arr != fill_value | ||
|
||
length = len(arr) | ||
if length != mask.size: | ||
# the arr is a SparseArray | ||
indices = mask.sp_index.indices | ||
else: | ||
indices = np.arange(length, dtype=np.int32)[mask] | ||
indices = mask.nonzero()[0].astype(np.int32) | ||
|
||
index = _make_index(length, indices, kind) | ||
sparsified_values = arr[mask] | ||
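The hunk above swaps `np.arange(length, dtype=np.int32)[mask]` for `mask.nonzero()[0].astype(np.int32)`. Both expressions yield the positions at which the mask is true. The following pure-Python sketch shows that equivalence on a small string example, with an element-wise comparison against the fill value. The function names are hypothetical stand-ins for the NumPy operations, not pandas or NumPy API.

```python
def positions_by_arange(mask):
    # Analogue of np.arange(length)[mask]: build every position, then filter.
    all_positions = list(range(len(mask)))
    return [pos for pos, keep in zip(all_positions, mask) if keep]


def positions_by_nonzero(mask):
    # Analogue of mask.nonzero()[0]: emit only the positions where mask is true.
    return [pos for pos, keep in enumerate(mask) if keep]


values = ["a", "", "b", "", ""]
fill_value = ""

# Element-wise comparison, which is what the object cast is meant to guarantee.
mask = [v != fill_value for v in values]
indices = positions_by_nonzero(mask)
sparsified_values = [values[i] for i in indices]
```

The second form avoids materialising the full range of positions before filtering.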
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,8 +11,8 @@ | |
import numpy as np | ||
|
||
from pandas.types.missing import isnull, notnull | ||
from pandas.types.cast import _maybe_upcast | ||
from pandas.types.common import _ensure_platform_int | ||
from pandas.types.cast import _maybe_upcast, _find_common_type | ||
from pandas.types.common import _ensure_platform_int, is_scipy_sparse | ||
|
||
from pandas.core.common import _try_sort | ||
from pandas.compat.numpy import function as nv | ||
|
@@ -25,6 +25,7 @@ | |
create_block_manager_from_arrays) | ||
import pandas.core.generic as generic | ||
from pandas.sparse.series import SparseSeries, SparseArray | ||
from pandas.sparse.libsparse import BlockIndex, get_blocks | ||
from pandas.util.decorators import Appender | ||
import pandas.core.ops as ops | ||
|
||
|
@@ -39,15 +40,15 @@ class SparseDataFrame(DataFrame): | |
|
||
Parameters | ||
---------- | ||
data : same types as can be passed to DataFrame | ||
data : same types as can be passed to DataFrame or scipy.sparse.spmatrix | ||
index : array-like, optional | ||
column : array-like, optional | ||
default_kind : {'block', 'integer'}, default 'block' | ||
Default sparse kind for converting Series to SparseSeries. Will not | ||
override SparseSeries passed into constructor | ||
default_fill_value : float | ||
Default fill_value for converting Series to SparseSeries. Will not | ||
override SparseSeries passed in | ||
Default fill_value for converting Series to SparseSeries | ||
(default: nan). Will not override SparseSeries passed in. | ||
""" | ||
_constructor_sliced = SparseSeries | ||
_subtyp = 'sparse_frame' | ||
|
@@ -84,22 +85,19 @@ def __init__(self, data=None, index=None, columns=None, default_kind=None, | |
self._default_kind = default_kind | ||
self._default_fill_value = default_fill_value | ||
|
||
if isinstance(data, dict): | ||
mgr = self._init_dict(data, index, columns) | ||
if dtype is not None: | ||
mgr = mgr.astype(dtype) | ||
if is_scipy_sparse(data): | ||
mgr = self._init_spmatrix(data, index, columns, dtype=dtype, | ||
fill_value=default_fill_value) | ||
elif isinstance(data, dict): | ||
mgr = self._init_dict(data, index, columns, dtype=dtype) | ||
elif isinstance(data, (np.ndarray, list)): | ||
mgr = self._init_matrix(data, index, columns) | ||
if dtype is not None: | ||
mgr = mgr.astype(dtype) | ||
mgr = self._init_matrix(data, index, columns, dtype=dtype) | ||
elif isinstance(data, SparseDataFrame): | ||
mgr = self._init_mgr(data._data, | ||
dict(index=index, columns=columns), | ||
dtype=dtype, copy=copy) | ||
elif isinstance(data, DataFrame): | ||
mgr = self._init_dict(data, data.index, data.columns) | ||
if dtype is not None: | ||
mgr = mgr.astype(dtype) | ||
mgr = self._init_dict(data, data.index, data.columns, dtype=dtype) | ||
elif isinstance(data, BlockManager): | ||
mgr = self._init_mgr(data, axes=dict(index=index, columns=columns), | ||
dtype=dtype, copy=copy) | ||
|
@@ -174,7 +172,43 @@ def _init_dict(self, data, index, columns, dtype=None): | |
return to_manager(sdict, columns, index) | ||
|
||
def _init_matrix(self, data, index, columns, dtype=None): | ||
""" Init self from ndarray or list of lists """ | ||
data = _prep_ndarray(data, copy=False) | ||
index, columns = self._prep_index(data, index, columns) | ||
data = dict([(idx, data[:, i]) for i, idx in enumerate(columns)]) | ||
return self._init_dict(data, index, columns, dtype) | ||
|
||
def _init_spmatrix(self, data, index, columns, dtype=None, | ||
fill_value=None): | ||
""" Init self from scipy.sparse matrix """ | ||
index, columns = self._prep_index(data, index, columns) | ||
data = data.tocoo() | ||
N = len(index) | ||
|
||
# Construct a dict of SparseSeries | ||
sdict = {} | ||
values = Series(data.data, index=data.row, copy=False) | ||
for col, rowvals in values.groupby(data.col): | ||
# get_blocks expects int32 row indices in sorted order | ||
rows = rowvals.index.values.astype(np.int32) | ||
rows.sort() | ||
blocs, blens = get_blocks(rows) | ||
|
||
sdict[columns[col]] = SparseSeries( | ||
rowvals.values, index=index, | ||
fill_value=fill_value, | ||
sparse_index=BlockIndex(N, blocs, blens)) | ||
|
||
# Add any columns that were empty and thus not grouped on above | ||
sdict.update({column: SparseSeries(index=index, | ||
fill_value=fill_value, | ||
sparse_index=BlockIndex(N, [], [])) | ||
for column in columns | ||
if column not in sdict}) | ||
|
||
return self._init_dict(sdict, index, columns, dtype) | ||
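The two steps inside `_init_spmatrix` (grouping COO entries by column, then describing each column's stored rows as contiguous blocks) can be sketched in pure Python. This is an illustration under assumptions: `blocks_from_sorted_indices` is a hypothetical stand-in for what `get_blocks` is understood to compute (start positions and lengths of runs of consecutive indices), and `group_coo_by_column` stands in for the `groupby(data.col)` call. Neither is the pandas implementation.

```python
from collections import defaultdict


def blocks_from_sorted_indices(indices):
    """Collapse sorted integer positions into (starts, lengths) of runs."""
    starts, lengths = [], []
    for pos in indices:
        if starts and pos == starts[-1] + lengths[-1]:
            lengths[-1] += 1      # position extends the current run
        else:
            starts.append(pos)    # position begins a new run
            lengths.append(1)
    return starts, lengths


def group_coo_by_column(data, row, col):
    """Map each column number to (sorted rows, values in that row order)."""
    by_col = defaultdict(list)
    for value, i, j in zip(data, row, col):
        by_col[j].append((i, value))
    result = {}
    for j, pairs in by_col.items():
        pairs.sort(key=lambda pair: pair[0])  # keep each value with its row
        result[j] = ([i for i, _ in pairs], [v for _, v in pairs])
    return result


data = [10, 20, 30, 40]
row = [4, 0, 1, 2]
col = [0, 0, 0, 1]
grouped = group_coo_by_column(data, row, col)
blocks = {j: blocks_from_sorted_indices(rows) for j, (rows, _) in grouped.items()}
```

The sketch sorts each value together with its row index, so it does not depend on the order in which the COO entries arrive. Columns with no stored entries never appear in `grouped`, which corresponds to the separate "empty columns" step in the method above.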
|
||
def _prep_index(self, data, index, columns): | ||
N, K = data.shape | ||
if index is None: | ||
index = _default_index(N) | ||
|
@@ -187,9 +221,48 @@ def _init_matrix(self, data, index, columns, dtype=None): | |
if len(index) != N: | ||
raise ValueError('Index length mismatch: %d vs. %d' % | ||
(len(index), N)) | ||
return index, columns | ||
|
||
data = dict([(idx, data[:, i]) for i, idx in enumerate(columns)]) | ||
return self._init_dict(data, index, columns, dtype) | ||
def to_coo(self): | ||
""" | ||
Return the contents of the frame as a sparse SciPy COO matrix. | ||
|
||
.. versionadded:: 0.20.0 | ||
|
||
Returns | ||
------- | ||
coo_matrix : scipy.sparse.spmatrix | ||
If the caller is heterogeneous and contains booleans or objects, | ||
the result will be of dtype=object. See Notes. | ||
|
||
Notes | ||
----- | ||
The dtype will be the lowest-common-denominator type (implicit | ||
mention that this is efficient (is it? does it copy? and if so what), mainly nice to have docs about this (and could refer main docs here as well).
As efficient as What do you mean by referring main docs?
you can put a link in the doc-string itself to something in the main docs (a url). If you want to point to something more involved (this is an option).
Dunno. I just copied these Notes from NDFrame.values, adapting a slight bit.
||
upcasting); that is to say if the dtypes (even of numeric types) | ||
are mixed, the one that accommodates all will be chosen. | ||
|
||
e.g. If the dtypes are float16 and float32, dtype will be upcast to | ||
float32. By numpy.find_common_type convention, mixing int64 and | ||
uint64 will result in a float64 dtype. | ||
""" | ||
try: | ||
from scipy.sparse import coo_matrix | ||
except ImportError: | ||
raise ImportError('Scipy is not installed') | ||
|
||
dtype = _find_common_type(self.dtypes) | ||
cols, rows, datas = [], [], [] | ||
for col, name in enumerate(self): | ||
s = self[name] | ||
row = s.sp_index.to_int_index().indices | ||
cols.append(np.repeat(col, len(row))) | ||
rows.append(row) | ||
datas.append(s.sp_values.astype(dtype, copy=False)) | ||
|
||
cols = np.concatenate(cols) | ||
rows = np.concatenate(rows) | ||
datas = np.concatenate(datas) | ||
return coo_matrix((datas, (rows, cols)), shape=self.shape) | ||
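The assembly loop in `to_coo` above can be read as: for each column, take the row positions that are actually stored, repeat the column number once per stored row, and concatenate everything into three flat sequences. A minimal pure-Python sketch of that idea follows. The function `columns_to_coo` and its input layout (a list of `(row_indices, values)` pairs, one per column) are assumptions made for illustration, and the dtype unification step is omitted.

```python
def columns_to_coo(columns, n_rows):
    """Flatten per-column sparse storage into COO triplets plus a shape."""
    rows, cols, datas = [], [], []
    for col_number, (row_indices, values) in enumerate(columns):
        # One column number per stored row, like np.repeat(col, len(row)).
        cols.extend([col_number] * len(row_indices))
        rows.extend(row_indices)
        datas.extend(values)
    shape = (n_rows, len(columns))
    return datas, rows, cols, shape


columns = [([0, 2], [1.5, 2.5]),   # column 0 stores rows 0 and 2
           ([], []),               # column 1 holds only fill values
           ([1], [7.0])]           # column 2 stores row 1
datas, rows, cols, shape = columns_to_coo(columns, n_rows=3)
```

A column that stores nothing contributes no triplets, but it still counts toward the shape, which is why the shape is passed explicitly rather than inferred from the indices.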
|
||
def __array_wrap__(self, result): | ||
return self._constructor( | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
import pytest | ||
|
||
import pandas.util.testing as tm | ||
|
||
|
||
@pytest.fixture(params=['bsr', 'coo', 'csc', 'csr', 'dia', 'dok', 'lil']) | ||
def spmatrix(request): | ||
tm._skip_if_no_scipy() | ||
from scipy import sparse | ||
return getattr(sparse, request.param + '_matrix') |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,11 +2,17 @@ | |
|
||
import operator | ||
|
||
import pytest | ||
|
||
from numpy import nan | ||
import numpy as np | ||
import pandas as pd | ||
|
||
from pandas import Series, DataFrame, bdate_range, Panel | ||
from pandas.types.common import (is_bool_dtype, | ||
is_float_dtype, | ||
is_object_dtype, | ||
is_float) | ||
from pandas.tseries.index import DatetimeIndex | ||
from pandas.tseries.offsets import BDay | ||
import pandas.util.testing as tm | ||
|
@@ -18,6 +24,8 @@ | |
from pandas.sparse.api import SparseSeries, SparseDataFrame, SparseArray | ||
from pandas.tests.frame.test_misc_api import SharedWithSparse | ||
|
||
from pandas.tests.sparse.common import spmatrix # noqa: F401 | ||
|
||
|
||
class TestSparseDataFrame(tm.TestCase, SharedWithSparse): | ||
|
||
|
@@ -1118,6 +1126,60 @@ def test_isnotnull(self): | |
tm.assert_frame_equal(res.to_dense(), exp) | ||
|
||
|
||
@pytest.mark.parametrize('index', [None, list('ab')]) # noqa: F811 | ||
@pytest.mark.parametrize('columns', [None, list('cd')]) | ||
@pytest.mark.parametrize('fill_value', [None, 0, np.nan]) | ||
@pytest.mark.parametrize('dtype', [object, bool, int, float, np.uint16]) | ||
def test_from_to_scipy(spmatrix, index, columns, fill_value, dtype): | ||
# GH 4343 | ||
tm._skip_if_no_scipy() | ||
|
||
# Make one ndarray and from it one sparse matrix, both to be used for | ||
# constructing frames and comparing results | ||
arr = np.eye(2, dtype=dtype) | ||
try: | ||
spm = spmatrix(arr) | ||
assert spm.dtype == arr.dtype | ||
except (TypeError, AssertionError): | ||
# If conversion to sparse fails for this spmatrix type and arr.dtype, | ||
# then the combination is not currently supported in NumPy, so we | ||
# can just skip testing it thoroughly | ||
return | ||
|
||
sdf = pd.SparseDataFrame(spm, index=index, columns=columns, | ||
default_fill_value=fill_value) | ||
|
||
# Expected result construction is kind of tricky for all | ||
# dtype-fill_value combinations; easiest to cast to something generic | ||
# and except later on | ||
rarr = arr.astype(object) | ||
rarr[arr == 0] = np.nan | ||
expected = pd.SparseDataFrame(rarr, index=index, columns=columns).fillna( | ||
fill_value if fill_value is not None else np.nan) | ||
|
||
# Assert frame is as expected | ||
sdf_obj = sdf.astype(object) | ||
tm.assert_sp_frame_equal(sdf_obj, expected) | ||
tm.assert_frame_equal(sdf_obj.to_dense(), expected.to_dense()) | ||
|
||
# Assert spmatrices equal | ||
tm.assert_equal(dict(sdf.to_coo().todok()), dict(spm.todok())) | ||
|
||
# Ensure dtype is preserved if possible | ||
was_upcast = ((fill_value is None or is_float(fill_value)) and | ||
not is_object_dtype(dtype) and | ||
not is_float_dtype(dtype)) | ||
res_dtype = (bool if is_bool_dtype(dtype) else | ||
float if was_upcast else | ||
dtype) | ||
tm.assert_contains_all(sdf.dtypes, {np.dtype(res_dtype)}) | ||
tm.assert_equal(sdf.to_coo().dtype, res_dtype) | ||
|
||
k this seems reasonable
||
# However, adding a str column results in an upcast to object | ||
sdf['strings'] = np.arange(len(sdf)).astype(str) | ||
tm.assert_equal(sdf.to_coo().dtype, np.object_) | ||
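The dtype expectation encoded in `was_upcast` and `res_dtype` above can be restated as a small standalone rule. The sketch below uses plain Python types instead of NumPy dtypes (so `np.uint16` has no counterpart here), and `expected_result_dtype` is a hypothetical helper that mirrors only what this test asserts, not pandas behaviour in general.

```python
def expected_result_dtype(dtype, fill_value):
    """Return the dtype the test expects for a given input dtype and fill value."""
    # A missing fill value defaults to NaN, and NaN is itself a float.
    fill_is_float_like = fill_value is None or isinstance(fill_value, float)
    was_upcast = fill_is_float_like and dtype not in (object, float)
    if dtype is bool:
        return bool          # the test expects bool to be kept as-is
    return float if was_upcast else dtype


cases = {
    (int, None): expected_result_dtype(int, None),
    (int, 0): expected_result_dtype(int, 0),
    (bool, None): expected_result_dtype(bool, None),
    (object, float("nan")): expected_result_dtype(object, float("nan")),
}
```

In words: an integer frame keeps its dtype only when the fill value is itself an integer, because a NaN fill value cannot be represented in an integer column.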
|
||
|
||
class TestSparseDataFrameArithmetic(tm.TestCase): | ||
|
||
def test_numeric_op_scalar(self): | ||
|
Normally, versionadded parts are appended at the end, but I thought this prominent, hopefully non-experimental feature should be exposed at the top ...
no that's fine, just mention, starting in 0.20.0 (or can use a versionadded tag)