BUG: incorrect handling of scipy.sparse.dok formats #16191

keitakurita · 2017-05-02T03:06:20Z

closes BUG: SparseDataFrame from scipy.sparse.dok_matrix returns incorrect dataframe in python 2.7 #16179
tests added / passed
passes git diff upstream/master --name-only -- '*.py' | flake8 --diff

codecov · 2017-05-02T03:37:11Z

Codecov Report

Merging #16191 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16191      +/-   ##
==========================================
- Coverage   90.38%   90.37%   -0.02%     
==========================================
  Files         167      161       -6     
  Lines       50872    50863       -9     
==========================================
- Hits        45982    45968      -14     
- Misses       4890     4895       +5

Flag	Coverage Δ
#multiple	`88.16% <100%> (ø)`	⬆️
#single	`40.32% <0%> (-0.11%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/sparse/frame.py	`94.26% <100%> (ø)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/io/msgpack/_version.py	`44.86% <0%> (-55.14%)`	⬇️
pandas/core/util/hashing.py	`92.03% <0%> (-7.02%)`	⬇️
pandas/core/computation/expressions.py	`90.47% <0%> (-2.21%)`	⬇️
pandas/core/frame.py	`97.59% <0%> (-0.1%)`	⬇️
pandas/core/indexes/datetimes.py	`95.23% <0%> (-0.1%)`	⬇️
pandas/formats/style.py
pandas/__init__.py
pandas/io/msgpack/__init__.py
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d50f981...6d0c545. Read the comment docs.

jreback · 2017-05-02T10:38:22Z

pandas/core/sparse/frame.py

@@ -191,11 +191,13 @@ def _init_spmatrix(self, data, index, columns, dtype=None,
        for col, rowvals in values.groupby(data.col):
            # get_blocks expects int32 row indices in sorted order
            rows = rowvals.index.values.astype(np.int32)
+            vals = np.array([y for x, y in sorted(rowvals.iteritems())],


this is not right, it needs to be in the order of the index.

Which index do you mean? This code should sort the values by the index of rowvals, or am I mistaken?
Sorry, I'm new to the codebase, so I might be misunderstanding something.

you are passing index below to create

I believe this to be the whole of a correct fix:

@@ -190,8 +190,8 @@ class SparseDataFrame(DataFrame): values = Series(data.data, index=data.row, copy=False) for col, rowvals in values.groupby(data.col): # get_blocks expects int32 row indices in sorted order + rowvals.sort_index(inplace=True) rows = rowvals.index.values.astype(np.int32) - rows.sort() blocs, blens = get_blocks(rows) sdict[columns[col]] = SparseSeries(

@kernc
Thanks, this is indeed a faster and more elegant solution. I've committed this solution to this branch.
Just out of curiosity though, this does not seem to be fundamentally different behavior from my original solution, since they both seem to be sorting on the same index of rowvals. Am I mistaken?

You're exactly right; just somewhat more elegant.

@jreback
Is this solution still problematic? I think sorting the values by their index ensures that the values are in the correct order that aligns them with the index being passed below.

jreback · 2017-05-03T00:52:58Z

pandas/core/sparse/frame.py

@@ -190,8 +190,8 @@ def _init_spmatrix(self, data, index, columns, dtype=None,
        values = Series(data.data, index=data.row, copy=False)
        for col, rowvals in values.groupby(data.col):
            # get_blocks expects int32 row indices in sorted order
+            rowvals.sort_index(inplace=True)


rowvals = rowvals.sort_index() is idiomatic

jreback · 2017-05-03T10:32:42Z

I added the whatsnew for 0.20.1. pls rebase and add a note in bug fixes.

keitakurita · 2017-05-03T11:51:56Z

@jreback
rebased and notes added

jreback · 2017-05-06T23:15:15Z

doc/source/whatsnew/v0.20.1.txt

@@ -66,8 +66,7 @@ Groupby/Resample/Rolling
 Sparse
 ^^^^^^

-
-
+- Bug in construction of SparseDataFrame from ``scipy.sparse.dok_matrix`` (:issue:`16179`)


can you move to 0.20.2

jreback · 2017-05-06T23:16:18Z

can you add the test from the issue as well (with a comment as to the issue number). ping on green.

keitakurita · 2017-05-07T23:55:59Z

@jreback
I've modified the test for the construction of SparseDataFrame from a sparse matrix to actually catch situations where the direct initialization from the array returns a different result. The reason these tests were succeeding in the past was because the array being tested on just happened to return the correct result by chance.
Should I add another, different test for this issue? Or is commenting on this issue where I have changed the tests enough?

jreback · 2017-05-07T23:58:44Z

your change looks fine. I would still like another test that is the exact replica from the issue as well. otherwise lgtm. ping on green.

keitakurita · 2017-05-08T01:15:28Z

@jreback
Thanks, understood and tests have been added

jreback · 2017-05-11T23:00:07Z

thanks!

pandas-dev#16191)

pandas-dev#16191) (cherry picked from commit 1c0b632)

(cherry picked from commit 1c0b632)

pandas-dev#16191)

jreback added the Sparse Sparse Data Type label May 2, 2017

jreback reviewed May 2, 2017

View reviewed changes

jreback reviewed May 3, 2017

View reviewed changes

jreback added this to the 0.20.1 milestone May 3, 2017

keitakurita force-pushed the sparse_dia_matrix branch from bcf41b5 to b314475 Compare May 3, 2017 02:11

jreback changed the title ~~BUG: Fixed GH 16179 and modified tests accordingly~~ BUG: incorrect handling of scipy.sparse.dok formats May 3, 2017

jreback added the Bug label May 3, 2017

keitakurita force-pushed the sparse_dia_matrix branch 2 times, most recently from 20d5b34 to 0ecb2c0 Compare May 3, 2017 11:00

jreback modified the milestone: 0.20.1 May 5, 2017

jreback reviewed May 6, 2017

View reviewed changes

jreback modified the milestone: 0.20.2 May 6, 2017

keitakurita force-pushed the sparse_dia_matrix branch from 0ecb2c0 to a2ced79 Compare May 7, 2017 23:52

BUG: incorrect handling of scipy.sparse.dok formats (pandas-dev#16197)

6d0c545

keitakurita force-pushed the sparse_dia_matrix branch from a2ced79 to 6d0c545 Compare May 8, 2017 01:13

jreback added this to the 0.20.2 milestone May 11, 2017

jreback merged commit 1c0b632 into pandas-dev:master May 11, 2017

pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017

BUG: incorrect handling of scipy.sparse.dok formats (pandas-dev#16197) (

2eb631b

pandas-dev#16191)

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request May 29, 2017

BUG: incorrect handling of scipy.sparse.dok formats (pandas-dev#16197) (

7b41c7f

pandas-dev#16191) (cherry picked from commit 1c0b632)

TomAugspurger added the Backported label May 30, 2017

TomAugspurger removed the Needs Backport label May 30, 2017

TomAugspurger pushed a commit that referenced this pull request May 30, 2017

BUG: incorrect handling of scipy.sparse.dok formats (#16197) (#16191)

02be419

(cherry picked from commit 1c0b632)

stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017

BUG: incorrect handling of scipy.sparse.dok formats (pandas-dev#16197) (

3793c31

pandas-dev#16191)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: incorrect handling of scipy.sparse.dok formats #16191

BUG: incorrect handling of scipy.sparse.dok formats #16191

keitakurita commented May 2, 2017

codecov bot commented May 2, 2017 •

edited

Loading

jreback May 2, 2017

keitakurita May 2, 2017

jreback May 2, 2017

kernc May 2, 2017 •

edited

Loading

keitakurita May 2, 2017

kernc May 2, 2017

keitakurita May 2, 2017

jreback May 3, 2017

keitakurita May 3, 2017

jreback commented May 3, 2017 •

edited

Loading

keitakurita commented May 3, 2017

jreback May 6, 2017

jreback commented May 6, 2017

keitakurita commented May 7, 2017

jreback commented May 7, 2017

keitakurita commented May 8, 2017

jreback commented May 11, 2017

BUG: incorrect handling of scipy.sparse.dok formats #16191

BUG: incorrect handling of scipy.sparse.dok formats #16191

Conversation

keitakurita commented May 2, 2017

codecov bot commented May 2, 2017 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kernc May 2, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented May 3, 2017 • edited Loading

keitakurita commented May 3, 2017

Choose a reason for hiding this comment

jreback commented May 6, 2017

keitakurita commented May 7, 2017

jreback commented May 7, 2017

keitakurita commented May 8, 2017

jreback commented May 11, 2017

codecov bot commented May 2, 2017 •

edited

Loading

kernc May 2, 2017 •

edited

Loading

jreback commented May 3, 2017 •

edited

Loading