Skip to content

BUG: incorrect handling of scipy.sparse.dok formats #16191

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 11, 2017

Conversation

keitakurita
Copy link
Contributor

@codecov
Copy link

codecov bot commented May 2, 2017

Codecov Report

Merging #16191 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16191      +/-   ##
==========================================
- Coverage   90.38%   90.37%   -0.02%     
==========================================
  Files         167      161       -6     
  Lines       50872    50863       -9     
==========================================
- Hits        45982    45968      -14     
- Misses       4890     4895       +5
Flag Coverage Δ
#multiple 88.16% <100%> (ø) ⬆️
#single 40.32% <0%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/frame.py 94.26% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/msgpack/_version.py 44.86% <0%> (-55.14%) ⬇️
pandas/core/util/hashing.py 92.03% <0%> (-7.02%) ⬇️
pandas/core/computation/expressions.py 90.47% <0%> (-2.21%) ⬇️
pandas/core/frame.py 97.59% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.23% <0%> (-0.1%) ⬇️
pandas/formats/style.py
pandas/__init__.py
pandas/io/msgpack/__init__.py
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d50f981...6d0c545. Read the comment docs.

@jreback jreback added the Sparse Sparse Data Type label May 2, 2017
@@ -191,11 +191,13 @@ def _init_spmatrix(self, data, index, columns, dtype=None,
for col, rowvals in values.groupby(data.col):
# get_blocks expects int32 row indices in sorted order
rows = rowvals.index.values.astype(np.int32)
vals = np.array([y for x, y in sorted(rowvals.iteritems())],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not right, it needs to be in the order of the index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which index do you mean? This code should sort the values by the index of rowvals, or am I mistaken?
Sorry, I'm new to the codebase, so I might be misunderstanding something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are passing index below to create

Copy link
Contributor

@kernc kernc May 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this to be the whole of a correct fix:

@@ -190,8 +190,8 @@ class SparseDataFrame(DataFrame):
         values = Series(data.data, index=data.row, copy=False)
         for col, rowvals in values.groupby(data.col):
             # get_blocks expects int32 row indices in sorted order
+            rowvals.sort_index(inplace=True)
             rows = rowvals.index.values.astype(np.int32)
-            rows.sort()
             blocs, blens = get_blocks(rows)
 
             sdict[columns[col]] = SparseSeries(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kernc
Thanks, this is indeed a faster and more elegant solution. I've committed this solution to this branch.
Just out of curiosity though, this does not seem to be fundamentally different behavior from my original solution, since they both seem to be sorting on the same index of rowvals. Am I mistaken?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're exactly right; just somewhat more elegant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Is this solution still problematic? I think sorting the values by their index ensures that the values are in the correct order that aligns them with the index being passed below.

@@ -190,8 +190,8 @@ def _init_spmatrix(self, data, index, columns, dtype=None,
values = Series(data.data, index=data.row, copy=False)
for col, rowvals in values.groupby(data.col):
# get_blocks expects int32 row indices in sorted order
rowvals.sort_index(inplace=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rowvals = rowvals.sort_index() is idiomatic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

@jreback jreback added this to the 0.20.1 milestone May 3, 2017
@keitakurita keitakurita force-pushed the sparse_dia_matrix branch from bcf41b5 to b314475 Compare May 3, 2017 02:11
@jreback jreback changed the title BUG: Fixed GH 16179 and modified tests accordingly BUG: incorrect handling of scipy.sparse.dok formats May 3, 2017
@jreback jreback added the Bug label May 3, 2017
@jreback
Copy link
Contributor

jreback commented May 3, 2017

I added the whatsnew for 0.20.1. pls rebase and add a note in bug fixes.

@keitakurita keitakurita force-pushed the sparse_dia_matrix branch 2 times, most recently from 20d5b34 to 0ecb2c0 Compare May 3, 2017 11:00
@keitakurita
Copy link
Contributor Author

@jreback
rebased and notes added

@jreback jreback modified the milestone: 0.20.1 May 5, 2017
@@ -66,8 +66,7 @@ Groupby/Resample/Rolling
Sparse
^^^^^^



- Bug in construction of SparseDataFrame from ``scipy.sparse.dok_matrix`` (:issue:`16179`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to 0.20.2

@jreback jreback modified the milestone: 0.20.2 May 6, 2017
@jreback
Copy link
Contributor

jreback commented May 6, 2017

can you add the test from the issue as well (with a comment as to the issue number). ping on green.

@keitakurita keitakurita force-pushed the sparse_dia_matrix branch from 0ecb2c0 to a2ced79 Compare May 7, 2017 23:52
@keitakurita
Copy link
Contributor Author

@jreback
I've modified the test for the construction of SparseDataFrame from a sparse matrix to actually catch situations where the direct initialization from the array returns a different result. The reason these tests were succeeding in the past was because the array being tested on just happened to return the correct result by chance.
Should I add another, different test for this issue? Or is commenting on this issue where I have changed the tests enough?

@jreback
Copy link
Contributor

jreback commented May 7, 2017

your change looks fine. I would still like another test that is the exact replica from the issue as well. otherwise lgtm. ping on green.

@keitakurita keitakurita force-pushed the sparse_dia_matrix branch from a2ced79 to 6d0c545 Compare May 8, 2017 01:13
@keitakurita
Copy link
Contributor Author

@jreback
Thanks, understood and tests have been added

@jreback jreback added this to the 0.20.2 milestone May 11, 2017
@jreback jreback merged commit 1c0b632 into pandas-dev:master May 11, 2017
@jreback
Copy link
Contributor

jreback commented May 11, 2017

thanks!

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request May 29, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: SparseDataFrame from scipy.sparse.dok_matrix returns incorrect dataframe in python 2.7
5 participants