BUG: incorrect handling of scipy.sparse.dok formats #16191
Codecov Report
@@ Coverage Diff @@
## master #16191 +/- ##
==========================================
- Coverage 90.38% 90.37% -0.02%
==========================================
Files 167 161 -6
Lines 50872 50863 -9
==========================================
- Hits 45982 45968 -14
- Misses 4890 4895 +5
pandas/core/sparse/frame.py
Outdated
@@ -191,11 +191,13 @@ def _init_spmatrix(self, data, index, columns, dtype=None,
for col, rowvals in values.groupby(data.col):
    # get_blocks expects int32 row indices in sorted order
    rows = rowvals.index.values.astype(np.int32)
    vals = np.array([y for x, y in sorted(rowvals.iteritems())],
this is not right, it needs to be in the order of the index.
Which index do you mean? This code should sort the values by the index of rowvals, or am I mistaken?
Sorry, I'm new to the codebase, so I might be misunderstanding something.
you are passing index below to create
I believe this to be the whole of a correct fix:
@@ -190,8 +190,8 @@ class SparseDataFrame(DataFrame):
 values = Series(data.data, index=data.row, copy=False)
 for col, rowvals in values.groupby(data.col):
     # get_blocks expects int32 row indices in sorted order
+    rowvals.sort_index(inplace=True)
     rows = rowvals.index.values.astype(np.int32)
-    rows.sort()
     blocs, blens = get_blocks(rows)
     sdict[columns[col]] = SparseSeries(
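For readers following the thread, here is a minimal sketch of why sorting only the row indices is a problem. It uses a plain `Series` with made-up COO-style triplets rather than the real `SparseDataFrame` internals; the variable names and sample data are illustrative assumptions, not taken from the issue.

```python
import numpy as np
import pandas as pd

# Hypothetical triplets for a single column, with row indices out of order.
row = np.array([2, 0, 1])
data = np.array([30.0, 10.0, 20.0])
rowvals = pd.Series(data, index=row)

# Old approach: astype() returns a copy, so sorting it reorders only the
# row indices. The values keep their original order and no longer line up.
rows_old = rowvals.index.values.astype(np.int32)
rows_old.sort()
vals_old = rowvals.values

# Approach from the diff above: sort the Series by its index first, so the
# row indices and the values are reordered together.
sorted_rowvals = rowvals.sort_index()
rows_new = sorted_rowvals.index.values.astype(np.int32)
vals_new = sorted_rowvals.values

print(list(zip(rows_old.tolist(), vals_old.tolist())))  # misaligned pairs
print(list(zip(rows_new.tolist(), vals_new.tolist())))  # aligned pairs
```

In the old approach row 0 ends up paired with the value that belonged to row 2, which is the kind of misalignment the sort is meant to prevent.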
@kernc
Thanks, this is indeed a faster and more elegant solution. I've committed this solution to this branch.
Just out of curiosity though, this does not seem to be fundamentally different behavior from my original solution, since they both seem to be sorting on the same index of rowvals. Am I mistaken?
You're exactly right; just somewhat more elegant.
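A quick way to check that the two approaches agree is to compare them on a small `Series`. This is a sketch under the assumption that the row labels within a column are unique; the sample data is made up, and `.items()` stands in for `.iteritems()`, which has been removed from recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical values for one column, indexed by unsorted row labels.
rowvals = pd.Series([30.0, 10.0, 20.0], index=[2, 0, 1])

# Original proposal: sort the (index, value) pairs and keep the values.
vals_from_sorted_items = np.array([y for x, y in sorted(rowvals.items())])

# Suggested replacement: let pandas sort by the index.
vals_from_sort_index = rowvals.sort_index().values

print(np.array_equal(vals_from_sorted_items, vals_from_sort_index))
```

Both produce the values ordered by row label; `sort_index` avoids the Python-level loop over pairs.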
@jreback
Is this solution still problematic? I think sorting the values by their index ensures that the values are in the correct order that aligns them with the index being passed below.
pandas/core/sparse/frame.py
Outdated
@@ -190,8 +190,8 @@ def _init_spmatrix(self, data, index, columns, dtype=None,
values = Series(data.data, index=data.row, copy=False)
for col, rowvals in values.groupby(data.col):
    # get_blocks expects int32 row indices in sorted order
    rowvals.sort_index(inplace=True)
rowvals = rowvals.sort_index()
is idiomatic
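A small sketch of the difference between the two spellings, using a made-up `Series` (the names here are illustrative only): the reassignment form returns a new sorted object and leaves the input alone, while `inplace=True` mutates the object and returns `None`.

```python
import pandas as pd

original = pd.Series([30.0, 10.0], index=[1, 0])

# Reassignment form: a new, sorted Series; `original` is untouched.
result = original.sort_index()

# In-place form: mutates the Series it is called on and returns None.
mutated = pd.Series([30.0, 10.0], index=[1, 0])
returned = mutated.sort_index(inplace=True)

print(original.index.tolist(), result.index.tolist(), returned)
```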
fixed!
bcf41b5 to b314475
I added the whatsnew for 0.20.1. pls rebase and add a note in bug fixes.
20d5b34 to 0ecb2c0
@jreback
doc/source/whatsnew/v0.20.1.txt
Outdated
@@ -66,8 +66,7 @@ Groupby/Resample/Rolling
Sparse
^^^^^^

- Bug in construction of SparseDataFrame from ``scipy.sparse.dok_matrix`` (:issue:`16179`)
can you move to 0.20.2
can you add the test from the issue as well (with a comment as to the issue number). ping on green.
0ecb2c0 to a2ced79
@jreback
your change looks fine. I would still like another test that is the exact replica from the issue as well. otherwise lgtm. ping on green.
a2ced79 to 6d0c545
@jreback
thanks!
pandas-dev#16191) (cherry picked from commit 1c0b632)
git diff upstream/master --name-only -- '*.py' | flake8 --diff