
cross section coercion with output iterating #12859


Closed
tsu-shiuan opened this issue Apr 11, 2016 · 11 comments · Fixed by #41431
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@tsu-shiuan

I am trying to call the to_dict function on the following DataFrame:

import pandas as pd

data = {"a": [1,2,3,4,5], "b": [90,80,40,60,30]}

df = pd.DataFrame(data)
df
   a   b
0  1  90
1  2  80
2  3  40
3  4  60
4  5  30
df.reset_index().to_dict("r")
[{'a': 1, 'b': 90, 'index': 0},
 {'a': 2, 'b': 80, 'index': 1},
 {'a': 3, 'b': 40, 'index': 2},
 {'a': 4, 'b': 60, 'index': 3},
 {'a': 5, 'b': 30, 'index': 4}]

However, my problem occurs if I perform a float operation on the DataFrame, which turns the index values into floats:

(df*1.0).reset_index().to_dict("r")
[{'a': 1.0, 'b': 90.0, 'index': 0.0},  
{'a': 2.0, 'b': 80.0, 'index': 1.0},  
{'a': 3.0, 'b': 40.0, 'index': 2.0},  
{'a': 4.0, 'b': 60.0, 'index': 3.0},  
{'a': 5.0, 'b': 30.0, 'index': 4.0}]

Can anyone explain the above behaviour or recommend a workaround, or verify whether or not this could be a pandas bug? None of the other outtypes in the to_dict method mutates the index as shown above.

I've replicated this on both pandas 0.14 and 0.18 (latest)

Many thanks!

link to stackoverflow: http://stackoverflow.com/questions/36548151/pandas-to-dict-changes-index-type-with-outtype-records

@TomAugspurger
Contributor

Nothing to do with the index, just the fact that you have any float dtypes in the data

data = {"a": [1.0,2,3,4,5], "b": [90,80,40,60,30]}
df = pd.DataFrame(data)
In [19]: df.to_dict("records")
Out[19]:
[{'a': 1.0, 'b': 90.0},
 {'a': 2.0, 'b': 80.0},
 {'a': 3.0, 'b': 40.0},
 {'a': 4.0, 'b': 60.0},
 {'a': 5.0, 'b': 30.0}]

If you look at the code, we use DataFrame.values, which returns a NumPy array, which must have a single dtype (float64 in this case).

We probably don't need to use .values here.
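A minimal sketch of the upcast described above, assuming a recent pandas and NumPy. The variable names are illustrative only. It shows that the columns keep separate dtypes while `.values` collapses them into one float64 array.

```python
import pandas as pd

# Column "a" holds a float, column "b" holds only integers.
df = pd.DataFrame({"a": [1.0, 2, 3, 4, 5], "b": [90, 80, 40, 60, 30]})

# Each column keeps its own dtype inside the DataFrame.
column_dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}
print(column_dtypes)

# .values must return one NumPy array with a single dtype,
# so the integer column is upcast to float64 along with the rest.
values_dtype = str(df.values.dtype)
print(values_dtype)

# Every element of the first row is therefore a float, including b's 90.
first_row = df.values[0].tolist()
print(first_row)
```

Any code path that builds row dictionaries from `.values` inherits this single dtype, which is why the integers come back as floats.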

@TomAugspurger TomAugspurger added Difficulty Novice Dtype Conversions Unexpected or buggy dtype conversions labels Apr 11, 2016
@TomAugspurger TomAugspurger added this to the 0.19.0 milestone Apr 11, 2016
@tsu-shiuan
Author

Thanks for your response. Is there a possible workaround that I can use in the meantime?

@TomAugspurger
Contributor

Something like

In [28]: [x._asdict() for x in df.itertuples()]
Out[28]:
[OrderedDict([('Index', 0), ('a', 1.0), ('b', 90)]),
 OrderedDict([('Index', 1), ('a', 2.0), ('b', 80)]),
 OrderedDict([('Index', 2), ('a', 3.0), ('b', 40)]),
 OrderedDict([('Index', 3), ('a', 4.0), ('b', 60)]),
 OrderedDict([('Index', 4), ('a', 5.0), ('b', 30)])]

That's an OrderedDict using namedtuple._asdict; you can write a dict comprehension if you want a regular one.
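The dict comprehension mentioned above could look like the following sketch. It assumes the column names are valid Python identifiers (otherwise `itertuples` renames the fields positionally), and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2, 3, 4, 5], "b": [90, 80, 40, 60, 30]})

# Build one plain dict per row by pairing the namedtuple's field names
# with its values. Each column is iterated separately, so "b" is not
# upcast to float the way it is when going through df.values.
records = [
    {field: value for field, value in zip(row._fields, row)}
    for row in df.itertuples()
]
print(records[0])

# Pass index=False to leave the index out of each record.
records_no_index = [
    {field: value for field, value in zip(row._fields, row)}
    for row in df.itertuples(index=False)
]
print(records_no_index[0])
```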

@jreback jreback modified the milestones: No action, 0.19.0 Apr 11, 2016
@tsu-shiuan
Author

Thanks :)

@jreback
Contributor

jreback commented Apr 11, 2016

Though one could argue that this result is correct as we don't support mixed types in int-float when doing a cross-section, IOW:

In [10]: (df*1.0).reset_index().iloc[1]
Out[10]: 
index     1.0
a         2.0
b        80.0
Name: 1, dtype: float64

This is somewhat related to #12532, meaning that we should be iterating directly over (which already does the proper coercion), rather than doing a specific coercion in .to_dict().

@jreback jreback changed the title from "to_dict with outtype='records' mutates the index type" to "cross section coercion with output iterating" Apr 11, 2016
@jreback
Contributor

jreback commented Apr 11, 2016

Going to mark this as an API issue that needs discussion. Changing this correctly would actually be a fairly large change (though to be honest I think the current behavior is fine).

@jreback jreback added API Design Needs Discussion Requires discussion from core team before further action labels Apr 11, 2016
@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@makmanalp
Contributor

makmanalp commented May 25, 2017

Note for future seekers - I'm trying to combine multiple pandas objects into one nested json structure.

Since to_json doesn't work in this case (manipulating json strings is hard), you might try to do to_dict(orient="records"), and combine the results of the to_dict()s into a bigger object, and do json.dumps on that. But because of this bug, you can't do that without screwing with the types of everything.

So then you might try doing @TomAugspurger's solution but you might find that for some reason it won't convert numpy types to python types, like to_dict() does, which makes json.dumps() fail.
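As a small illustration of the failure mode mentioned above: `json.dumps` rejects NumPy integer scalars. Whether a given pandas call actually returns NumPy scalars depends on the pandas version, so this sketch constructs one directly with NumPy rather than claiming a particular pandas behavior.

```python
import json

import numpy as np

# A NumPy integer scalar, built by hand for illustration.
value = np.int64(90)

# json.dumps only knows the built-in Python types, so this raises TypeError.
try:
    json.dumps({"b": value})
    dumps_failed = False
except TypeError:
    dumps_failed = True
print(dumps_failed)

# .item() converts the NumPy scalar to the matching built-in Python type.
as_python = json.dumps({"b": value.item()})
print(as_python)
```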

My workaround solution is to do to_json() which gives you a correct json string with correct types, then do json.loads() on that to get python objects corresponding to that string, which you then put together whichever way you want (e.g. big_obj = {"a": df_a_json, "b": df_b_json}) and then run json.dumps on the whole thing. It's roundabout but it's the closest general solution I found without having to muck about with type conversions myself!

def to_records(df):
    """Replacement for pandas' to_dict(orient="records") which has trouble with
    upcasting ints to floats in the case of other floats being there.

    https://github.com/pandas-dev/pandas/issues/12859
    """
    import json
    return json.loads(df.to_json(orient="records"))
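A possible usage of the helper above for the nested-JSON case it was written for. The frames `df_a` and `df_b` are made-up examples, and `to_records` is repeated here only so that the sketch runs on its own.

```python
import json

import pandas as pd


def to_records(df):
    """Round-trip through to_json so ints and floats keep their own types."""
    return json.loads(df.to_json(orient="records"))


# Two illustrative frames; the first mixes a float and an int column.
df_a = pd.DataFrame({"a": [1.0, 2.0], "b": [90, 80]})
df_b = pd.DataFrame({"x": [1, 2]})

# Combine the per-frame records into one nested object.
big_obj = {"a": to_records(df_a), "b": to_records(df_b)}

# Everything inside is a plain Python object, so json.dumps succeeds.
payload = json.dumps(big_obj)
print(payload)
```

The round trip costs an extra serialization pass, which may matter for large frames.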

@TomAugspurger
Contributor

👍 There is an issue somewhere about .to_dict using Python types.

@makmanalp
Contributor

Seems like #16048 and #13258

@gosuto-inzasheru

Really good workaround is found here: https://stackoverflow.com/a/31375241/1838257

df = pd.DataFrame({'INTS': [1, 2, 3], 'FLOATS': [1., 2., 3.]})

df.iloc[0].to_dict()
{'INTS': 1.0, 'FLOATS': 1.0}

Using the workaround:

df.astype('object').iloc[0].to_dict()
{'INTS': 1, 'FLOATS': 1.0}

Could this be implemented in a flag of .to_dict maybe?
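A sketch of how the `astype('object')` workaround above might be wrapped until such a flag exists. The helper name `row_to_dict` is hypothetical and is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"INTS": [1, 2, 3], "FLOATS": [1.0, 2.0, 3.0]})

# A row cross-section of mixed int/float columns is a float64 Series,
# so the integer comes back as a float.
upcast_row = df.iloc[0].to_dict()
print(upcast_row)


def row_to_dict(frame, position):
    """Hypothetical helper: return one row as a dict without upcasting.

    Casting to object first makes every cell a standalone Python object,
    so the row Series has object dtype and no common numeric dtype is needed.
    """
    return frame.astype(object).iloc[position].to_dict()


object_row = row_to_dict(df, 0)
print(object_row)
```

Note that `astype(object)` copies the whole frame, so calling the helper in a loop over many rows may be slow.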

@mroeschke
Member

The original example has the correct behavior of the index values remaining as integers. Could use a test.

In [26]: (df*1.0).reset_index().to_dict("records")
Out[26]:
[{'index': 0, 'a': 1.0, 'b': 90.0},
 {'index': 1, 'a': 2.0, 'b': 80.0},
 {'index': 2, 'a': 3.0, 'b': 40.0},
 {'index': 3, 'a': 4.0, 'b': 60.0},
 {'index': 4, 'a': 5.0, 'b': 30.0}]
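A regression test for the output above might look roughly like this. The test name is hypothetical and this is not necessarily the test that was added upstream. It assumes a pandas version that already behaves as shown.

```python
import pandas as pd


def test_to_dict_records_keeps_int_index():
    # Hypothetical test name, written as a sketch for this issue.
    df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [90, 80, 40, 60, 30]})

    # Multiplying by 1.0 turns the data columns into floats, while
    # reset_index adds the integer index as a regular column.
    records = (df * 1.0).reset_index().to_dict("records")

    # The index values must stay integers and must not be upcast to float.
    assert [r["index"] for r in records] == [0, 1, 2, 3, 4]
    assert all(isinstance(r["index"], int) for r in records)

    # The data columns are floats, as expected after the multiplication.
    assert all(isinstance(r["a"], float) for r in records)
    assert all(isinstance(r["b"], float) for r in records)


test_to_dict_records_keeps_int_index()
```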

@mroeschke mroeschke added Needs Tests Unit test(s) needed to prevent regressions and removed API Design Dtype Conversions Unexpected or buggy dtype conversions Needs Discussion Requires discussion from core team before further action labels Apr 23, 2021
7 participants