BUG: convert_dtypes() doesn't convert after a previous conversion was done #58543
Comments
>>> df = pd.DataFrame({'column': [0.0, 1.0, 2.0]})
>>> df.dtypes
column float64
dtype: object
>>> df.convert_dtypes()
column
0 0
1 1
2 2
>>> df.convert_dtypes().dtypes
column Int64
dtype: object

I can confirm that the issue is reproducible and not intended behavior. When creating a DataFrame that has the same data as newdf, the intended behavior is shown above.
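The confirmation above can be checked side by side. This is a minimal sketch, reusing the column name from the example: a fresh numpy-backed float64 column of whole numbers converts to Int64, while the same values already stored as Float64 are left alone by a second `convert_dtypes()` call.

```python
import pandas as pd

# Fresh numpy-backed float64 column of whole numbers
fresh = pd.DataFrame({'column': [0.0, 1.0, 2.0]})
print(fresh.convert_dtypes().dtypes)    # column    Int64

# Same values, but already in the nullable Float64 extension dtype
already = pd.DataFrame({'column': [0.0, 1.0, 2.0]}, dtype='Float64')
print(already.convert_dtypes().dtypes)  # column    Float64
```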
Thanks for the issue @caballerofelipe, but this is the expected behavior of `convert_dtypes`. I believe the functionality you're expecting is in […]
@mroeschke don't you think the doc is incorrect though?
I guess "best possible" is a bit too subjective, so I wouldn't say incorrect so much as unclear. A doc improvement changing "best possible" to "convert a numpy type to a type that supports pd.NA" would probably be better.
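That suggested doc wording can be illustrated with a short sketch: `convert_dtypes()` maps each numpy dtype to its pd.NA-capable counterpart, rather than re-inferring a "simpler" type for columns already in an extension dtype.

```python
import numpy as np
import pandas as pd

# Each numpy-backed column is converted to the nullable dtype
# that supports pd.NA for its kind.
df = pd.DataFrame({
    'i': np.array([1, 2], dtype='int64'),
    'f': np.array([1.5, 2.5], dtype='float64'),
    'b': np.array([True, False]),
})
converted = df.convert_dtypes()
print(converted.dtypes)  # i: Int64, f: Float64, b: boolean
```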
I believe using Int64 instead of Float64 is "best" when I don't need a decimal number. For instance, from the point of view of legibility, it's easier to read an int than a number with a point and a zero (without doing some formatting). Also, the maximum integers representable exactly are bigger. Is there a processing reason for not changing from Float64 to Int64; is it expensive somehow? (Not a rhetorical question, I don't know the answer.) Also, is it more expensive than going from float64 (lowercase f) to Int64 (capital I)? And maybe the function could have a parameter to make it do what I thought it was going to do?
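The point about the maximum representable numbers can be made concrete. A small sketch: int64 is exact across its whole range, while float64 represents every integer only up to 2**53, after which whole numbers start getting skipped.

```python
import numpy as np

# int64 covers integers exactly up to its max
print(np.iinfo(np.int64).max)  # 9223372036854775807

# float64 loses integer precision past 2**53:
# 2**53 + 1 rounds back to 2**53
print(2.0**53 == 2.0**53 + 1)  # True
```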
So I found a workaround for what I want: allow pandas to change to int64 when no decimals are present. In Step 6 (in the original post), instead of calling convert_dtypes() directly, cast to object first.

Full Example

import pandas as pd

df = pd.DataFrame({'column': [0.0, 1.0, 2.0, 3.3]})
df = df.convert_dtypes()
print(df.dtypes)
# Returns
# column Float64
# dtype: object
newdf = df.iloc[:-1]
print(newdf)
# Returns
# column
# 0 0.0
# 1 1.0
# 2 2.0
newdf_convert = newdf.convert_dtypes()
print(newdf_convert.dtypes)
print(newdf_convert)
# Returns
# column Float64
# dtype: object
# column
# 0 0.0
# 1 1.0
# 2 2.0
newdf_astype_convert = newdf.astype('object').convert_dtypes()
print(newdf_astype_convert.dtypes)
print(newdf_astype_convert)
# Returns
# column Int64
# dtype: object
# column
# 0 0
# 1 1
# 2 2
# You could also use a more complex way to obtain int64 (lower i) or float64 (lower f)
newdf_astype_convert_int64 = (
newdf
.astype('object')
.convert_dtypes() # To dtype with pd.NA
.astype('object')
.replace(pd.NA, float('nan')) # Remove pd.NA created before
.infer_objects()
)
print(newdf_astype_convert_int64.dtypes)
print(newdf_astype_convert_int64)
# Returns
# column int64
# dtype: object
# column
# 0 0
# 1 1
# 2 2

The function convert_dtypes could have a parameter 'simplify_dtypes' (or maybe a more fitting keyword that I haven't thought of) that would do the same thing without much implementation effort: internally do the .astype('object') round trip before converting. Also, you could use this to simplify "even further" to int64 (lowercase i) or float64 (lowercase f), see the full example.
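The proposed `simplify_dtypes` keyword does not exist in pandas; as a sketch, the workaround from the full example can be wrapped in a hypothetical helper (the name `convert_dtypes_simplified` is made up here):

```python
import pandas as pd

def convert_dtypes_simplified(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper approximating the proposed simplify_dtypes=True:
    round-trip through object so convert_dtypes() re-infers each column."""
    return df.astype('object').convert_dtypes()

# A Float64 column of whole numbers is re-inferred as Int64
newdf = pd.DataFrame({'column': [0.0, 1.0, 2.0]}, dtype='Float64')
print(convert_dtypes_simplified(newdf).dtypes)  # column    Int64
```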
- In function `_cols_operation_balance_by_instrument_for_group`, changed `prev_operation_balance[<colname>]` to `df.loc[prev_idx, <colname>]`, as this is easier to understand: it shows that we are accessing the previous index value.
- Implemented the usage of `with pd.option_context('future.no_silent_downcasting', True):` for `.fillna()` to avoid unexpected downcasting. See pandas-dev/pandas#57734 (comment). Used throughout `cols_operation*` functions.
- Removed usage of `DataFrame.convert_dtypes()` as it doesn't simplify dtypes; it only converts to a dtype that supports pd.NA. See pandas-dev/pandas#58543 .
- Added `DataFrame.infer_objects()` when returning the ledger or `cols_operation*` functions to try to avoid object dtypes if possible.
- Changed the structure of the `cols_operation*` functions:
  - Added a verification of `self._ledger_df`; if empty, the function returns an empty DataFrame with the structure needed. Allows for less computing when empty.
  - The way the parameter `show_instr_accnt` creates a return with columns ['instrument', 'account'] is structured the same way in all functions.
- Simplified how the empty ledger is created in `_create_empty_ledger_df`.
- Changed column name 'balance sell profit loss' to 'accumulated sell profit loss'.
- Minor code fixes.
- Minor formatting fixes.
- Added the kwarg `simplify_dtypes` to the functions `ledger`, `cols_operation`, `cols_operation_cumsum` and `cols_operation_balance_by_instrument`. This allows for dtype simplification (see pandas-dev/pandas#58543 (comment)).
- Added some docstrings.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Step 1
Returns
Step 2
Returns
Step 3
Step 4
Returns
Step 5
Returns
Step 6
Returns
Step 7
Returns
Issue Description
When having a column in a DataFrame with decimal numbers and using
convert_dtypes
, the type for that column is correctly transformed from float64 to Float64 (capital F).However, intuitively, when removing the numbers that have a decimal part and running again
convert_dtypes
, this functions should convert to Int64 (capital I) instead of keeping Float64.Expected Behavior
convert_dtypes
should convert from Float64 to Int64 if the numbers in the column don't have a decimal part.Installed Versions