
BUG: convert_dtypes() doesn't convert after a previous conversion was done #58543


Closed
3 tasks done
caballerofelipe opened this issue May 3, 2024 · 7 comments
Labels
Bug · Needs Triage (Issue that has not been reviewed by a pandas team member)

Comments

@caballerofelipe

caballerofelipe commented May 3, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Step 1

df = pd.DataFrame({'column': [0.0, 1.0, 2.0, 3.3]})
df

Returns

  column
0 0.0
1 1.0
2 2.0
3 3.3

Step 2

df.dtypes

Returns

column    float64
dtype: object

Step 3

df = df.convert_dtypes()

Step 4

df.dtypes

Returns

column    Float64
dtype: object

Step 5

# Select only rows without a decimal part
newdf = df.iloc[:-1]
newdf

Returns

	column
0	0.0
1	1.0
2	2.0

Step 6

newdf.convert_dtypes()

Returns

	column
0	0.0
1	1.0
2	2.0

Step 7

newdf.convert_dtypes().dtypes

Returns

column    Float64
dtype: object

Issue Description

When a DataFrame column contains decimal numbers and convert_dtypes is used, the dtype for that column is correctly converted from float64 to Float64 (capital F).

However, intuitively, after removing the rows whose values have a decimal part and running convert_dtypes again, the function should convert the column to Int64 (capital I) instead of keeping Float64.

Expected Behavior

convert_dtypes should convert from Float64 to Int64 if the numbers in the column don't have a decimal part.

Installed Versions

INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.11.7.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.4.0
Version               : Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : None.UTF-8

pandas                : 2.2.2
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0
setuptools            : 69.5.1
pip                   : 24.0
Cython                : 3.0.8
pytest                : 8.0.0
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : 3.1.9
lxml.etree            : 5.1.0
html5lib              : 1.1
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.21.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : 3.8.2
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
pyarrow               : None
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.12.0
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
caballerofelipe added the Bug and Needs Triage labels on May 3, 2024
@Nrezhang
Contributor

Nrezhang commented May 3, 2024

>>> df = pd.DataFrame({'column': [0.0, 1.0, 2.0]})
>>> df.dtypes
column    float64
dtype: object
>>> df.convert_dtypes()
   column
0       0
1       1
2       2
>>> df.convert_dtypes().dtypes
column    Int64
dtype: object

I can confirm that the issue is reproducible and not the intended behavior. When creating a DataFrame directly with the same data as newdf, the intended behavior is shown above.

@Aloqeely
Member

Aloqeely commented May 3, 2024

It seems like convert_dtypes does not do any conversion if the existing dtypes already support pd.NA.
This might be intended, because the original point of convert_dtypes was to encourage users to use pandas ExtensionDtypes instead of numpy dtypes, but that conflicts with the documentation: "Convert columns to the best possible dtypes"
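
A minimal sketch of that observation, using the same values as the report (expected output noted in comments):

import pandas as pd

# Starting from a NumPy float64 column, convert_dtypes picks Int64
# because every value is integral.
s = pd.Series([0.0, 1.0, 2.0])
print(s.convert_dtypes().dtype)  # Int64

# Starting from an already-nullable Float64 column, convert_dtypes
# leaves the dtype untouched even though the values are the same.
s_nullable = pd.Series([0.0, 1.0, 2.0], dtype='Float64')
print(s_nullable.convert_dtypes().dtype)  # Float64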

@mroeschke
Member

Thanks for the issue @caballerofelipe, but this is the expected behavior of convert_dtypes. As mentioned, it's only intended to convert to a dtype that supports pd.NA.

I believe the functionality you're expecting is in to_numeric(downcast=) (https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html), so closing.
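
For reference, a rough sketch of the to_numeric(downcast=) route (the per-column apply is just one way to run it over a DataFrame; the exact integer width of the result depends on the values):

import pandas as pd

df = pd.DataFrame({'column': pd.array([0.0, 1.0, 2.0], dtype='Float64')})

# to_numeric operates on a Series, so apply it column by column;
# downcast='integer' moves integral floats to the smallest fitting integer dtype.
downcasted = df.apply(pd.to_numeric, downcast='integer')
print(downcasted.dtypes)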

@Aloqeely
Member

Aloqeely commented May 3, 2024

@mroeschke don't you think the doc is incorrect, though?
It says it converts columns to the best possible dtypes that support pd.NA, but that is not actually the case; if it were, it would have converted from Float64 to Int64.

@mroeschke
Member

I guess "best possible" is a bit too subjective, so I would call it unclear rather than incorrect. A doc improvement changing "best possible" to "convert a numpy type to a type that supports pd.NA" would probably be better.

@caballerofelipe
Author

caballerofelipe commented May 4, 2024

I believe that being able to use Int64 instead of Float64 is "best" (when I don't need decimal numbers): for instance, from the point of view of legibility it's easier to read an int than a number with a point and a zero (without applying some formatting). Also, the maximum representable numbers are bigger.

Is there a processing reason for not changing from Float64 to Int64? Is it somehow expensive? (Not a rhetorical question, I don't know the answer.)

Also, is it more expensive than going from float64 (lowercase f) to Int64 (capital I)?

Also, maybe the function could have a parameter to make it do what I thought it was going to do?

@caballerofelipe
Author

caballerofelipe commented May 8, 2024

So I found a workaround for what I want: allowing pandas to change to an integer dtype when no decimals are present.

In Step 6 (in the original post), instead of doing newdf.convert_dtypes(), to force a simpler dtype you can do newdf.astype('object').convert_dtypes(). It's one more step than I would have liked, but it works.

Full Example
df = pd.DataFrame({'column': [0.0, 1.0, 2.0, 3.3]})
df = df.convert_dtypes()
print(df.dtypes)
# Returns
# column    Float64
# dtype: object

newdf = df.iloc[:-1]
print(newdf)
# Returns
#    column
# 0     0.0
# 1     1.0
# 2     2.0

newdf_convert = newdf.convert_dtypes()
print(newdf_convert.dtypes)
print(newdf_convert)
# Returns
# column    Float64
# dtype: object
#    column
# 0     0.0
# 1     1.0
# 2     2.0

newdf_astype_convert = newdf.astype('object').convert_dtypes()
print(newdf_astype_convert.dtypes)
print(newdf_astype_convert)
# Returns
# column    Int64
# dtype: object
#    column
# 0       0
# 1       1
# 2       2

# You could also use a more complex way to obtain int64 (lower i) or float64 (lower f)
newdf_astype_convert_int64 = (
    newdf
    .astype('object')
    .convert_dtypes()  # To dtype with pd.NA
    .astype('object')
    .replace(pd.NA, float('nan'))  # Remove pd.NA created before
    .infer_objects()
)
print(newdf_astype_convert_int64.dtypes)
print(newdf_astype_convert_int64)
# Returns
# column    int64
# dtype: object
#    column
# 0       0
# 1       1
# 2       2

The function convert_dtypes could have a parameter 'simplify_dtypes' (or maybe a better keyword that I haven't thought of) that would do the same thing without much implementation effort: convert_dtypes(simplify_dtypes=True) would do .astype('object') before the actual conversion.

Also, you could use this to simplify "even further" to int64 (lower i) or float64 (lower f); see the full example. You would do: df.astype('object').convert_dtypes().astype('object').replace(pd.NA, float('nan')).infer_objects(). You might want to run this inside a with pd.option_context('future.no_silent_downcasting', True): block because of the replace() in there (see this issue).

Edit: Added .replace(pd.NA, float('nan')) in the example to allow conversion to float64 when a nan is present.
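
A minimal sketch of what such an option could look like as a user-side helper in the meantime (the simplify_dtypes idea, the helper name, and the helper itself are hypothetical, not part of pandas):

import pandas as pd

def convert_dtypes_simplified(df):
    # Hypothetical helper: round-trip through object so convert_dtypes
    # re-infers dtypes from the values instead of keeping Float64.
    return df.astype('object').convert_dtypes()

newdf = pd.DataFrame({'column': pd.array([0.0, 1.0, 2.0], dtype='Float64')})
print(convert_dtypes_simplified(newdf).dtypes)
# column    Int64
# dtype: object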

caballerofelipe added a commit to caballerofelipe/simple_portfolio_ledger that referenced this issue May 9, 2024
- In function `_cols_operation_balance_by_instrument_for_group` changed `prev_operation_balance[<colname>]` for `df.loc[prev_idx, <colname>]` as this is easier to understand, it shows that we are accessing the previous index value.
- Implemented the usage of `with pd.option_context('future.no_silent_downcasting', True):` for `.fillna()` to avoid unexpected downcasting. See pandas-dev/pandas#57734 (comment) . Used throughout `cols_operation*` functions.
- Removed usage of `DataFrame.convert_dtypes()` as it doesn't simplify dtypes, it only passes to a dtype that supports pd.NA. See pandas-dev/pandas#58543 .
- Added `DataFrame.infer_objects()` when returning the ledger or `cols_operation*` functions to try to avoid objects if possible.
- Changed the structure for `cols_operation*` functions:
    - Added a verification of `self._ledger_df`, if empty the function returns an empty DataFrame with the structure needed. Allows for less computing if empty.
    - The way the parameter `show_instr_accnt` creates a return with columns ['instrument', 'account'] is structured the same way on all functions.
- Simplified how the empty ledger is created in `_create_empty_ledger_df`.
- Changed column name 'balance sell profit loss' to 'accumulated sell profit loss'.
- Minor code fixes.
- Minor formatting fixes.
caballerofelipe added a commit to caballerofelipe/simple_portfolio_ledger that referenced this issue May 11, 2024
- Added the kwarg `simplify_dtypes` to the functions `ledger`, `cols_operation`, `cols_operation_cumsum` and `cols_operation_balance_by_instrument`. This allows for dtype simplification (see pandas-dev/pandas#58543 (comment)).
- Added some docstrings.