BUG: Datatypes not preserved on pd.read_excel #60088

vignesh14052002 · 2024-10-23T14:28:57Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.read_excel("./preserve_data.xlsx")

Issue Description

Input data

Datatype is preserved, but values are modified

TRUE -> 1

Values are modified even if I read as string

TRUE <-> 1

Additionally, I want TRUE to stay in uppercase: if it is changed to True, I can't tell it apart from a cell where the user has entered 'True.

Expected Behavior

There should be a way to preserve values and datatypes as they are.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.11.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_India.1252

pandas : 2.2.2
numpy : 1.26.2
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.0.2
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.23
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None


ZKaoChi · 2024-10-30T10:47:53Z

I think this is worth changing. Can I take it?

ZKaoChi · 2024-10-31T08:15:40Z

Hello, I tested Excel and found that whenever a cell contains only true, it is converted to TRUE, which is a keyword in Excel. So a user may never actually have True in a cell.

ZKaoChi · 2024-10-31T08:19:37Z

But TRUE is parsed as True in the first column and as 1 in the later ones, which I think is worth changing. Maybe we should make the default values the same across columns.

ZKaoChi · 2024-10-31T10:40:57Z

I created an Excel file to test it. The data is:

      a     b     c     d     e     f     g  h     i
0  True     1  True  True     1     1  True  1  True
1  True  True     1  True     1  True     1  1  True
2  True  True  True     1  True     1     1  1  True

If we read it directly, the result is:

>>> pd.read_excel("./123.xlsx")
      a  b  c  d  e  f  g  h     i
0  True  1  1  1  1  1  1  1  True
1  True  1  1  1  1  1  1  1  True
2  True  1  1  1  1  1  1  1  True

If we read it with dtype = str, the result is:

>>> pd.read_excel("./123.xlsx",dtype=str)
      a  b     c     d  e  f     g  h     i
0  True  1  True  True  1  1  True  1  True
1  True  1  True  True  1  1  True  1  True
2  True  1  True  True  1  1  True  1  True

I think it has something to do with the way the data is created.

rhshadrach · 2024-11-02T12:54:43Z

As the docstring states, use dtype=object if you do not want pandas to do any inference on the dtype.

I checked both calamine and openpyxl; both readers return integer or Boolean values instead of e.g. TRUE. You can see this for openpyxl with:

from openpyxl import load_workbook
wb = load_workbook('test.xlsx', data_only=True)
for row in wb.worksheets[0].rows:
    for cell in row:
        print(cell, cell.value, cell.internal_value, cell.data_type)

As pandas only gets values through third-party libraries, they would need to support this first. It is likely there is a technical limitation in the Excel spec that prevents this, but I'm not certain.

As there is nothing pandas can do here, closing this issue.

vignesh14052002 · 2025-03-12T15:18:23Z

@rhshadrach, I am getting the expected results with openpyxl; the issue is in pandas. Please look into it.

I tried debugging and found where the conversion happens: it is in _infer_types from ParserBase. Here is the call stack.

Even if I pass dtype="object", column B [True, 1] is converted to [1, 1] at this line.

When I execute parsers.sanitize_objects(values, na_values), it modifies the memory of the values array.

This is the state after execution.

If I use values.copy(), I get the expected results.

Can you confirm this is the right fix?
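
The in-place change described above can be illustrated without pandas. The sketch below rests on an assumption that is not confirmed in this thread: that the sanitizing step deduplicates cell values through a dict lookup. `intern_in_place` is a hypothetical stand-in, not the pandas function. Since `True == 1` and `hash(True) == hash(1)` in Python, whichever of the two appears first in a column would replace the other, which is consistent with the `dtype=str` output posted earlier in the thread.

```python
def intern_in_place(values):
    """Hypothetical stand-in: replace repeated values with the first equal one seen."""
    memo = {}
    for i, val in enumerate(values):
        if val in memo:
            # True and 1 hash and compare equal, so they share one memo slot.
            values[i] = memo[val]
        else:
            memo[val] = val
    return values


column_b = [1, True, True]   # first value is the integer 1
column_c = [True, 1, True]   # first value is the boolean True

intern_in_place(column_b)
intern_in_place(column_c)
print(column_b)  # [1, 1, 1]
print(column_c)  # [True, True, True]

# Working on a copy leaves the caller's list untouched.
original = [True, 1]
sanitized = intern_in_place(original.copy())
print(original, sanitized)  # [True, 1] [True, True]
```

Passing a copy keeps the caller's values intact in this toy version, which matches the observation above. Whether that is an acceptable fix in pandas is a separate question.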

rhshadrach · 2025-03-26T15:48:05Z

@vignesh14052002 - it would be helpful if you provided a reproducible example, e.g.

df = pd.DataFrame({"a": ["TRUE", "1"], "b": ["TRUE", "TRUE"], "c": ["1", "TRUE"]})
df.to_excel("test.xlsx")

Does that file reproduce the issue for you?

Also, it would be helpful if instead of posting screenshots, you used plaintext for code and permalinks for when you want to reference pandas code.

As far as I can tell so far, the code you are changing is not hit.

vignesh14052002 · 2025-03-27T09:50:59Z

@rhshadrach here is a reproducible code example

To generate input data

import pandas as pd

df = pd.DataFrame({"a": [True, True], "b": [1, True], "c": [True, 1]})
df.to_excel("./preserve_data.xlsx", index=False)

Reading with pandas (columns b and c are modified)

pd.read_excel("./preserve_data.xlsx")

Reading with openpyxl (expected results)

import openpyxl

workbook = openpyxl.load_workbook("./preserve_data.xlsx")

for row in workbook.active.iter_rows(values_only=True):
    print(row)

rhshadrach · 2025-04-06T12:29:33Z

Thanks @vignesh14052002 - it seems to me the issue is in pandas.io.parsers.base_parser.ParserBase._convert_to_ndarrays. There we call self._infer_types even when dtype="object". It seems like we should not be inferring the types at all in this case. However, this parser is used beyond Excel, and I haven't yet looked into whether this change would break anything else. Further investigations and PRs to fix are welcome.

Putting values.copy() prior to calling sanitize_objects breaks 37 tests for me; this is likely not the right approach.
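
The idea of not inferring types when `dtype="object"` could look roughly like the sketch below. It is a toy illustration only: `infer_types` and `convert_column` are hypothetical names and do not reflect the actual pandas parser code, and the inference rule here is deliberately simplified to a single case (a column mixing bool and int collapses to int).

```python
def infer_types(values):
    """Toy inference step: a column mixing bool and int collapses to int."""
    if {type(v) for v in values} == {bool, int}:
        return [int(v) for v in values]
    return list(values)


def convert_column(values, dtype=None):
    """Hypothetical converter that bypasses inference for dtype=object."""
    if dtype is object or dtype == "object":
        # Caller opted out of inference: return the cells exactly as read.
        return list(values)
    return infer_types(values)


cells = [1, True]
inferred = convert_column(cells)
preserved = convert_column(cells, dtype=object)
print(inferred)   # [1, 1]
print(preserved)  # [1, True]
```

Because the parser is shared with other readers, any real change along these lines would still need to be checked against the existing test suite.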

vignesh14052002 added the Bug and Needs Triage labels Oct 23, 2024

asishm added the IO Excel label Oct 23, 2024

rhshadrach closed this as completed Nov 2, 2024

rhshadrach added the Upstream issue label and removed the Needs Triage label Nov 2, 2024

rhshadrach reopened this Mar 19, 2025

rhshadrach removed the Upstream issue label Apr 6, 2025
