Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Datatypes not preserved on pd.read_excel #60088

Open
2 of 3 tasks
vignesh14052002 opened this issue Oct 23, 2024 · 9 comments
Open
2 of 3 tasks

BUG: Datatypes not preserved on pd.read_excel #60088

vignesh14052002 opened this issue Oct 23, 2024 · 9 comments
Labels
Bug IO Excel read_excel, to_excel

Comments

@vignesh14052002
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.read_excel("./preserve_data.xlsx")

Issue Description

Input data
image

Datatype is preserved, but values are modified

TRUE -> 1
image

values are modified, even if i read as string

TRUE <-> 1
image

additionally, I want TRUE in uppercase, if it is changed to True, i can't find difference if user has 'True in a cell

Expected Behavior

There should be a way to preserve values and datatypes as it is

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.11.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_India.1252

pandas : 2.2.2
numpy : 1.26.2
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.0.2
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.23
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@vignesh14052002 vignesh14052002 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 23, 2024
@asishm asishm added the IO Excel read_excel, to_excel label Oct 23, 2024
@ZKaoChi
Copy link
Contributor

ZKaoChi commented Oct 30, 2024

I think this is worth changing, can I take it ?

@ZKaoChi
Copy link
Contributor

ZKaoChi commented Oct 31, 2024

Hello, I tested excel and found that as long as there is only true in the table, it will be converted to TRUE, which is a keyword in excel. So maybe user will not have True in a cell.

@ZKaoChi
Copy link
Contributor

ZKaoChi commented Oct 31, 2024

But TRUE is parsed as True in the first column and 1 in the back, which I think is worth changing.Maybe we should make the default values the same for different columns

@ZKaoChi
Copy link
Contributor

ZKaoChi commented Oct 31, 2024

I created a excel to test it. The data is:

      a     b     c     d     e     f     g  h     i
0  True     1  True  True     1     1  True  1  True
1  True  True     1  True     1  True     1  1  True
2  True  True  True     1  True     1     1  1  True

If we read it directly, the result is:

>>> pd.read_excel("./123.xlsx")
      a  b  c  d  e  f  g  h     i
0  True  1  1  1  1  1  1  1  True
1  True  1  1  1  1  1  1  1  True
2  True  1  1  1  1  1  1  1  True

If we read it with dtype = str, the result is:

>>> pd.read_excel("./123.xlsx",dtype=str)
      a  b     c     d  e  f     g  h     i
0  True  1  True  True  1  1  True  1  True
1  True  1  True  True  1  1  True  1  True
2  True  1  True  True  1  1  True  1  True

I think it has something to do with the way the data is created.

@rhshadrach
Copy link
Member

As the docstring states, use dtype=object if you do not want pandas to do any inference on the dtype.

I checked both calamine and openpyxl, both readers are reading integer or Boolean values instead of e.g. TRUE. You can see this for openpyxl with:

from openpyxl import load_workbook
wb = load_workbook('test.xlsx', data_only=True)
for row in wb.worksheets[0].rows:
    for cell in row:
        print(cell, cell.value, cell.internal_value, cell.data_type)

As pandas only gets values through third-party libraries, they would need to support this first. It is likely there is a technical limitation in the Excel spec that prevents this, but I'm not certain.

As there is nothing pandas can do here, closing this issue.

@rhshadrach rhshadrach added Upstream issue Issue related to pandas dependency and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 2, 2024
@vignesh14052002
Copy link
Author

@rhshadrach, I am getting expecting results in openpyxl, the issue is with pandas, please look into it

Image

I tried debugging and found the place where conversion is happening, it is the _infer_types from ParserBase , here is the callstack

Image

eventhough if i pass dtype="object" , in column B [True, 1] is being converted to [1,1] at this line

Image

when i execute parsers.sanitize_objects(values, na_values), it is modifying the memory of values array

this is post execution state

Image

If i put values.copy() i get expected results

Image

Image

can you confirm, this is the right fix?

@rhshadrach rhshadrach reopened this Mar 19, 2025
@rhshadrach
Copy link
Member

rhshadrach commented Mar 26, 2025

@vignesh14052002 - it would be helpful if you provided a reproducible example, e.g.

df = pd.DataFrame({"a": ["TRUE", "1"], "b": ["TRUE", "TRUE"], "c": ["1", "TRUE"]})
df.to_excel("test.xlsx")

Does that file reproduce the issue for you?

Also, it would be helpful if instead of posting screenshots, you used plaintext for code and permalinks for when you want to reference pandas code.

As far as I can tell so far, the code you are changing is not hit.

@vignesh14052002
Copy link
Author

@rhshadrach here is a reproducible code example

To generate input data

import pandas as pd

df = pd.DataFrame({"a":[True,True],"b":[1,True],"c":[True,1]})
df.to_excel("./preserve_data.xlsx",index=False)

reading with pandas (b and c column will be modified)

pd.read_excel("./preserve_data.xlsx")

reading with openpyxl (expected results)

import openpyxl

workbook = openpyxl.load_workbook("./preserve_data.xlsx")

for row in workbook.active.iter_rows(values_only=True):
    print(row)

@rhshadrach
Copy link
Member

Thanks @vignesh14052002 - it seems to me the issue is in pandas.io.parsers.base_parser.ParserBase._convert_to_ndarrays. There we call self._infer_types even when dtype="object". It seems like we should not be inferring the types at all in this case. However, this parser is used beyond Excel, I haven't yet looked into if this change would break anything else. Further investigations and PRs to fix are welcome.

Putting values.copy() prior to calling sanitize_objects breaks 37 tests for me; this is likely not the right approach.

@rhshadrach rhshadrach removed the Upstream issue Issue related to pandas dependency label Apr 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
Development

No branches or pull requests

4 participants