Skip to content

BUG: read_msgpack raise an error when passed an non existent path in Python 2 #16523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 30, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.22.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,8 @@ Documentation Changes
Bug Fixes
~~~~~~~~~

- Bug in ``pd.read_msgpack()`` with a non existent file is passed in Python 2 (:issue:`15296`)

Conversion
^^^^^^^^^^

Expand Down
16 changes: 9 additions & 7 deletions pandas/io/packers.py
Original file line number Diff line number Diff line change
Expand Up @@ -192,7 +192,6 @@ def read(fh):

# see if we have an actual file
if isinstance(path_or_buf, compat.string_types):

try:
exists = os.path.exists(path_or_buf)
except (TypeError, ValueError):
Expand All @@ -202,18 +201,21 @@ def read(fh):
with open(path_or_buf, 'rb') as fh:
return read(fh)

# treat as a binary-like
if isinstance(path_or_buf, compat.binary_type):
# treat as a binary-like
fh = None
try:
fh = compat.BytesIO(path_or_buf)
return read(fh)
# We can't distinguish between a path and a buffer of bytes in
# Python 2 so instead assume the first byte of a valid path is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply check os.path.exists(path_or_buf) (see how this is done in pandas.io.json.json.read_json

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw your expl. But if os.path.exists(path_or_buf) fails then you know for sure its NOT a path. Then you can try to read. I agree you can't then distinguish between an invalid path and an invalid byte stream, but so what.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benefit is when an incorrect path is accidentally used. As it stands the code can continue running with invalid data until it crashes mysteriously elsewhere or silently produces an incorrect result (the latter motivated me to make this PR).

# less than 0x80.
if compat.PY3 or ord(path_or_buf[0]) >= 0x80:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would fail for a 0-len buffer. what does the 0x80 compare against? is this platform dependent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seems fragile. IIUC, it's possible to have filenames that, according to Python, start with characters above 0x80, even if the filesystem does some encoding on the filename before reading on writing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I don't see a way to make this avoid the edge cases like this for Python 2 with the current API. I think this way minimised the number of affected users and if it does affect someone the check can be bypassed using ./filename.

Copy link
Contributor Author

@chrisburr chrisburr Jul 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would fail for a 0-len buffer. what does the 0x80 compare against? is this platform dependent?

According to the msgpack spec "Applications can assign 0 to 127 to store application-specific type information.". I believe pandas doesn't currently use this so this assumes if the first byte is below 0x80 it was supposed to be a filename rather than a collection of bytes to decode.

I'm not sure what the correct behaviour should be for passing read_msgpack("").

fh = compat.BytesIO(path_or_buf)
return read(fh)
finally:
if fh is not None:
fh.close()

# a buffer like
if hasattr(path_or_buf, 'read') and compat.callable(path_or_buf.read):
elif hasattr(path_or_buf, 'read') and compat.callable(path_or_buf.read):
# treat as a buffer like
return read(path_or_buf)

raise ValueError('path_or_buf needs to be a string file path or file-like')
Expand Down
22 changes: 21 additions & 1 deletion pandas/tests/io/test_common.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
import pandas.util.testing as tm

from pandas.io import common
from pandas.compat import is_platform_windows, StringIO
from pandas.compat import is_platform_windows, StringIO, FileNotFoundError

from pandas import read_csv, concat

Expand Down Expand Up @@ -125,6 +125,26 @@ def test_iterator(self):
tm.assert_frame_equal(first, expected.iloc[[0]])
tm.assert_frame_equal(concat(it), expected.iloc[1:])

@pytest.mark.parametrize('reader, module, error_class, fn_ext', [
(pd.read_csv, 'os', FileNotFoundError, 'csv'),
(pd.read_table, 'os', FileNotFoundError, 'csv'),
(pd.read_fwf, 'os', FileNotFoundError, 'txt'),
(pd.read_excel, 'xlrd', FileNotFoundError, 'xlsx'),
(pd.read_feather, 'feather', Exception, 'feather'),
(pd.read_hdf, 'tables', FileNotFoundError, 'h5'),
(pd.read_stata, 'os', FileNotFoundError, 'dta'),
(pd.read_sas, 'os', FileNotFoundError, 'sas7bdat'),
(pd.read_json, 'os', ValueError, 'json'),
(pd.read_msgpack, 'os', ValueError, 'mp'),
(pd.read_pickle, 'os', FileNotFoundError, 'pickle'),
])
def test_read_non_existant(self, reader, module, error_class, fn_ext):
pytest.importorskip(module)

path = os.path.join(HERE, 'data', 'does_not_exist.' + fn_ext)
with pytest.raises(error_class):
reader(path)

@pytest.mark.parametrize('reader, module, path', [
(pd.read_csv, 'os', os.path.join(HERE, 'data', 'iris.csv')),
(pd.read_table, 'os', os.path.join(HERE, 'data', 'iris.csv')),
Expand Down