Commit a25a985

bpo-28080: Add support for the fallback encoding in ZIP files (GH-32007)

* Add the metadata_encoding parameter in the zipfile.ZipFile constructor.
* Add the --metadata-encoding option in the zipfile CLI.

Co-authored-by: Stephen J. Turnbull <[email protected]>
1 parent c6cd3cc commit a25a985

5 files changed, +211 -11 lines changed


Doc/library/zipfile.rst

Lines changed: 40 additions & 1 deletion
@@ -139,7 +139,8 @@ ZipFile Objects
139139

140140

141141
.. class:: ZipFile(file, mode='r', compression=ZIP_STORED, allowZip64=True, \
142-
compresslevel=None, *, strict_timestamps=True)
142+
compresslevel=None, *, strict_timestamps=True,
143+
metadata_encoding=None)
143144
144145
Open a ZIP file, where *file* can be a path to a file (a string), a
145146
file-like object or a :term:`path-like object`.
@@ -183,6 +184,10 @@ ZipFile Objects
183184
Similar behavior occurs with files newer than 2107-12-31,
184185
the timestamp is also set to the limit.
185186

187+
When mode is ``'r'``, *metadata_encoding* may be set to the name of a codec,
188+
which will be used to decode metadata such as the names of members and ZIP
189+
comments.
190+
186191
If the file is created with mode ``'w'``, ``'x'`` or ``'a'`` and then
187192
:meth:`closed <close>` without adding any files to the archive, the appropriate
188193
ZIP structures for an empty archive will be written to the file.
@@ -194,6 +199,19 @@ ZipFile Objects
194199
with ZipFile('spam.zip', 'w') as myzip:
195200
myzip.write('eggs.txt')
196201

202+
.. note::
203+
204+
*metadata_encoding* is an instance-wide setting for the ZipFile.
205+
It is not currently possible to set this on a per-member basis.
206+
207+
This attribute is a workaround for legacy implementations which produce
208+
archives with names in the current locale encoding or code page (mostly
209+
on Windows). According to the .ZIP standard, the encoding of metadata
210+
may be specified to be either IBM code page (default) or UTF-8 by a flag
211+
in the archive header.
212+
That flag takes precedence over *metadata_encoding*, which is
213+
a Python-specific extension.
214+
197215
.. versionadded:: 3.2
198216
Added the ability to use :class:`ZipFile` as a context manager.
199217

@@ -220,6 +238,10 @@ ZipFile Objects
220238
.. versionadded:: 3.8
221239
The *strict_timestamps* keyword-only argument
222240

241+
.. versionchanged:: 3.11
242+
Added support for specifying member name encoding for reading
243+
metadata in the zipfile's directory and file headers.
244+
223245

224246
.. method:: ZipFile.close()
225247

@@ -395,6 +417,15 @@ ZipFile Objects
395417
given.
396418
The archive must be open with mode ``'w'``, ``'x'`` or ``'a'``.
397419

420+
.. note::
421+
422+
The ZIP file standard historically did not specify a metadata encoding,
423+
but strongly recommended CP437 (the original IBM PC encoding) for
424+
interoperability. Recent versions allow use of UTF-8 (only). In this
425+
module, UTF-8 will automatically be used to write the member names if
426+
they contain any non-ASCII characters. It is not possible to write
427+
member names in any encoding other than ASCII or UTF-8.
428+
398429
.. note::
399430

400431
Archive names should be relative to the archive root, that is, they should not
@@ -868,6 +899,14 @@ Command-line options
868899

869900
Test whether the zipfile is valid or not.
870901

902+
.. cmdoption:: --metadata-encoding <encoding>
903+
904+
Specify encoding of member names for :option:`-l`, :option:`-e` and
905+
:option:`-t`.
906+
907+
.. versionadded:: 3.11
908+
909+
871910
Decompression pitfalls
872911
----------------------
873912

Doc/whatsnew/3.11.rst

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -432,6 +432,12 @@ venv
432432
Third party code that also creates new virtual environments should do the same.
433433
(Contributed by Miro Hrončok in :issue:`45413`.)
434434

435+
zipfile
436+
-------
437+
438+
* Added support for specifying member name encoding for reading
439+
metadata in the zipfile's directory and file headers.
440+
(Contributed by Stephen J. Turnbull and Serhiy Storchaka in :issue:`28080`.)
435441

436442
fcntl
437443
-----

Lib/test/test_zipfile.py

Lines changed: 138 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -21,8 +21,10 @@
2121
from random import randint, random, randbytes
2222

2323
from test.support import script_helper
24-
from test.support import (findfile, requires_zlib, requires_bz2,
25-
requires_lzma, captured_stdout, requires_subprocess)
24+
from test.support import (
25+
findfile, requires_zlib, requires_bz2, requires_lzma,
26+
captured_stdout, captured_stderr, requires_subprocess
27+
)
2628
from test.support.os_helper import (
2729
TESTFN, unlink, rmtree, temp_dir, temp_cwd, fd_count
2830
)
@@ -3210,5 +3212,139 @@ def test_inheritance(self, alpharep):
32103212
assert isinstance(file, cls)
32113213

32123214

3215+
class EncodedMetadataTests(unittest.TestCase):
3216+
file_names = ['\u4e00', '\u4e8c', '\u4e09'] # Han 'one', 'two', 'three'
3217+
file_content = [
3218+
"This is pure ASCII.\n".encode('ascii'),
3219+
# This is modern Japanese. (UTF-8)
3220+
"\u3053\u308c\u306f\u73fe\u4ee3\u7684\u65e5\u672c\u8a9e\u3067\u3059\u3002\n".encode('utf-8'),
3221+
# This is obsolete Japanese. (Shift JIS)
3222+
"\u3053\u308c\u306f\u53e4\u3044\u65e5\u672c\u8a9e\u3067\u3059\u3002\n".encode('shift_jis'),
3223+
]
3224+
3225+
def setUp(self):
3226+
self.addCleanup(unlink, TESTFN)
3227+
# Create .zip of 3 members with Han names encoded in Shift JIS.
3228+
# Each name is 1 Han character encoding to 2 bytes in Shift JIS.
3229+
# The ASCII names are arbitrary as long as they are length 2 and
3230+
# not otherwise contained in the zip file.
3231+
# Data elements are encoded bytes (ascii, utf-8, shift_jis).
3232+
placeholders = ["n1", "n2"] + self.file_names[2:]
3233+
with zipfile.ZipFile(TESTFN, mode="w") as tf:
3234+
for temp, content in zip(placeholders, self.file_content):
3235+
tf.writestr(temp, content, zipfile.ZIP_STORED)
3236+
# Hack in the Shift JIS names with flag bit 11 (UTF-8) unset.
3237+
with open(TESTFN, "rb") as tf:
3238+
data = tf.read()
3239+
for name, temp in zip(self.file_names, placeholders[:2]):
3240+
data = data.replace(temp.encode('ascii'),
3241+
name.encode('shift_jis'))
3242+
with open(TESTFN, "wb") as tf:
3243+
tf.write(data)
3244+
3245+
def _test_read(self, zipfp, expected_names, expected_content):
3246+
# Check the namelist
3247+
names = zipfp.namelist()
3248+
self.assertEqual(sorted(names), sorted(expected_names))
3249+
3250+
# Check infolist
3251+
infos = zipfp.infolist()
3252+
names = [zi.filename for zi in infos]
3253+
self.assertEqual(sorted(names), sorted(expected_names))
3254+
3255+
# check getinfo
3256+
for name, content in zip(expected_names, expected_content):
3257+
info = zipfp.getinfo(name)
3258+
self.assertEqual(info.filename, name)
3259+
self.assertEqual(info.file_size, len(content))
3260+
self.assertEqual(zipfp.read(name), content)
3261+
3262+
def test_read_with_metadata_encoding(self):
3263+
# Read the ZIP archive with correct metadata_encoding
3264+
with zipfile.ZipFile(TESTFN, "r", metadata_encoding='shift_jis') as zipfp:
3265+
self._test_read(zipfp, self.file_names, self.file_content)
3266+
3267+
def test_read_without_metadata_encoding(self):
3268+
# Read the ZIP archive without metadata_encoding
3269+
expected_names = [name.encode('shift_jis').decode('cp437')
3270+
for name in self.file_names[:2]] + self.file_names[2:]
3271+
with zipfile.ZipFile(TESTFN, "r") as zipfp:
3272+
self._test_read(zipfp, expected_names, self.file_content)
3273+
3274+
def test_read_with_incorrect_metadata_encoding(self):
3275+
# Read the ZIP archive with incorrect metadata_encoding
3276+
expected_names = [name.encode('shift_jis').decode('koi8-u')
3277+
for name in self.file_names[:2]] + self.file_names[2:]
3278+
with zipfile.ZipFile(TESTFN, "r", metadata_encoding='koi8-u') as zipfp:
3279+
self._test_read(zipfp, expected_names, self.file_content)
3280+
3281+
def test_read_with_unsuitable_metadata_encoding(self):
3282+
# Read the ZIP archive with metadata_encoding unsuitable for
3283+
# decoding metadata
3284+
with self.assertRaises(UnicodeDecodeError):
3285+
zipfile.ZipFile(TESTFN, "r", metadata_encoding='ascii')
3286+
with self.assertRaises(UnicodeDecodeError):
3287+
zipfile.ZipFile(TESTFN, "r", metadata_encoding='utf-8')
3288+
3289+
def test_read_after_append(self):
3290+
newname = '\u56db' # Han 'four'
3291+
expected_names = [name.encode('shift_jis').decode('cp437')
3292+
for name in self.file_names[:2]] + self.file_names[2:]
3293+
expected_names.append(newname)
3294+
expected_content = (*self.file_content, b"newcontent")
3295+
3296+
with zipfile.ZipFile(TESTFN, "a") as zipfp:
3297+
zipfp.writestr(newname, "newcontent")
3298+
self.assertEqual(sorted(zipfp.namelist()), sorted(expected_names))
3299+
3300+
with zipfile.ZipFile(TESTFN, "r") as zipfp:
3301+
self._test_read(zipfp, expected_names, expected_content)
3302+
3303+
with zipfile.ZipFile(TESTFN, "r", metadata_encoding='shift_jis') as zipfp:
3304+
self.assertEqual(sorted(zipfp.namelist()), sorted(expected_names))
3305+
for i, (name, content) in enumerate(zip(expected_names, expected_content)):
3306+
info = zipfp.getinfo(name)
3307+
self.assertEqual(info.filename, name)
3308+
self.assertEqual(info.file_size, len(content))
3309+
if i < 2:
3310+
with self.assertRaises(zipfile.BadZipFile):
3311+
zipfp.read(name)
3312+
else:
3313+
self.assertEqual(zipfp.read(name), content)
3314+
3315+
def test_write_with_metadata_encoding(self):
3316+
ZF = zipfile.ZipFile
3317+
for mode in ("w", "x", "a"):
3318+
with self.assertRaisesRegex(ValueError,
3319+
"^metadata_encoding is only"):
3320+
ZF("nonesuch.zip", mode, metadata_encoding="shift_jis")
3321+
3322+
def test_cli_with_metadata_encoding(self):
3323+
errmsg = "Non-conforming encodings not supported with -c."
3324+
args = ["--metadata-encoding=shift_jis", "-c", "nonesuch", "nonesuch"]
3325+
with captured_stdout() as stdout:
3326+
with captured_stderr() as stderr:
3327+
self.assertRaises(SystemExit, zipfile.main, args)
3328+
self.assertEqual(stdout.getvalue(), "")
3329+
self.assertIn(errmsg, stderr.getvalue())
3330+
3331+
with captured_stdout() as stdout:
3332+
zipfile.main(["--metadata-encoding=shift_jis", "-t", TESTFN])
3333+
listing = stdout.getvalue()
3334+
3335+
with captured_stdout() as stdout:
3336+
zipfile.main(["--metadata-encoding=shift_jis", "-l", TESTFN])
3337+
listing = stdout.getvalue()
3338+
for name in self.file_names:
3339+
self.assertIn(name, listing)
3340+
3341+
os.mkdir(TESTFN2)
3342+
self.addCleanup(rmtree, TESTFN2)
3343+
zipfile.main(["--metadata-encoding=shift_jis", "-e", TESTFN, TESTFN2])
3344+
listing = os.listdir(TESTFN2)
3345+
for name in self.file_names:
3346+
self.assertIn(name, listing)
3347+
3348+
32133349
if __name__ == "__main__":
32143350
unittest.main()

Lib/zipfile.py

Lines changed: 23 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -480,7 +480,7 @@ def FileHeader(self, zip64=None):
480480

481481
def _encodeFilenameFlags(self):
482482
try:
483-
return self.filename.encode('ascii'), self.flag_bits
483+
return self.filename.encode('ascii'), self.flag_bits & ~_MASK_UTF_FILENAME
484484
except UnicodeEncodeError:
485485
return self.filename.encode('utf-8'), self.flag_bits | _MASK_UTF_FILENAME
486486

@@ -1240,7 +1240,7 @@ class ZipFile:
12401240
_windows_illegal_name_trans_table = None
12411241

12421242
def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=True,
1243-
compresslevel=None, *, strict_timestamps=True):
1243+
compresslevel=None, *, strict_timestamps=True, metadata_encoding=None):
12441244
"""Open the ZIP file with mode read 'r', write 'w', exclusive create 'x',
12451245
or append 'a'."""
12461246
if mode not in ('r', 'w', 'x', 'a'):
@@ -1259,6 +1259,12 @@ def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=True,
12591259
self.pwd = None
12601260
self._comment = b''
12611261
self._strict_timestamps = strict_timestamps
1262+
self.metadata_encoding = metadata_encoding
1263+
1264+
# Check that we don't try to write with nonconforming codecs
1265+
if self.metadata_encoding and mode != 'r':
1266+
raise ValueError(
1267+
"metadata_encoding is only supported for reading files")
12621268

12631269
# Check if we were passed a file-like object
12641270
if isinstance(file, os.PathLike):
@@ -1389,13 +1395,13 @@ def _RealGetContents(self):
13891395
if self.debug > 2:
13901396
print(centdir)
13911397
filename = fp.read(centdir[_CD_FILENAME_LENGTH])
1392-
flags = centdir[5]
1398+
flags = centdir[_CD_FLAG_BITS]
13931399
if flags & _MASK_UTF_FILENAME:
13941400
# UTF-8 file names extension
13951401
filename = filename.decode('utf-8')
13961402
else:
13971403
# Historical ZIP filename encoding
1398-
filename = filename.decode('cp437')
1404+
filename = filename.decode(self.metadata_encoding or 'cp437')
13991405
# Create ZipInfo instance to store file information
14001406
x = ZipInfo(filename)
14011407
x.extra = fp.read(centdir[_CD_EXTRA_FIELD_LENGTH])
@@ -1572,7 +1578,7 @@ def open(self, name, mode="r", pwd=None, *, force_zip64=False):
15721578
# UTF-8 filename
15731579
fname_str = fname.decode("utf-8")
15741580
else:
1575-
fname_str = fname.decode("cp437")
1581+
fname_str = fname.decode(self.metadata_encoding or "cp437")
15761582

15771583
if fname_str != zinfo.orig_filename:
15781584
raise BadZipFile(
@@ -2461,27 +2467,36 @@ def main(args=None):
24612467
help='Create zipfile from sources')
24622468
group.add_argument('-t', '--test', metavar='<zipfile>',
24632469
help='Test if a zipfile is valid')
2470+
parser.add_argument('--metadata-encoding', metavar='<encoding>',
2471+
help='Specify encoding of member names for -l, -e and -t')
24642472
args = parser.parse_args(args)
24652473

2474+
encoding = args.metadata_encoding
2475+
24662476
if args.test is not None:
24672477
src = args.test
2468-
with ZipFile(src, 'r') as zf:
2478+
with ZipFile(src, 'r', metadata_encoding=encoding) as zf:
24692479
badfile = zf.testzip()
24702480
if badfile:
24712481
print("The following enclosed file is corrupted: {!r}".format(badfile))
24722482
print("Done testing")
24732483

24742484
elif args.list is not None:
24752485
src = args.list
2476-
with ZipFile(src, 'r') as zf:
2486+
with ZipFile(src, 'r', metadata_encoding=encoding) as zf:
24772487
zf.printdir()
24782488

24792489
elif args.extract is not None:
24802490
src, curdir = args.extract
2481-
with ZipFile(src, 'r') as zf:
2491+
with ZipFile(src, 'r', metadata_encoding=encoding) as zf:
24822492
zf.extractall(curdir)
24832493

24842494
elif args.create is not None:
2495+
if encoding:
2496+
print("Non-conforming encodings not supported with -c.",
2497+
file=sys.stderr)
2498+
sys.exit(1)
2499+
24852500
zip_name = args.create.pop(0)
24862501
files = args.create
24872502

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
Add the *metadata_encoding* parameter in the :class:`zipfile.ZipFile`
2+
constructor and the ``--metadata-encoding`` option in the :mod:`zipfile`
3+
CLI to allow reading zipfiles using non-standard codecs to encode the
4+
filenames within the archive.
