Skip to content

Allow gzip mtime to be a datetime object #128584

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
gchqdev227 opened this issue Jan 7, 2025 · 7 comments
Open

Allow gzip mtime to be a datetime object #128584

gchqdev227 opened this issue Jan 7, 2025 · 7 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@gchqdev227
Copy link

gchqdev227 commented Jan 7, 2025

Feature or enhancement

Proposal:

Currently, the mtime optional parameter of the gzip.GzipFile class and gzip.compress function can be an float representing a Unix timestamp, or None to use the current time. I propose allowing this to also be a datetime.datetime object. All of these would then be possible:

from datetime import datetime, timezone
import gzip
import zoneinfo

compress(data)  # compress using current time
compress(data, mtime=0)  # ignore the time
compress(data, mtime=1736247159)  # specify the time with a Unix timestamp
compress(data, mtime=datetime(2025, 1, 7, tzinfo=timezone.utc))  # specify time with a datetime object
compress(data, mtime=datetime(2025, 1, 7, tzinfo=zoneinfo.ZoneInfo(key='US/Eastern')))  # specify a different timezone

This avoids users needing to cast their datetime objects into timestamps before calling compress or creating a GzipFile instance, which can lead to mistakes.

I have an initial PR for this ready to go.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

@gchqdev227 gchqdev227 added the type-feature A feature request or enhancement label Jan 7, 2025
@picnixz picnixz added the stdlib Python modules in the Lib dir label Jan 7, 2025
@serhiy-storchaka
Copy link
Member

serhiy-storchaka commented Jan 13, 2025

This would create a dependency between gzip and datetime modules. This is not desirable.

It is not hard to call the timestamp() method before calling compress().

@gchqdev227
Copy link
Author

Hi @serhiy-storchaka, thank you for your comments.

This would create a dependency between gzip and datetime modules. This is not desirable.

This is a fair point, although I would point out that datetime is already imported in a handful of stdlib modules/packages, such as calendar, http and tomllib (a more recent example). I agree that creating this link where it is not completely necessary might be a harder sell, although this issue is a precursor to a larger gzip issue we intend to open in the future, but are trying to keep the changes smaller.

It is not hard to call the timestamp() method before calling compress().

This is true, although I would use the same argument as with supporting pathlib.Path objects instead of expecting users to cast them to a string first (although admittedly os.PathLike exists which reduces direct imports of pathlib).

@picnixz
Copy link
Member

picnixz commented Feb 24, 2025

This is true, although I would use the same argument as with supporting pathlib.Path objects instead of expecting users to cast

The difference is that we can cast it to str in the stdlib without a need to additional imports (I think?)

although this issue is a precursor to a larger gzip issue we intend to open in the future, but are trying to keep the changes smaller.

Could you give us more details here please?


From a purely personal opinion, I think mtime is really a timestamp, not a datetime object. The difference with Path objects and strings is that Path objects are really castable to strings while datetime objects are not canonically casted to timestamps with a built-in function (I'm not sure they implement an __int__). They also hold more information than just an integer-like timestamp IMO.

@gchqdev227
Copy link
Author

Hi @picnixz, thank you for your comments.

although this issue is a precursor to a larger gzip issue we intend to open in the future, but are trying to keep the changes smaller.

Could you give us more details here please?

I would like to expand the current gzip functionality to allow for easier parsing/specifying of the gzip headers. The first stage in this would be to add some sort of parse_gzip_header function which would more or less emulate _read_gzip_header function, expect without throwing away the info. The initial prototype I wrote for this used datetime objects as a natural representation of the creation date.

From a purely personal opinion, I think mtime is really a timestamp, not a datetime object. The difference with Path objects and strings is that Path objects are really castable to strings while datetime objects are not canonically casted to timestamps with a built-in function (I'm not sure they implement an int). They also hold more information than just an integer-like timestamp IMO.

This is fair, but that timestamp does represent a specific date and time, which for the purposes of investigating the metadata seems like a good idea? It also gives us a chance to raise an error if a user is in danger of leaking location information unknowingly, as mentioned in this comment on the PR:

I've pushed a new commit which causes the relevant compress methods to raise a ValueError if the datetime object is not timezone-aware. This is to address security concerns where naive datetimes can lead to informational leakage in web servers and similar contexts: see here for one such example. This also enforces the RFC.

@picnixz
Copy link
Member

picnixz commented Feb 25, 2025

I would like to expand the current gzip functionality to allow for easier parsing/specifying of the gzip headers. The first stage in this would be to add some sort of parse_gzip_header function which would more or less emulate _read_gzip_header function, expect without throwing away the info. The initial prototype I wrote for this used datetime objects as a natural representation of the creation date.

I would suggest not changing the current API for the sake of a future API. In particular, I think it's better to still use pure timestamp for the creation date and not datetime objects. In addition, I think this kind of feature request should probably have a prior discussion before.

This is fair, but that timestamp does represent a specific date and time, which for the purposes of investigating the metadata seems like a good idea

Yes but I see integral timestamps as being more generic than datetime objects. Datetime objects come with an interface and more responsibilities than just an integer-like value. We can also pass fractional timestamps as mtime as they are truncated by int(mtime). I don't think the standard library needs to support datetime objects as it can be easily done on the user's side by .timestamp() before and possibly check for tzinfo being None.

Now, for a third-party package, I think it would be fine to have that feature, but for the standard library, this requires to maintain those new objects. I suggest rewriting the initial prototype using pure integers and present a PoC on Discourse to gain traction.

I also think that if we change gzip to support mtime with datetime objects, we should also change other compression modules for the same reasons and again there is a maintenance cost that we usually want to avoid.

It also gives us a chance to raise an error if a user is in danger of leaking location information unknowingly

Well, it's the responsibility of the server to use UTC datetime objects, namely they should instead use datetime.datetime.now(datetime.UTC).timestamp() instead of datetime.datetime.now(). I don't think we should take care of this on our side. This is more of a privacy issue.

This also enforces the RFC.

I fail to understand how this enforces the RFC as I could only find:

This gives the most recent modification time of the original
file being compressed. The time is in Unix format, i.e.,
seconds since 00:00:00 GMT, Jan. 1, 1970. If the compressed
data did not come from a file, MTIME is set to the time at which
compression started.


We can however mention in the docs how to use a datetime object with the compression function, as well as what it entails from a privacy PoV. Note that while it can leak the location, it's only if the querier knows when the compression has been done and they can compare with whether the embedded timestamp is correct or not (assuming again that the server used a non-UTC datetime object to create the timestamp). They can easily create a timezone-agnostic timestamp by doing time.mktime(time.gmtime()). Am I missing something else?

@gchqdev227
Copy link
Author

I would suggest not changing the current API for the sake of a future API.

This makes sense – the intention was to avoid making multiple changes in a single feature, but I appreciate this perspective.

I don't think the standard library needs to support datetime objects as it can be easily done on the user's side by .timestamp() before and possibly check for tzinfo being None.

Fair enough.

Now, for a third-party package, I think it would be fine to have that feature, but for the standard library, this requires to maintain those new objects. I suggest rewriting the initial prototype using pure integers and present a PoC on Discourse to gain traction.

I can certainly do this. By "present a PoC on Discourse", presumably you mean starting a new thread here?

This also enforces the RFC.

I fail to understand how this enforces the RFC as I could only find:

This gives the most recent modification time of the original
file being compressed. The time is in Unix format, i.e.,
seconds since 00:00:00 GMT, Jan. 1, 1970. If the compressed
data did not come from a file, MTIME is set to the time at which
compression started.

I'm referring to the "GMT" part I suppose – preventing scenarios where a user creates a naive datetime object in a non-GMT timezone, and converts to a timestamp without specifying their timezone. I agree that time.mktime(time.gmtime()) does solve this issue (as specifying the timezone to datetime.now() does), but it's still a footgun. However, I appreciate that it's not the responsibility of Python to mitigate against every footgun if it lessens the maintainability!

Note that while it can leak the location, it's only if the querier knows when the compression has been done and they can compare with whether the embedded timestamp is correct or not (assuming again that the server used a non-UTC datetime object to create the timestamp).

The classic example is web requests, which admittedly isn't the most common use case of the gzip module, but that scenario would result in a delta between response time and GZIP header.

@picnixz
Copy link
Member

picnixz commented Mar 2, 2025

I can certainly do this. By "present a PoC on Discourse", presumably you mean starting a new thread here?

yes

However, I appreciate that it's not the responsibility of Python to mitigate against every footgun if it lessens the maintainability!

In the case of this specific feature request, that's what I think. The maintenance cost outweights the benefits IMO. We can however improve the docs and/or indicate the use of time.mktime.

but that scenario would result in a delta between response time and GZIP header.

that's what I meant by querier knowing when it has been done. Btw, wouldn't it be a use-case if the server serves gzip compressed files? I think Django has a gzip compression middleware but I don't know if it imports gzip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
Status: No status
Development

No branches or pull requests

3 participants