Customize UUIDv7 generation for database partitioning #130843

Open
sscherfke opened this issue Mar 4, 2025 · 14 comments
Labels: stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@sscherfke

sscherfke commented Mar 4, 2025

Feature or enhancement

Proposal:

Support for UUIDv7 via uuid7() has just landed in main: #89083

One use case for UUIDv7 is using it as a PK in databases. Since it is time-based, it can also be used as a partition key (e.g., to use one partition for each day). In order to calculate the partition range, you need to calculate the "minimal" UUID for a given date (i.e., take 2025-04-05 00:00:00 and use all zeros for the random bits => 0196033f-4400-7000-8000-000000000000).

I'm totally fine with uuid.uuid7() not taking any arguments, but it would be cool if the building blocks for generating a UUIDv7 based on custom unix_ts_ms, counter, and tail could be exposed as well.

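import datetime
from uuid import UUID

# Version 7 and RFC 4122 variant bits (uuid.py keeps a private constant of this name)
_RFC_4122_VERSION_7_FLAGS = (7 << 76) | (0x8000 << 48)


def min_uuid7(date: datetime.datetime | None = None) -> UUID: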
    # This is just for convenience and could be left out:
    if date is None:
        today = datetime.date.today()
        date = datetime.datetime(
            today.year, today.month, today.day, tzinfo=datetime.UTC
        )

    # Provide a custom timestamp and a custom counter and tail
    timestamp_ms = int(date.timestamp() * 1_000)
    counter, tail = 0, 0

    # The remainder is the same as in uuid7():
    unix_ts_ms = timestamp_ms & 0xFFFF_FFFF_FFFF
    counter_msbs = counter >> 30
    # keep 12 counter's MSBs and clear version bits
    counter_hi = counter_msbs & 0x0FFF
    # keep 30 counter's LSBs and clear variant bits
    counter_lo = counter & 0x3FFF_FFFF
    # ensure that the tail is always a 32-bit integer (by construction,
    # it is already the case, but future interfaces may allow the user
    # to specify the random tail)
    tail &= 0xFFFF_FFFF
    
    int_uuid_7 = unix_ts_ms << 80
    int_uuid_7 |= counter_hi << 64
    int_uuid_7 |= counter_lo << 32
    int_uuid_7 |= tail
    # by construction, the variant and version bits are already cleared
    int_uuid_7 |= _RFC_4122_VERSION_7_FLAGS
    return UUID(int=int_uuid_7)

>>> min_uuid7(datetime.datetime(2025, 4, 5, tzinfo=datetime.UTC))
UUID('0196033f-4400-7000-8000-000000000000')
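
To illustrate the partitioning use case, here is a minimal sketch that derives the half-open UUID range for one UTC day from two minimal UUIDv7 values. The names `min_uuid7_from_ms` and `day_partition_bounds` are hypothetical helpers, not part of the uuid module, and the bit layout is assumed to match the code above.

```python
import datetime
from uuid import UUID

# Version 7 in bits 76-79, RFC 4122 variant (0b10) in bits 62-63.
_V7_FLAGS = (7 << 76) | (0x8000 << 48)
_EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def min_uuid7_from_ms(timestamp_ms: int) -> UUID:
    """Smallest UUIDv7 for a millisecond timestamp (counter and tail are 0)."""
    return UUID(int=((timestamp_ms & 0xFFFF_FFFF_FFFF) << 80) | _V7_FLAGS)


def _to_ms(dt: datetime.datetime) -> int:
    """Exact integer milliseconds since the Unix epoch for an aware datetime."""
    return (dt - _EPOCH) // datetime.timedelta(milliseconds=1)


def day_partition_bounds(day: datetime.date) -> tuple[UUID, UUID]:
    """Return (lower, upper) so that lower <= id < upper selects that UTC day."""
    start = datetime.datetime(
        day.year, day.month, day.day, tzinfo=datetime.timezone.utc
    )
    end = start + datetime.timedelta(days=1)
    return min_uuid7_from_ms(_to_ms(start)), min_uuid7_from_ms(_to_ms(end))


lower, upper = day_partition_bounds(datetime.date(2025, 4, 5))
print(lower)  # 0196033f-4400-7000-8000-000000000000
```

Because UUID objects compare by their integer value, the two bounds can be used directly in a range check such as `lower <= record_id < upper`.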

Another useful addition might be a helper that recovers the original datetime/timestamp from a UUIDv7. I understand that this is additional code that might be slightly out of context, but such functions - like uuid7() - would probably not need to be changed, but are not trivial to implement for "normal users".

These functions could look like this:

def uuid_to_timestamp_ms(uuid: UUID) -> int:
    # UUID.version is None unless the variant is RFC 4122
    if uuid.version != 7:
        raise ValueError(f"{uuid} is not a v7 UUID.")
    # the 48 most significant bits hold the Unix timestamp in milliseconds
    return int.from_bytes(uuid.bytes[:6], "big")


def uuid_to_datetime(uuid: UUID) -> datetime.datetime:
    ms_since_epoch = uuid_to_timestamp_ms(uuid)
    return datetime.datetime.fromtimestamp(ms_since_epoch / 1_000, tz=datetime.UTC)

>>> d = datetime.datetime(2025, 4, 5, tzinfo=datetime.UTC)
>>> u = min_uuid7(d)
>>> assert uuid_to_datetime(u) == d

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

@sscherfke sscherfke added the type-feature A feature request or enhancement label Mar 4, 2025
@picnixz picnixz added the stdlib Python modules in the Lib dir label Mar 4, 2025
@picnixz
Member

picnixz commented Mar 4, 2025

> Another useful addition might be a helper that recovers the original datetime/timestamp from a UUIDv7

For this one, I plan to somehow make it work under #120878. I don't know how to make it work properly though because the notion of time_lo/time_mid/time_hi is different for UUIDv1/v6 and UUIDv7 (the first two have 60-bit timestamp, UUIDv7 has 48-bit timestamp).

@picnixz
Member

picnixz commented Mar 4, 2025

As for min_uuid7(), I think it's better to actually support timestamp in general, not date and datetime. To make a UTC timestamp, one could do time.mktime(time.gmtime()) (IIRC). WDYT?
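
A hedged aside on obtaining such a timestamp: time.mktime() interprets its argument as local time, so a round trip through time.gmtime() only matches UTC when the local zone is UTC. The sketch below shows conversions that do not depend on the local zone. The helper name `to_unix_ms` is hypothetical.

```python
import calendar
import datetime
import time

_EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def to_unix_ms(dt: datetime.datetime) -> int:
    """Convert an aware datetime to integer milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        # A naive datetime has no defined offset from UTC, so reject it.
        raise ValueError("expected a timezone-aware datetime")
    # Integer arithmetic on timedeltas avoids float rounding.
    return (dt - _EPOCH) // datetime.timedelta(milliseconds=1)


# time.time() already counts seconds since the epoch, independent of time zone.
now_s = time.time()

# calendar.timegm() is the inverse of time.gmtime().
same_s = calendar.timegm(time.gmtime(now_s))

midnight = datetime.datetime(2025, 4, 5, tzinfo=datetime.timezone.utc)
print(to_unix_ms(midnight))  # 1743811200000
```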

@sscherfke
Author

Using a timestamp would be okay, but the users’ mindset for this function (or at least for my use case ;-)) is "I need to create a new DB partition for today / the next month which is YYYY-MM-DD 0 o'clock. Gimme the minimal UUIDv7 for that!", so supporting datetimes would be very convenient.

@picnixz
Member

picnixz commented Mar 4, 2025

It's possible to convert the datetime object to a timestamp via .timestamp(). One reason for accepting a timestamp is essentially to make the interface more flexible for the standard library and easier to maintain (maintenance cost is something that needs to be taken into account). Also, a timestamp is timezone agnostic.

I'm leaving for 10 days so I won't be able to reply except on mobile.

@sscherfke
Author

I understand your reasoning and something with a timestamp is better than nothing. :-)

@picnixz
Member

picnixz commented Mar 4, 2025

We actually had a similar discussion on whether or not to accept datetime objects for gzip in #128584. I see reasons not to, but I also see reasons to. I think we need to find an equilibrium between what would be the most useful and what would be the best solution for future compatibility (remember that once we decide on something for the standard library, it becomes kind of "frozen" and requires a deprecation period for any changes we make).

@picnixz picnixz self-assigned this Mar 4, 2025
@sergeyprokhorenko

> Using a timestamp would be okay, but the users’ mindset for this function (or at least for my use case ;-)) is "I need to create a new DB partition for today / the next month which is YYYY-MM-DD 0 o'clock. Gimme the minimal UUIDv7 for that!", so supporting datetimes would be very convenient.

It is enough to take the left segment of the required length from the UUID as the partition key. The accuracy does not necessarily have to be exactly the same as 24 hours.
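
One possible reading of this suggestion, sketched under the assumption that the partition key is a fixed number of leading bits of the UUID: since the first 48 bits of a UUIDv7 are the millisecond timestamp, truncating them groups UUIDs into power-of-two time buckets. The helper `partition_key` and its default width are hypothetical.

```python
from uuid import UUID


def partition_key(u: UUID, prefix_bits: int = 22) -> int:
    """Return the leading prefix_bits of the UUID as an integer key.

    With the default of 22 bits, 48 - 22 = 26 timestamp bits are dropped,
    so each key covers 2**26 ms (roughly 18.6 hours).
    """
    if not 1 <= prefix_bits <= 48:
        raise ValueError("prefix_bits must stay within the 48-bit timestamp")
    return u.int >> (128 - prefix_bits)


u = UUID("0196033f-4400-7000-8000-000000000000")
print(partition_key(u))      # 25984
print(partition_key(u, 48))  # 1743811200000
```

The buckets produced this way do not line up with calendar days, which is the trade-off this approach accepts.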

@sergeyprokhorenko

sergeyprokhorenko commented Mar 6, 2025

> Another useful addition might be a helper that recovers the original datetime/timestamp from a UUIDv7
>
> For this one, I plan to somehow make it work under #120878. I don't know how to make it work properly though because the notion of time_lo/time_mid/time_hi is different for UUIDv1/v6 and UUIDv7 (the first two have 60-bit timestamp, UUIDv7 has 48-bit timestamp).

Nobody really needs version 6. Only version 1 and 7. See the uuid_extract_timestamp() function here for an example

@sscherfke
Author

> It is enough to take the left segment of the required length from the UUID as the partition key. The accuracy does not necessarily have to be exactly the same as 24 hours.

You still need to pass in a custom timestamp/datetime to calculate that segment for the given time and in advance.

@picnixz
Member

picnixz commented Mar 28, 2025

First of all, I've added the possibility to recover the timestamp for UUIDv7 objects as part of UUID.time. I'll think about how to make the timestamp customization easier. For instance: what should I do with respect to the last timestamp? I'll read the RFC for that and try to make something before the beta (otherwise we'll only be able to add it in 3.15, very sorry for this).

@picnixz
Member

picnixz commented Apr 5, 2025

The RFC does not formally define the timestamp alteration of a UUID:

> Implementations MAY alter the actual timestamp. Some examples include security considerations around providing a real-clock value within a UUID to 1) correct inaccurate clocks, 2) handle leap seconds, or 3) obtain a millisecond value by dividing by 1024 (or some other value) for performance reasons (instead of dividing a number of microseconds by 1000). This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time. If UUIDs do not need to be frequently generated, the UUIDv1 or UUIDv6 timestamp can simply be the system time multiplied by the number of 100-nanosecond intervals per system-time interval.

So one way to do it is to ignore the stored timestamp entirely, but not the counter. In other words, if someone specifies an explicit timestamp, then they won't advance the internal global UUIDv7 clock, but they will advance the counter. The reason why advancing the timestamp is a bad idea is as follows: let's say I use the last possible UUIDv7 timestamp as my custom timestamp. Then subsequent calls to uuid7() will be locked in the future, which is clearly a bad idea. OTOH, if I overflow the counter, this is not really an issue as it's meant to be re-randomized every millisecond. What do you think?
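
To make that proposal concrete, here is a minimal sketch, not the CPython implementation. The class `Uuid7Generator` and its attributes are hypothetical, and a 42-bit counter seeded with 41 random bits is assumed, following the layout used in the issue description.

```python
import secrets
import time
from typing import Optional
from uuid import UUID

_V7_FLAGS = (7 << 76) | (0x8000 << 48)
_COUNTER_BITS = 42


class Uuid7Generator:
    """Hypothetical generator: explicit timestamps never move the internal clock."""

    def __init__(self) -> None:
        self._last_ms = 0
        self._counter = secrets.randbits(_COUNTER_BITS - 1)

    def _next_counter(self) -> bool:
        """Advance the counter; return True if it overflowed and was re-seeded."""
        self._counter += 1
        if self._counter >> _COUNTER_BITS:
            self._counter = secrets.randbits(_COUNTER_BITS - 1)
            return True
        return False

    def generate(self, timestamp_ms: Optional[int] = None) -> UUID:
        if timestamp_ms is None:
            now_ms = time.time_ns() // 1_000_000
            if now_ms > self._last_ms:
                # New millisecond: move the clock and re-seed the counter.
                self._last_ms = now_ms
                self._counter = secrets.randbits(_COUNTER_BITS - 1)
            elif self._next_counter():
                # Counter exhausted within one millisecond: borrow the next one.
                self._last_ms += 1
            ts = self._last_ms
        else:
            # Explicit timestamp: advance the shared counter, leave the clock alone.
            self._next_counter()
            ts = timestamp_ms
        value = (ts & 0xFFFF_FFFF_FFFF) << 80
        value |= ((self._counter >> 30) & 0x0FFF) << 64
        value |= (self._counter & 0x3FFF_FFFF) << 32
        value |= secrets.randbits(32)
        return UUID(int=value | _V7_FLAGS)
```

With this shape, passing the largest possible timestamp once does not lock later `generate()` calls in the future. One open question the sketch leaves unresolved: a counter re-seed triggered by an explicit call could break ordering for clock-based UUIDs created within the same millisecond.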

> 2025-04-05 00:00:00 and use all zeros for the random bits

I would prefer not exposing this. It's possible to ignore the tail bits by working with UUID.time only (which for UUIDv7 would return the 48 MSBs that you're interested in).

@sscherfke
Author

sscherfke commented Apr 5, 2025

Here are the uuid7-related functions I currently need:

  • uuid7() generates an RFC compliant UUID7
  • uuid7_for_date(datetime) generates a UUID7 for a specific date. Tail and Counter are random. Does not modify the global state. This is very useful if you want to migrate old (database) records for which you already have a timestamp and which should be kept in the same order (important if the UUID7 is used as primary key and/or partitioning key).
  • min_uuid7(datetime) generates a UUID7 for the specified date. Tail and Counter are all 0. Does not modify the global state. Needed to calculate partition borders or selection boundaries (give me all records >= today and < tomorrow).
  • uuid7_to_timestamp(uuid)/uuid7_to_datetime(uuid) can be used to get the original time. Useful for testing, logging and presentation (for users).
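
As an illustration of the second item, here is a minimal sketch that builds a UUIDv7 for a caller-supplied time with random counter and tail bits and without touching any shared state. It assumes a millisecond timestamp as input, and the name `uuid7_for_timestamp` is hypothetical.

```python
import secrets
from uuid import UUID

# Version 7 in bits 76-79, RFC 4122 variant (0b10) in bits 62-63.
_V7_FLAGS = (7 << 76) | (0x8000 << 48)


def uuid7_for_timestamp(timestamp_ms: int) -> UUID:
    """Build a UUIDv7 for a given Unix time in ms with random counter and tail."""
    if not 0 <= timestamp_ms < 1 << 48:
        raise ValueError("timestamp_ms must fit in 48 bits")
    rand_a = secrets.randbits(12)  # lands in bits 64-75
    rand_b = secrets.randbits(62)  # lands in bits 0-61
    value = (timestamp_ms << 80) | (rand_a << 64) | rand_b
    return UUID(int=value | _V7_FLAGS)


u = uuid7_for_timestamp(1743811200000)
print(u.version)  # 7
```

UUIDs created this way sort by their timestamp across different milliseconds. Records that share the same millisecond end up in random relative order, so a migration that must preserve their order would need a deterministic counter instead.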

Again, I could understand if this is nothing that you would put into the stdlib, and I have implemented everything I need, so I am not dependent on it.
However, these are the functions that are needed if you want to use UUID7 as PK in a database, which is afaik one of its most important use cases, and I think that these functions would be very helpful for many other users, too. :-)

Here is a GIST with the code: https://gist.github.com/sscherfke/bf68627762d0b71843e7193c0b1654f8

@picnixz
Member

picnixz commented Apr 5, 2025

IMO, all those functions, except for uuid7_for_date (which would take a timestamp and not a datetime), should be left out of the standard library. The uuid module is really meant for RFC-compliant functionality. uuid7_to_timestamp(uuid)/uuid7_to_datetime(uuid) is more of a recipe, so it shouldn't be part of the stdlib either.

@ericvsmith
Member

I think it would be useful to document these in a uuid7 recipe section, then.

picnixz added a commit that referenced this issue Apr 7, 2025
seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
4 participants