Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v7 should have a counter option? #717

Open
rogusdev opened this issue Nov 5, 2023 · 30 comments
Open

v7 should have a counter option? #717

rogusdev opened this issue Nov 5, 2023 · 30 comments

Comments

@rogusdev
Copy link

rogusdev commented Nov 5, 2023

Per the new draft "With this method rand_a section of UUIDv7 SHOULD be utilized as fixed-length dedicated counter bits that are incremented by one for every UUID generation."

That's in the "Fixed-Length Dedicated Counter Bits (Method 1)" section under https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format-04#monotonicity_counters which is directly linked from https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format-04#section-5.2 under the rand_a explanation.

To my reading, that means it should have a counter, like v6 (and v1), but instead the current v7 builder implementation https://docs.rs/uuid/latest/src/uuid/v7.rs.html#47-53 is just random for rand_a. Am I missing something on this one? If not, and this is a desirable feature, I might look into putting up a PR to add the counter support.

@KodrAus
Copy link
Member

KodrAus commented Nov 15, 2023

Hi @rogusdev 👋

I'd be keen to go catch up on some of the discussion around that. Those sections look like recommendations rather than requirements, so I think uuid's current implementation that fills the entire 74 bits of rand_a and rand_b with random data is still compliant.

It seems a bit overkill to me to need a clock, a source of randomness, and a monotonic counter to generate V7 UUIDs, but we could consider adding *_monotonic variants of the v7 methods that accept a ClockSequence like v1 and v6 do.

@KodrAus
Copy link
Member

KodrAus commented Nov 16, 2023

The test vector for UUIDv7 appears to use a fully random value for rand_a and rand_b: https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format-04#name-example-of-a-uuidv7-value

I think we should consider supporting this, but don't think we need to block stabilizing the current APIs on it.

@rogusdev
Copy link
Author

rogusdev commented Nov 16, 2023

From my perspective, while the main point of uuid v7 is to put the clock at the front, so that db indexes can btree on them properly, counters add additional safety mechanisms that are quite to my taste: if you are generating ids in a single thread, you are guaranteed that no duplicates can happen, if there are less than the counter limit per millisecond. While the degree of randomness involved otherwise is certainly extremely unlikely to have duplicates, I am a big fan of guarantees.

Which is in fact why they put it into the RFC, as that is also a feature in many other popular id libraries.

@rogusdev rogusdev reopened this Nov 16, 2023
@rogusdev
Copy link
Author

That said, "SHOULD" is indeed a recommendation, rather than requirement, and they list multiple alternatives. So I support having a separate set of ctor functions to call for the counter version(s).

@sergeyprokhorenko
Copy link

Hi Ashley,

I'm Sergey Prokhorenko, a contributor to rfc4122bis and a counter enthusiast.

Some developers implement the counter:

UUIDv7 with the counter for PostgreSQL is currently being developed in C by Andrey Borodin:

But it seems to me that Rust is also a good tool for such development.

@KodrAus
Copy link
Member

KodrAus commented Dec 13, 2023

We've currently actually got access to a counter in our Timestamp type, because that's where it gets stashed for v1 and v6 UUIDs. How does the following sound to you:

  1. Don't feature gate Timestamp.counter on v1 or v6; just make it always available.
  2. Use the counter value if it's non-zero in Uuid::new_v7, otherwise use random bytes.
  3. Add a Builder::from_timestamp_millis_counter(millis: u64, counter: u16, random_bytes: [u8; 6]) method.

An alternative for 2. would be to add Uuid::new_v7_counter and Uuid::now_v7_counter methods that use the counter.

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Dec 14, 2023

We've currently actually got access to a counter in our Timestamp type, because that's where it gets stashed for v1 and v6 UUIDs. How does the following sound to you:

  1. Don't feature gate Timestamp.counter on v1 or v6; just make it always available.

Like the vast majority, I am convinced that only the seventh version is worthy of attention. There is no need to waste time and effort implementing other versions.

  1. Use the counter value if it's non-zero in Uuid::new_v7, otherwise use random bytes.

I'm against. A counter initialized to zero increases the collision probability. Worse, the counter value may be the same as the random value used instead of the counter at the beginning of the millisecond.

I prefer a counter that is initialized to a random value at the beginning of each millisecond. But this may reduce the actual capacity of the counter. Therefore, the leftmost bit of the counter should be initialized to zero and/or the timestamp should be incremented when the counter overflows.

  1. Add a Builder::from_timestamp_millis_counter(millis: u64, counter: u16, random_bytes: [u8; 6]) method.

The timestamp in the seventh version is shorter, and you missed the ver and var segments.
My suggestion:

Segment length, bits Field in RFC Segment content
48 unix_ts_ms Timestamp
4 ver Version
1 rand_a Counter segment initialized to zero
11 rand_a Counter segment initialized with a pseudorandom number
2 var Variant
30 rand_b Counter segment initialized with a pseudorandom number
32 rand_b UUIDv7 segment filled with a pseudorandom number
64 Optional segment to the right of the UUID in key database columns

An alternative for 2. would be to add Uuid::new_v7_counter and Uuid::now_v7_counter methods that use the counter.

If we are talking about initializing the counter every millisecond with a random value and incrementing within a millisecond, then I am for it.

@KodrAus
Copy link
Member

KodrAus commented Jan 16, 2024

This library already implements v1 and v6 versions and has some established API conventions that we need to remain consistency with. I think we should be able to support everything suggested with some new Uuid::new_v7_counter and Uuid::now_v7_counter methods that accept an impl ClockSequence. The implementation of that ClockSequence is what would handle time and randomness.

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Jan 16, 2024

I highly recommend looking at the UUIDv7 implementation in PostgreSQL v17 by Andrey Borodin

@sergeyprokhorenko
Copy link

@KodrAus
Copy link
Member

KodrAus commented Jan 18, 2024

Thanks @sergeyprokhorenko 👍 Having a good reference implementation to point at will definitely be helpful

@sergeyprokhorenko
Copy link

Another good implementation:
https://github.com/LiosK/uuid7-rs/tree/8b6362ac9bd4b0a24639ebe8365e4f6f6cacec75
https://crates.io/crates/uuid7

It is mentioned in this benchmark.

Due to the excessively long counter (initialized with a random number every millisecond), less entropy is generated, which requires a lot of resources.

But it's worth trying to replace rand::rngs::OsRng with openssl-rand for better performance.

@sergeyprokhorenko
Copy link

Patch UUID v7 (commitfest record)

This patch has already been successfully tested: https://ardentperf.com/2024/02/03/uuid-benchmark-war/

@brianbruggeman
Copy link

Thought/Question: Why not add a builder pattern with defaults? That doesn't eliminate the backwards compatibility and can improve clarity around generating the UUID.

@ehiggs
Copy link

ehiggs commented Apr 14, 2024

The API in the Rust ULID crate provides monotonicity using an increment(&self) -> Option<Ulid> method. This means you can create batches of IDs by generating a single ulid and then increment the random part to create all the batches.

If you know that you are generating a batch then an increment method should be "sufficient logic for organizing the creation order of those one-thousand UUIDs" (quoting from the RFC).

Here's is an example:

use ulid::Ulid;
use uuid::Uuid;

fn main() {
    let mut uuids = Vec::with_capacity(10);
    for _ in 0..10 {
        let uuid = Uuid::now_v7();
        uuids.push(uuid);
    }
    for uuid in uuids.iter() {
        println!("{:<12} {uuid}", "uuidv7:");
    }
    println!("");
    let mut ulids = Vec::with_capacity(10);
    let mut ulid = Ulid::new();
    for _ in 0..10 {
        ulids.push(ulid);
        ulid = ulid.increment().unwrap();
    }
    for ulid in ulids.iter() {
        println!("{:<12} {ulid}", "ulid:");
    }
}

Is taking blatant inspiration from the excellent work in the ULID crate ok? Of course. ULID is the very first referred to time-based unique ID referenced in the new RFC.

Note that this also gives more control over creation of batches. If one creates a batch of 1000 UUIDs, using an internal counter you have no control over whether they will be in the same millisecond. This means you will have part of the batch spread across different timestamps and the random component will jump around for each millisecond. Using an increment you can keep the millisecond and the random component stable. It's not required in the specification but seems useful to be able to trivially see if UUIDs are part of the same batch.

@sergeyprokhorenko
Copy link

If one creates a batch of 1000 UUIDs, using an internal counter you have no control over whether they will be in the same millisecond.

This is not true

@sergeyprokhorenko
Copy link

New functions for generating UUIDv7 with a counter in the ClickHouse DBMS: link

The structure is exactly the same as https://crates.io/crates/uuid7 and https://www.npmjs.com/package/uuidv7

@KodrAus
Copy link
Member

KodrAus commented May 13, 2024

I've started working on an implementation over in #755 that supports respecting the counter value when constructing v7 UUIDs in a way that lets a caller decide how wide they want that counter to be so the guarantees about ordering and uniqueness are configurable. It's still just a draft so will ping once it's ready.

@ehiggs
Copy link

ehiggs commented May 13, 2024

This is not true

How might you allow the caller to control whether the UUIDs are generated for the same millisecond if the timer is internal to the uuid generator?

@sergeyprokhorenko
Copy link

This is not true

How might you allow the caller to control whether the UUIDs are generated for the same millisecond if the timer is internal to the uuid generator?

UUIDs generated in different milliseconds have different timestamp values. UUIDs generated at the same millisecond have the same timestamp values.

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented May 14, 2024

@KodrAus

I've started working on an implementation over in #755 that supports respecting the counter value when constructing v7 UUIDs in a way that lets a caller decide how wide they want that counter to be so the guarantees about ordering and uniqueness are configurable. It's still just a draft so will ping once it's ready.

RFC9562 "Universally Unique IDentifiers (UUID)" has already been published and entered into force.

Differences between the implementation for DBMS ClickHouse and the implementation for DBMS PostgreSQL:

  1. Counter is 42 bits, not 18. The counter have no guard bits, every bit is initialized with random number on time ticks.
  2. By default counter is shared between threads. Alternative function generateUUIDv7ThreadMonotonic() provides thread-local counter.

In both implementations in case the counter overflows, the timestamp field is incremented by 1 and the counter is reset to a random new start value (guard bit to zero).

I would also recommend to add the ability to shift the timestamp by the value specified by the formal parameter timestamp_shift of the function uuidv7(timestamp_shift). This will provide at least two benefits:

  • Random shifts of the timestamp at fixed intervals when generating UUIDs will allow to distribute indexes in the database across several pages and thereby reduce the read and write queues
  • Hiding the true time of creation of a record for better security

This is allowed by RFC9562:

Implementations MAY alter the actual timestamp. Some examples include security considerations around providing a real-clock value within a UUID to 1) correct inaccurate clocks, 2) handle leap seconds, or 3) obtain a millisecond value by dividing by 1024 (or some other value) for performance reasons (instead of dividing a number of microseconds by 1000). This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time

It is quite difficult to implement the exclusion (in the unpredictable future) of leap seconds required by RFC9562:

UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded.

@KodrAus
Copy link
Member

KodrAus commented May 15, 2024

In both implementations in case the counter overflows, the timestamp field is incremented by 1 and the counter is reset to a random new start value (guard bit to zero).

In order to maintain the invariant that UUIDs are sortable by the order they’re generated in this would need to increment the timestamp for all subsequently generated UUIDs in that millisecond, right?

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented May 15, 2024

In order to maintain the invariant that UUIDs are sortable by the order they’re generated in this would need to increment the timestamp for all subsequently generated UUIDs in that millisecond, right?

Yes that's right. Moreover, this may continue for several consecutive milliseconds of high generation rate until the real time catches up with the timestamp.

But with a guard bit and a counter 18 bits or longer, a counter overflow is an impossible event (until a quantum computer came along).

@sergeyprokhorenko
Copy link

UUIDv7 extension written in RUST, implemented in PostgreSQL

@wdhwg001
Copy link

wdhwg001 commented May 21, 2024

Long story short, I would propose that the only “safe” option for UUID v7 is to use ALL of its rand_a and rand_b (i.e. the entire 74 bits) as a randomly initialized plus-one counter. We should not reduce random bits to make room solely for a dedicated counter.

Here is a detailed explaination:

Due to the Birthday Problem , the length of random bits is crucial to its security requirements, and the frequency matters too. The approximation formula goes like:

$${\displaystyle p(n,d)\approx 1-e^{-{\frac {n^{2}}{2d}}}}$$

Where $n$ stands for the number of random generations (i.e. $n$ people in the class), $d$ is the size of randomization (i.e. $d$ is 365 in birthday problem), and ${\displaystyle p(n,d)}$ is the chance of having a collision.

We can see that $d$'s order of magnitude really matters.

For example, our uuid crate currently generate a UUID v4 within 30ns and UUID v1 in 6ns, which is already super fast (~33,333 V4 UUIDs per ms, ~166,667 V1 UUIDs per ms). Let's assume we went times faster and generated 1,000,000 UUIDs in 1ms, and in a year we would have a whopping $n=31,557,600,000,000,000$ UUIDs.

UUID v4 has 122 random bits in total, which means its "random space size" is $d=2^{122}$, and its chance of having at least one collision in a year is about $0.0093647919$%. However, given the fact it is not timely separated, The chance will be significantly higher. In 20 years it would be $3.676794155524$%, 100 years it would be $60.814604446401$%.

ULID has 80 random bits in total, but it's 1ms separated. Its chance of having a collision in 1ms is about $0.0000000000413558$% (4.1e-13), and the chance of having at least one collision in a year would be $1.2957289570$%, which comes from $1-(1-p(n,d))^{365.25\times 3600 \times 1000}$. In 20 years, it would be $22.9595324907$%.

UUID v7 has 74 random bits, and is also 1ms delimited. Its chance is $0.0000000026469826$% (2.6e-11), and the chance in a year would be $56.6018096684$% , which is already DANGEROUS ENOUGH, as in 20 years it would be $99.9999943839$% 😱.

From the calculation above we may already understand that the random bits are super important, so we should not omit any of them, and we should limit the number of randomizations.

Suppose that we are using the 74 bit random + counter combined approach to generate 1,048,576 v7 UUIDs every millisecond, but we have 1,024 monotonic counters, each of them will generate 1,024 v7 UUIDs every millisecond.

Then, given that we are generating $2^{10}$ UUIDs in a counter, which essentially means we are approximately "eating 10 random bits", so our random bits go down to 64, and we are having 1,024 items in a millisecond.

Thus, in a year we only have $0.0895905465$% chance of encountering a collision, and in 20 years it's only $1.7766423111$%, in 100 years it's only $8.5731240227$%.

If overflow happens, I would say by default we may simply let it overflow and then trim off the new 0b1 bit, but we do need to provide an option to Err out and let the application wait till the next millisecond arrives. This conforms to the "Counter Rollover Handling" in RFC 9562 Section 6.2 .

This approach is not entirely unguessable, but still conforms to the "Monotonic Random" method in RFC 9562 Section 6.2. It is also the tradition how ULID handles monotonic counters while balancing unguessability.

Additionally, it would be purely an evil and insecure approach to add up a random value other than 1, because this will be a significant damage of the random bit size and make the entire method vulnerable to collisions. A +1 counter on random bits inside a millisecond is already very secure and unguessable enough - it is already much more secure than MongoDB's ObjectID (per-second counter) and Twitter's Snowflake (zero started per millisecond counter).

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented May 21, 2024

Long story short...

All of these calculations make the mistaken implicit assumption that all generated UUIDs end up in a common database table where they can collide.

The absurdity of these calculations can be confirmed by the following example: UUIDv7s with a counter generated by one generator will never collide, even if random parts are completely removed from them. This is just as true as integer keys generated by autoincrement do not collide. The random part is necessary for the uniqueness of UUIDs generated by different generators (when merging database tables or at distributed UUID generation).

By the way, the counter is initialized with a random number every millisecond. Therefore, it is incorrect to say that the counter reduces the random part.

@wdhwg001
Copy link

All of these calculations make the mistaken implicit assumption that all generated UUIDs end up in a common database table where they can collide.

This is not a mistaken assumption - the collision resistance analysis itself is based on how many UUIDs are generated in parallel and put together into a single storage collection, it won't exist if mitigations could be taken in downstream services. This is a too-good-to-be-true implication that I dare not to assume.

UUIDv7s with a counter generated by one generator will never collide

This is also the most ideal scenario. In the real world we usually have multiple workers and pods working together, and some of them are potentially lack of time synchronizations. In the worst case, UUIDs are actually generated on mobile/wasm client side - which means collision is still a thing to carefully consider even with the counter.


Essentially I believe we are making a very similar proposal - all rand_a and rand_b bits should be randomly initialized and the counter is working on top of random values. In detail, the random value is refreshed every millisecond, and the counter is a plus-one strategy adding to the random value.

This however is not all methods RFC 9562 suggested - I was trying to analyze all methods in the RFC and conclude the best option we should implement.

Here are all methods mentioned and accepted in the RFC:

  • Fixed Bit-Length Dedicated Counter (Method 1)

    This split the entire 74 bit into two fixed parts: the first part is a dedicated counter, the second part is a dedicated random part. The random part is re-generated for every new UUID. For the dedicated counter part, it includes two strategies:

    • Fixed Bit-Length Dedicated Counter Seeding, where the first part is randomly initialized and increased within 1 millisecond, and re-initialized in the next millisecond.

    • Fixed Bit-Length Dedicated Counter Length, which means we choose 12-42 bits to be used as a high precision timestamp based counter that at least add up 1 for every new UUID.

    No matter which detailed method we choose, this entire method is not ideal, as we both agree that re-generating random part will certainly make it vulnerable to collisions.

  • Monotonic Random (Method 2)

    This reuses the random data as a startpoint of the counter. However, this includes two implementations:

    • Plus 1 for each UUID: this is what we suggested and is very collision-proof, but this slightly damages the unguessability.

    • Plus a random number for each UUID: this actually degrades the level of collision-proof down to being slightly better than naive random. Although being recommended by the RFC spec, I would suggest not to consider this at all.

  • Replace Leftmost Random Bits with Increased Clock Precision (Method 3)

    This separates bits into two - the first >=12-bit part is a high precision clock, and the other bits are used as random values. This is not a collision-resistant way at all as it scarifies random bits, and the high precision timestamp it requires is not cross-platform available.

@sergeyprokhorenko
Copy link

@wdhwg001

The authors of RFC 9562 did not intend to reject any of the contributors' proposals, but simply to combine all of these proposals. This does not mean that all these proposals were justified.

But in real implementations Method 1 is used, which is not at all divided into two opposing strategies, but is single. Your proposal differs only in the absence of the random part generated for each individual UUID. It kills the unguessability without giving anything in return.

@wdhwg001
Copy link

wdhwg001 commented May 23, 2024

which is not at all divided into two opposing strategies, but is single

True. Utilizing high precision timestamp in the counter is not a good option at all, I wouldn't prefer this strategy either in real world implementations.

It kills the unguessability without giving anything in return.

I believe it is the nature of using a counter. No matter how we decide to add per-UUID random bits, it will always put exponential damages to the collision resistance, even just a small amount of bits. As most UUID users use V4 nowadays, they really need the same or at least equivalent collision resistance over years and billions of generations.

In real world scenarios, having randomized entrypoint for each millisecond still provides basic unguessibility. My proposal is exactly how ULID handle counters, and UUID v7 has pretty much the same structure compared to ULID - which means, using all random bits as a randomly initialized plus-one counter is a battle-tested production proof approach.

And in fact, predictability won't be not a concern if you choose not to use the counter at all, and it truly maximized the level of unguessibility. This is also my recommendation in scenarios where sub-millisecond predictability really matters more than avoiding collisions.

Using a counter should always be optional, because there are simply no silver bullets to cope both without compromises.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants