Provide support for UUID type (a.k.a. GUID) #2224
Comments
CC @jskeet who was involved in the discussion. |
Is this still on the roadmap? |
No, this is not on our roadmap. |
Any news? |
Why not? |
Regardless of whether this is on the roadmap or not, I can see two possible designs: Option 1
Option 2
|
A UUID represented by two uint64 values will have a problem with endianness: see the how-do-i-represent-a-uuid-in-a-protobuf-message discussion. |
Doesn't RFC 4122 section 4.1.2 present a solution to the problem identified by @kucint? |
bump |
I think @gmabey is correct in that RFC 4122 section 4.1.2 presents a solution to allow a UUID to be encoded in binary (as opposed to text) and allow the endianness to be handled by the protobuf encoding layer (as opposed to at the application layer). This approach would have a proto-spec like below.
This would be encoded from a Guid as follows.
// Guid.TryWriteBytes stores the first three fields little-endian and the last eight bytes in order.
Span<byte> bytes = stackalloc byte[16];
guid.TryWriteBytes(bytes);
TimeLow = BinaryPrimitives.ReadUInt32LittleEndian(bytes.Slice(0, 4));
TimeMid = BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(4, 2));
TimeHiAndVersion = BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(6, 2));
ClockSeqHiAndReserved = bytes[8];
ClockSeqLow = bytes[9];
// Bytes 10..15 hold the node; the mask drops the two clock_seq bytes read along with it.
Node = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8, 8)) & 0x0000FFFFFFFFFFFF;
... and decoded as follows.
checked
{
    Span<byte> bytes = stackalloc byte[16];
    BinaryPrimitives.WriteUInt32LittleEndian(bytes.Slice(0, 4), TimeLow);
    BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(4, 2), (ushort)TimeMid);
    BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(6, 2), (ushort)TimeHiAndVersion);
    // Write the node first, then overwrite bytes 8 and 9 with the clock sequence.
    BinaryPrimitives.WriteUInt64BigEndian(bytes.Slice(8), Node);
    bytes[8] = (byte)ClockSeqHiAndReserved;
    bytes[9] = (byte)ClockSeqLow;
    return new Guid(bytes);
}
I'm keen to get people's thoughts and feedback on this approach. |
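For readers outside .NET, the six-field split described above can be sketched in a few lines. This is only an illustration, not the proto spec or generated code from the comment: it uses Python's standard `uuid` module, and the helper names `to_fields` and `from_fields` are made up for the example.

```python
import uuid

def to_fields(u: uuid.UUID) -> dict:
    """Split a UUID into the six RFC 4122 section 4.1.2 fields as unsigned ints."""
    time_low, time_mid, time_hi, clock_hi, clock_low, node = u.fields
    return {
        "time_low": time_low,                    # 32 bits
        "time_mid": time_mid,                    # 16 bits
        "time_hi_and_version": time_hi,          # 16 bits
        "clock_seq_hi_and_reserved": clock_hi,   # 8 bits
        "clock_seq_low": clock_low,              # 8 bits
        "node": node,                            # 48 bits
    }

def from_fields(f: dict) -> uuid.UUID:
    """Rebuild the UUID from the six integer fields."""
    return uuid.UUID(fields=(
        f["time_low"],
        f["time_mid"],
        f["time_hi_and_version"],
        f["clock_seq_hi_and_reserved"],
        f["clock_seq_low"],
        f["node"],
    ))

example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
fields = to_fields(example)
round_tripped = from_fields(fields)
```

Because every field is a plain unsigned integer, byte order never appears in this code; whatever serialises the integers decides it.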
I've just done some benchmarking of string-encoded vs little endian byte array-encoded vs RFC 4122-encoded UUIDs in .NET 5 and the results are below.
The aggregate performance of the three approaches is:
So the RFC 4122-based representation is fastest in both serialisation and deserialisation. However, the Note also |
Neither string nor byte array is a good solution from a security perspective because they can be abused in certain kinds of DoS or fuzzing attacks. I like the idea of a specific implementation. |
@tdhintz Are you referring to something like message WellKnownUUID {
uint32 w1 = 1;
uint32 w2 = 2;
uint32 w3 = 3;
uint32 w4 = 4;
} There certainly isn't much variability to that structure! |
@gmabey you need to have the 6 UUID elements defined as per RFC 4122 for this approach to work because those 6 elements are all defined as unsigned integers and therefore by defining the message this way, we avoid any endianness issues. For example, I'm assuming the It would of course be possible to specify how to read the two 16-bit values from You would then specify the first This would likely result in a smaller message size (i.e. with 2 fields instead of 6), but carries the inconvenience of the converters having to deal with picking the 6 fields defined by RFC 4122 from the 2 64-bit In the end, this is effectively defining the message as a 16-byte binary buffer and leaving it up to the converters to properly read/write the 6 values defined by RFC 4122 from/to the buffer. |
@gmabey Yes, avoid use of arrays and strings (which really are just a specialized array). |
I've now tested structuring the UUID message as two 64-bit fixed integers. The proto spec is below.
This is encoded from a Guid as follows.
Span<byte> bytes = stackalloc byte[16];
guid.TryWriteBytes(bytes);
// MSB -> LSB: time_low (32 bits) | time_mid (16 bits) | time_hi_and_version (16 bits).
High64 = ((ulong)BinaryPrimitives.ReadUInt32LittleEndian(bytes.Slice(0, 4)) << 32) // time_low
    | ((ulong)BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(4, 2)) << 16) // time_mid
    | BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(6, 2)); // time_hi_and_version
// MSB -> LSB: clock_seq_hi_and_reserved (8 bits) | clock_seq_low (8 bits) | node (48 bits).
Low64 = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8, 8));
It is converted back to a Guid as follows.
Span<byte> bytes = stackalloc byte[16];
BinaryPrimitives.WriteUInt32LittleEndian(bytes.Slice(0, 4), (uint)(High64 >> 32));
BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(4, 2), (ushort)((High64 >> 16) & 0xFFFF));
BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(6, 2), (ushort)(High64 & 0xFFFF));
BinaryPrimitives.WriteUInt64BigEndian(bytes.Slice(8, 8), Low64);
return new Guid(bytes);
The conversion/serialisation/deserialisation benchmarks are below.
Convert & serialise is 29.767 ns and deserialise & convert is 105.698 ns. So this approach is much faster and more efficient on the wire than defining the |
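As a cross-check of the High64/Low64 layout in the snippets above, here is a minimal Python sketch. It assumes the two halves are simply the upper and lower 64 bits of the UUID read as one big-endian 128-bit integer. The helper names are hypothetical, and `fixed64_payload` only mimics the eight little-endian bytes protobuf uses for a fixed64 value; it is not a protobuf API.

```python
import struct
import uuid

MASK64 = (1 << 64) - 1

def split_uuid(u: uuid.UUID) -> tuple:
    """Return (high64, low64) of the UUID's 128-bit big-endian integer value."""
    return u.int >> 64, u.int & MASK64

def join_uuid(high64: int, low64: int) -> uuid.UUID:
    """Rebuild the UUID from its two 64-bit halves."""
    return uuid.UUID(int=(high64 << 64) | low64)

def fixed64_payload(value: int) -> bytes:
    """Eight little-endian bytes, matching the wire layout of a fixed64 value."""
    return struct.pack("<Q", value)

example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
high64, low64 = split_uuid(example)
restored = join_uuid(high64, low64)
```

Here `high64` packs time_low, time_mid and time_hi_and_version from most to least significant, which is the same arrangement the C# shifts produce.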
For UUIDv4, the most significant 64 bits can be either positive or negative when read as a signed integer, while the least significant 64 bits are always negative. So, shouldn't it be |
fixed64 and sfixed64 are the same on the wire. The only difference is how their bits are interpreted by the sending/receiving endpoints. In this case, the bits are interpreted by breaking the 64 bits into the components defined by section 4.1.2 of RFC 4122. That is, the 64 bits are never interpreted as a positive 64-bit integer nor a negative 64-bit integer. Therefore, it is fine to encode it either way. However, since the sign is semantically irrelevant, I think it’s better to encode as fixed64. It also makes the code that writes the UUID to the message simpler in .NET. |
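The point that the signed and unsigned fixed-width encodings share the same bytes can be checked with a short sketch. It uses Python's `struct` module as a stand-in for the eight little-endian bytes of a 64-bit fixed-width field; `as_signed64` is a hypothetical helper, and the sample value is arbitrary apart from having its top bit set, as the low half of an RFC 4122 UUID does because of the variant bits.

```python
import struct

def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit pattern as a two's-complement signed value."""
    return value - (1 << 64) if value >= (1 << 63) else value

low64 = 0x8899AABBCCDDEEFF          # top bit set, so the signed reading is negative
signed_view = as_signed64(low64)

unsigned_wire = struct.pack("<Q", low64)       # unsigned reading of the 64 bits
signed_wire = struct.pack("<q", signed_view)   # signed reading of the same 64 bits
same_bytes = unsigned_wire == signed_wire
```

The sign only changes how an endpoint reads the 64 bits, not the bytes that are transmitted.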
Are there any plans for implementing this? I guess we have a lot of good examples and speed tests available, so it could be easily integrated. |
We have no plans at this time to integrate this. |
Most folks I have seen simply use a string for this. I don't really see a path forward for this. The cost of adding this as a specific well known type is quite high as compared to having a third_party simply package their preferred proto with some helper functions. |
I certainly just use a string (without curly braces) for this myself. The benefits that seem to be implied were such an effort to be undertaken are: speed, security (I guess), and interoperability. The argument that a third_party function could implement this could be used against most of the data types currently supported by WellKnownTypes -- since you could serialize a date time to string and have a third_party function deserialize it. Please do reply to this message if you dispute any of these points:
|
I would dispute points (3) and (4). Point 3: Some quick searches for code search indicates that Point 4: Getting something cross language tends to nail down a bunch of painful corners in ways that are not helpful. The WellKnownType for time actually causes frequent impedance mismatches with language bindings that have slightly different concepts of time. |
@fowles I don't dispute your rebuttal to Point 3, but I didn't define "very" -- wahoo! :-D Regarding your Point 4 rebuttal -- do you see any "painful corners" associated with UUIDs? Or, are you just complaining about corner cases of Google.Protobuf.WellKnownTypes.Timestamp? (if so, wrong thread ;-) |
I don't know UUIDs particularly well. It is possible they are simple enough that they won't hit such impedance mismatches. Regardless, point (3) alone is enough for me to continue to feel confident that this doesn't rise to the bar where we want to add it to the core of protobuf. |
Is protobuf meant for Google's use only? 🤔 Quick check on the node.js world: protobufjs has 7,194,982 weekly downloads, UUID has 59,615,024 weekly downloads (and it's only one implementation). UUID is a standard, rfc4122, and its increasing adoption has been doing wonders to increase interoperability and reliability in various areas. Realistically speaking, for a team starting a new project, the fact that protobuf has no UUID support is more likely to result in the team not using protobuf, than not using UUID. As for point 4, there would be no impedance mismatches, since it is a standard. Yes, I've been to that place, having code that uses milliseconds since epoch talking to code that uses seconds since epoch, but UUID is UUID, that's the whole point of its existence. |
protobuf is intended for public use, but Google maintains full ownership of it and its evolution. The flip side is that Google also provides the vast majority of the maintenance cost of it. As you note, UUIDs are a standard and libraries exist in most languages to parse them to and from strings. I would advise any group that wants to encode them in protobufs to use a |
This isn't mutually exclusive. UUIDs are able to be freely represented as bytes or encoded as hex values and written to strings (where the majority of observed use cases show up). Adding an explicit UUID type I guess provides...validation as the primary feature request? I dunno what else you really need an explicit type here for. JSON doesn't have a UUID type (hell, at least protobuf has bytes which JSON does not) and nobody stops using JSON for lack of "support". Furthermore, consider the JSON/protobuf interop requirements....for JSON, they'll just end up as a string again, so what have we really done here? |
Not using JSON in our project, but also you are right in saying that JSON provides no UUID support, but I would like to add that JSON doesn't add any support for data types like date/times/etc...? From the sounds of things from fowles comment, it seems like a cost thing as Google provides the vast majority of the maintenance costs? Could it just need a financial push in that direction? Anyways, looking forwards to seeing how this might (or not) resolve in the future. I'll continue to use a string or bytes field with the variable name prepended with Guid for now. |
I don't think the lack of UUID support in JSON is reasonable justification for the lack of support in Protobuf. JSON is string (UTF-8) encoded, while Protobuf is binary-encoded. Therefore, the performance penalty of encoding UUIDs to/from their hex-encoded string representations in JSON is expected and therefore acceptable. Conversely, Protobuf is binary encoded and therefore there is an expectation that the performance penalties/overheads of encoding/decoding through strings are avoided. For example, integers are sent/received in binary representation in Protobuf, rather than encoded/decoded as UTF-8 strings. Why is that? JSON encodes integers as strings, so why not Protobuf? The reason is performance and efficiency. JSON and Protobuf are different encodings with different goals and performance characteristics. If that were not the case (i.e., if JSON and Protobuf were completely interchangeable), then why have Protobuf at all? Why doesn't everyone just use JSON instead of Protobuf? There is an opportunity to encode/decode UUIDs in binary form. In fact UUIDs are really just 128-bit integers. Why should 64-bit integers be encoded in binary but 128-bit integers encoded as hex-encoded strings? Byte strings and custom UUID message types are both heap-allocated in the code generated by i.e., support for a UUID well-known type would substantially increase serialisation/deserialisation performance of UUID fields. And isn't that a key reason for using Protobuf over JSON? Performance? |
I do agree with @bill-poole in that validation and performance are the primary drivers for our teams in having native UUID type support in protobuf. Offering a comparison to JSON as a reason for not providing native UUID support in protobuf, I feel, has conflated this conversation a bit. |
FWIW, I just did a scan of Google's internal protos and all fields named "uuid" I've observed are encoded as string or bytes. If this has been good enough for all of Google, I am really wondering if the performance wins we're claiming here are a red herring? Further, what's preventing folks from making message types that encode this as a pair of |
Again, I think this is conflating the discussion. It’s not about “is this good enough for Google and therefore good enough for the broader community” but more a question of efficiency in terms of (IMHO) network and memory serialization/deserialization. Think “durable storage technologies” and why they have native support for UUID types. Optimization. While I am sure that Google is very conscious of optimization, their search infrastructure is less resource limited than many companies. Contrast this to other domains where we are even more attentive to resource constraints. In particular edge IoT, AR/VR for telemedicine, public transportation, defense, etc. My kind request is to not focus on what Google does but to focus on the broader scientific benefit. |
I agree with @mprimeaux.
I guess it depends on how much of a performance penalty would be deemed by Google to be sufficient to warrant doing something about it. How much slower is too much slower? I did benchmarking a while back for encoding UUIDs as strings and byte arrays in Protobuf and posted the results earlier in this thread (see #2224 (comment)).
Nothing, which is what I did and turned out to be the fastest option available without defining a well-known type that can be serialised/deserialised directly from/to a "primitive" UUID type (see #2224 (comment)). The point is that it requires converting through an intermediate heap-allocated type (with the two |
I code in .NET 4.8, where Guid does not have the method |
One way would be to use the You could instead try defining your own
When you instantiate a If this works (and I haven't confirmed that it does/will), then it will be faster than using the You should then be able to populate the |
One thing I haven’t seen spelt out here is the relative sizes of the different options, including the overhead of having the data in a child message which would be needed for a well-known type. For comparison (and fun) I also included a hypothetical native 128-bit fixed width type (which could be added as there are still 3 possible wire type numbers left!).
By the way, JSON was brought up almost as an argument against well known type for UUID. But actually I see it as the strongest reason in favour of UUID well-known type. The trade off is between:
I haven’t included the message with separate RFC 4122 fields as it’s misguided in my view (and, with all those varints, would be a nightmare for me to compute the size). I also dispute the snippet above where various bits of two 64-bits numbers are sliced up with different endianness. Both of those fail to recognise that a UUID is simply a sequence of 16 bytes and nothing more. There is no possible endianness issue with that. It could have been generated as multiple fields by RFC 4122, in which case care must be taken with endianness when converting those fields to or from the byte sequence, but that’s not the serialisation layer’s problem. |
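The size comparison can be approximated with a little arithmetic on the protobuf wire format. The sketch below is not the commenter's table: it assumes field numbers 1 and 2 (so every tag fits in one byte), non-zero values (proto3 omits default scalars), and a hypothetical native 128-bit type costing one tag byte plus 16 payload bytes. All function names are invented for the example.

```python
def varint_size(n: int) -> int:
    """Number of bytes a non-negative integer occupies as a protobuf varint."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

def tag_size(field_number: int) -> int:
    """A tag is the varint of (field_number << 3 | wire_type); the wire type does not change its size."""
    return varint_size(field_number << 3)

def length_delimited(field_number: int, payload_len: int) -> int:
    """Tag + length prefix + payload, used for string, bytes and embedded messages."""
    return tag_size(field_number) + varint_size(payload_len) + payload_len

def fixed64_field(field_number: int) -> int:
    """Tag + eight payload bytes."""
    return tag_size(field_number) + 8

sizes = {
    "string field, 36-character hex text": length_delimited(1, 36),        # 38 bytes
    "bytes field, 16 raw bytes": length_delimited(1, 16),                  # 18 bytes
    "message wrapping one bytes field": length_delimited(1, length_delimited(1, 16)),          # 20 bytes
    "message wrapping two fixed64 fields": length_delimited(1, fixed64_field(1) + fixed64_field(2)),  # 20 bytes
    "hypothetical native 128-bit field": tag_size(1) + 16,                 # 17 bytes
}
```

Under these assumptions, wrapping the data in a child message costs two extra bytes over a bare bytes field, and the two message layouts come out the same size.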
I agree and found there was a significant performance penalty for doing so.
There can actually be endianness issues with UUIDs. Microsoft frameworks (e.g. .NET) tend to represent UUIDs in little endian format in memory, whereas RFC 4122 recommends big endian binary representation for network transmission. I think big endian representation is therefore the correct/best representation for Protobuf, but it means that Microsoft frameworks like .NET need to convert between little endian and big endian (which can be done with a SIMD shuffle instruction). Note that I posted the results of performance testing on .NET for My performance testing for little endian
My performance testing for big endian two-
Based on the above results, the I imagine that a WKT based on a native 128-bit field would be the simplest and most performant representation. However in the absence of a native 128-bit type, I think a two- |
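The byte-order difference mentioned above is small enough to show directly. In this sketch, Python's `UUID.bytes` gives the RFC 4122 network order and `UUID.bytes_le` gives the layout with the first three fields little-endian, which is the layout .NET's Guid byte array uses. The index table `DOTNET_TO_RFC` and the function `dotnet_to_rfc` are illustrative names, not part of any library.

```python
import uuid

# Position i of the output takes byte DOTNET_TO_RFC[i] of the input:
# the first three fields (4, 2 and 2 bytes) are reversed, the last eight bytes stay put.
DOTNET_TO_RFC = (3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15)

def dotnet_to_rfc(raw: bytes) -> bytes:
    """Shuffle a 16-byte little-endian-field layout into RFC 4122 network order."""
    return bytes(raw[i] for i in DOTNET_TO_RFC)

example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
rfc_bytes = example.bytes        # 00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff
dotnet_bytes = example.bytes_le  # 33 22 11 00 55 44 77 66 88 99 aa bb cc dd ee ff
converted = dotnet_to_rfc(dotnet_bytes)
```

The shuffle is its own inverse, so the same table converts in both directions; this is the fixed permutation that a single SIMD shuffle instruction can perform.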
But those tests are for C# / .NET, as you said. For UUID to be a WKT it has to make sense for all languages, and actually C# is one of the less used languages for protobuf (and certainly not why I'm here). People picking apart nanosecond-level performance are more likely to be using C++. Also, those tests assume that every protobuf UUID field will be converted to the language's native UUID type when deserialised, but I think this would be a small minority usage. Much of the time it would just be used directly as a byte array, regardless of language. Obviously, having the data already in bytes format is most convenient for this. The two comments above about Google's code seem to support this. None of the comments here, except yours, have focused on conversion to C#'s GUID type, especially performance of it. I do agree conversion methods should exist (in all languages where they make sense), but they shouldn't be the focus of the discussion. Using two int64 members would be super confusing - you've basically invented your own new representation for UUIDs, and the existing selection is already confusing enough! It sounds like I'm backtracking in my support for a UUID well known type - why don't I just use bytes if that's what I want? But, like I said in my previous comment, a well known type is still useful because it allows you to effectively communicate that this field is a UUID (rather than just a code comment saying so, or your own custom UUID message) and it gives you the standard JSON string representation. |
@JamesOldfield, I provided the performance testing results for .NET because those were the results I had previously posted on this topic that I thought were relevant to what you said. I would be very interested to see how the performance compares between the various options in C++ and other platforms. I very much encourage that testing to be done. I hypothesise that a similar performance difference between the options will be seen across multiple platforms. If it turns out that At the very least, I don't think it would be prudent to assume that a
I don't think that's true. RFC 4122 specifies the 128-bit layout, and every 128-bit value comprises a high 64-bit and low 64-bit value. i.e., the only complexity a two |
There is currently no native support for a UUID type, so we used a string type. [Issue]: protocolbuffers/protobuf#2224 Signed-off-by: Antonio Gisondi <antonio.gisondi@secomind.com>
Thanks for this thread of discussion providing real-world solutions for people solving real-world problems in the here-and-now. At least I don't have to wait for the governance of protobuf to get down to earth... BTW: it looks like I will use the 2 x fixed64 solution ( #2224 ), whose principles I was already leaning towards before going deeper into the protobuf rabbit hole of official definitions... Thanks to bill-poole for providing that and showing its performance. |
What.... an open issue since 2k16 |
We use v4 UUIDs quite a bit in our AI / ML "workloads" and literally every bit counts "over the wire and on disk", which I think @bill-poole addressed in part in his experiments above. There's been spirited conversation with a diversity of positions, which has made for a healthy set of discussions. My sincere hope is we find a way forward to have this supported as an intrinsic type. |
What the fxxk? This issue continues in 2023?? Google must support UUID/GUID! We all want an official solution, not endless discussions! |
ah open since 2016, hope 2024 gives this a kick somewhere... |
In our Dreams |
Waiting for the anniversary |
Filing on behalf of a customer:
Protobuf lacks Uuid (Guid in .NET) support out of the box. It would have been nice to have a Well-Known Type (like we do with Timestamp to represent Date and Times) since Uuids are pretty common, particularly in distributed systems.