Provide support for UUID type (a.k.a. GUID) #2224

Open
jtattermusch opened this issue Oct 6, 2016 · 48 comments

Comments

@jtattermusch
Contributor

Filing on behalf of a customer:
Protobuf lacks Uuid (Guid in .NET) support out of the box. It would have been nice to have a Well-Known Type (like we do with Timestamp to represent Date and Times) since Uuids are pretty common, particularly in distributed systems.

@jtattermusch
Contributor Author

CC @jskeet who was involved in the discussion.

@DanFTRX

DanFTRX commented Aug 2, 2018

Is this still on the roadmap?

@xfxyjwf
Contributor

xfxyjwf commented Aug 2, 2018

No, this is not on our roadmap.

@listepo

listepo commented Aug 26, 2019

Any news?

@mihaimyh

mihaimyh commented Nov 1, 2019

No, this is not on our roadmap.

Why not?

@jtattermusch
Contributor Author

Regardless of whether this is on the roadmap or not, I can see two possible designs:

Option 1

// Message representing a version 4 universally unique identifier. See
// rfc/4122#section-4.4 for additional information.
message UUID {
  // The two uint64 fields below should be populated with the most and least
  // significant 64 bits of a version 4 UUID
  // (e.g., https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html).
  uint64 most_significant_uuid_bits = 1;
  uint64 least_significant_uuid_bits = 2;
}

Option 2

message UUID {
  string value = 1;
}
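
For comparison, here is a minimal Python sketch of how the two options would carry the same identifier, using only the standard `uuid` module. The helper names (`to_option1`, `from_option1`, `to_option2`) are hypothetical and not part of any protobuf API; they only illustrate what a converter on each side would have to do.

```python
import uuid
from typing import Tuple

MASK64 = (1 << 64) - 1


def to_option1(u: uuid.UUID) -> Tuple[int, int]:
    """Option 1: split into (most_significant_uuid_bits, least_significant_uuid_bits)."""
    return (u.int >> 64) & MASK64, u.int & MASK64


def from_option1(most: int, least: int) -> uuid.UUID:
    """Rebuild the UUID from the two unsigned 64-bit halves."""
    return uuid.UUID(int=(most << 64) | least)


def to_option2(u: uuid.UUID) -> str:
    """Option 2: the canonical lowercase, hyphenated 36-character text form."""
    return str(u)


u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
most, least = to_option1(u)

# The uuid module also accepts braces on input, so the text form
# should be normalised before it is written into the message.
braced = uuid.UUID("{00112233-4455-6677-8899-aabbccddeeff}")
```

Option 1 needs an agreed bit order for the two halves, while Option 2 needs an agreed text form (case, hyphens, braces).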

@kucint

kucint commented Jan 3, 2020

A UUID represented by two uint64 values will have a problem with endianness; see the how-do-i-represent-a-uuid-in-a-protobuf-message discussion.

@gmabey

gmabey commented Feb 19, 2020

doesn't RFC4122 section 4.1.2 present a solution to the problem identified by @kucint ?

@onesteveo

bump

@bill-poole

I think @gmabey is correct in that RFC 4122 section 4.1.2 presents a solution to allow a UUID to be encoded in binary (as opposed to text) and allow the endianness to be handled by the protobuf encoding layer (as opposed to at the application layer). This approach would have a proto-spec like below.

// A UUID, encoded in accordance with section 4.1.2 of RFC 4122.
message Uuid {
	// The low field of the timestamp (32 bits).
	fixed32 time_low = 1;

	// The middle field of the timestamp (16 bits).
	uint32 time_mid = 2;

	// The high field of the timestamp multiplexed with the version number (16 bits).
	uint32 time_hi_and_version = 3;

	// The high field of the clock sequence multiplexed with the variant (8 bits).
	uint32 clock_seq_hi_and_reserved = 4;

	// The low field of the clock sequence (8 bits).
	uint32 clock_seq_low = 5;

	// The spatially unique node identifier (48 bits).
	uint64 node = 6;
}

This would be encoded from a System.Guid in .NET/C# as follows.

// Guid.TryWriteBytes emits the first three fields in little-endian order.
Span<byte> bytes = stackalloc byte[16];
guid.TryWriteBytes(bytes);
TimeLow = BinaryPrimitives.ReadUInt32LittleEndian(bytes.Slice(0, 4));
TimeMid = BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(4, 2));
TimeHiAndVersion = BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(6, 2));
ClockSeqHiAndReserved = bytes[8];
ClockSeqLow = bytes[9];
// Read bytes 8-15 big-endian, then mask off the two clock_seq bytes to keep the 48-bit node.
Node = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8, 8)) & 0x0000FFFFFFFFFFFF;

... and decoded as follows.

checked
{
	Span<byte> bytes = stackalloc byte[16];
	BinaryPrimitives.WriteUInt32LittleEndian(bytes.Slice(0, 4), TimeLow);
	BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(4, 2), (ushort)TimeMid);
	BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(6, 2), (ushort)TimeHiAndVersion);
	// Writes Node into bytes 8-15; bytes 8 and 9 are then overwritten with the clock sequence.
	BinaryPrimitives.WriteUInt64BigEndian(bytes.Slice(8), Node);
	bytes[8] = (byte)ClockSeqHiAndReserved;
	bytes[9] = (byte)ClockSeqLow;
	return new Guid(bytes);
}
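
For readers on other platforms, the same six-field split can be sketched in Python, where the standard `uuid.UUID.fields` attribute already exposes the RFC 4122 section 4.1.2 components as unsigned integers. The dictionary keys mirror the proto field names above; the helper names are hypothetical.

```python
import uuid


def to_rfc4122_fields(u: uuid.UUID) -> dict:
    """Split a UUID into the six unsigned integers of RFC 4122 section 4.1.2."""
    time_low, time_mid, time_hi, seq_hi, seq_low, node = u.fields
    return {
        "time_low": time_low,                    # 32 bits
        "time_mid": time_mid,                    # 16 bits
        "time_hi_and_version": time_hi,          # 16 bits
        "clock_seq_hi_and_reserved": seq_hi,     # 8 bits
        "clock_seq_low": seq_low,                # 8 bits
        "node": node,                            # 48 bits
    }


def from_rfc4122_fields(f: dict) -> uuid.UUID:
    """Rebuild the UUID; the constructor range-checks every field."""
    return uuid.UUID(fields=(
        f["time_low"],
        f["time_mid"],
        f["time_hi_and_version"],
        f["clock_seq_hi_and_reserved"],
        f["clock_seq_low"],
        f["node"],
    ))


u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
fields = to_rfc4122_fields(u)
```

Because every field travels as an integer, no byte-order decision is left to the converter, which is the point of this layout.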

I'm keen to get people's thoughts and feedback on this approach.

@bill-poole

I've just done some benchmarking of string-encoded vs little endian byte array-encoded vs RFC 4122-encoded UUIDs in .NET 5 and the results are below.

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ConvertToStringUuid | 52.498 ns | 0.2467 ns | 0.2308 ns |
| ConvertToLittleEndianBinaryUuid | 64.575 ns | 0.2746 ns | 0.2293 ns |
| ConvertToRfc4122Uuid | 10.849 ns | 0.0662 ns | 0.0620 ns |
| SerialiseStringUuid | 54.187 ns | 0.4128 ns | 0.5511 ns |
| SerialiseLittleEndianByteArrayUuid | 25.091 ns | 0.1539 ns | 0.1364 ns |
| SerialiseRfc4122Uuid | 61.701 ns | 0.4367 ns | 0.4085 ns |
| DeserialiseStringUuid | 224.519 ns | 1.2474 ns | 1.1058 ns |
| DeserialiseLittleEndianByteArrayUuid | 217.149 ns | 0.9792 ns | 0.8177 ns |
| DeserialiseRfc4122Uuid | 150.113 ns | 0.4288 ns | 0.3801 ns |
| ConvertFromStringUuid | 82.985 ns | 0.2628 ns | 0.2459 ns |
| ConvertFromLittleEndianByteArrayUuid | 2.475 ns | 0.0123 ns | 0.0115 ns |
| ConvertFromRfc4122Uuid | 9.682 ns | 0.0210 ns | 0.0186 ns |

The aggregate performance of the three approaches is:

  • Convert and serialise StringUuid: 106.685 ns
  • Convert and serialise LittleEndianBinaryUuid: 89.666 ns
  • Convert and serialise Rfc4122Uuid: 72.55 ns
  • Deserialise and convert StringUuid: 307.504 ns
  • Deserialise and convert LittleEndianBinaryUuid: 219.624 ns
  • Deserialise and convert Rfc4122Uuid: 159.795 ns

So the RFC 4122-based representation is fastest in both serialisation and deserialisation.

However, the StringUuid serialises to 38 bytes, the LittleEndianBinaryUuid to 18 bytes and the Rfc4122Uuid to 27 bytes, according to the CalculateSize() method on each message type. So, while the RFC 4122-based encoding is faster, it is 50% larger than the little endian binary encoding on the wire.

Note also ByteString.UnsafeWrap (see #7645) will improve the ConvertToLittleEndianBinaryUuid performance when it is available.
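
The sizes quoted above can be reproduced by hand. The Python sketch below is only an estimate under stated assumptions: each message is a top-level proto3 message, every field number is below 16 (so each tag is one byte), and the 27-byte figure is taken to be the worst case, where every varint field holds its largest value. The function names are hypothetical.

```python
def varint_size(value: int) -> int:
    """Bytes needed to encode a non-negative integer as a base-128 varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


TAG = 1  # a field number in 1..15 plus its wire type fits in a one-byte tag


def length_delimited_field(payload_len: int) -> int:
    """Size of a string or bytes field: tag + length prefix + payload."""
    return TAG + varint_size(payload_len) + payload_len


def varint_field(value: int) -> int:
    """Size of a uint32/uint64 field; proto3 omits fields holding zero."""
    return 0 if value == 0 else TAG + varint_size(value)


def rfc4122_message_size(time_low, time_mid, time_hi, seq_hi, seq_low, node):
    """Size of the six-field message: one fixed32 plus five varint fields."""
    fixed32_field = 0 if time_low == 0 else TAG + 4
    varints = (time_mid, time_hi, seq_hi, seq_low, node)
    return fixed32_field + sum(varint_field(v) for v in varints)


string_uuid = length_delimited_field(36)  # 36-character text form
bytes_uuid = length_delimited_field(16)   # 16 raw bytes
rfc4122_worst = rfc4122_message_size(
    0xFFFFFFFF, 0xFFFF, 0xFFFF, 0xFF, 0xFF, 0xFFFFFFFFFFFF
)
print(string_uuid, bytes_uuid, rfc4122_worst)  # prints "38 18 27"
```

Since five of the six fields are varints, the six-field message has no fixed size: small field values shrink it, and 27 bytes is its upper bound.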

@tdhintz

tdhintz commented Jan 12, 2021

Neither a string nor a byte array is a good solution from a security perspective, because both can be abused in certain kinds of DoS or fuzzing attacks. I like the idea of a specific implementation.

@gmabey

gmabey commented Jan 13, 2021

@tdhintz Are you referring to something like

message WellKnownUUID {
    uint32 w1 = 1;
    uint32 w2 = 2;
    uint32 w3 = 3;
    uint32 w4 = 4;
}

There certainly isn't much variability to that structure!
Perhaps @billpoole-mi would be kind enough to benchmark that approach?

@bill-poole

@gmabey you need to have the 6 UUID elements defined as per RFC 4122 for this approach to work because those 6 elements are all defined as unsigned integers and therefore by defining the message this way, we avoid any endianness issues.

For example, I'm assuming the w2 element in your WellKnownUUID message would correspond to the time_mid and time_hi_and_version RFC 4122 fields, but it isn't specified whether the high 16 bits are the time_hi_and_version or the low 16 bits.

It would of course be possible to specify how to read the two 16-bit values from w2 as part of the documentation of the WellKnownUUID message such that the converters to/from this type on each platform do so correctly. But if you're willing to move the responsibility for this into the converters, you might as well go all the way with it and define the message with two ulong fields.

You would then specify that the high 64-bit field holds time_low in its high 32 bits, time_mid in the next 16 bits and time_hi_and_version in its low 16 bits. You'd apply similar logic to the low 64-bit field.

This would likely result in a smaller message size (i.e. with 2 fields instead of 6), but carries the inconvenience of the converters having to deal with picking the 6 fields defined by RFC 4122 from the 2 64-bit ulong fields.

In the end, this is effectively defining the message as a 16-byte binary buffer and leaving it up to the converters to properly read/write the 6 values defined by RFC 4122 from/to the buffer.

@tdhintz

tdhintz commented Jan 14, 2021

@gmabey Yes, avoid use of arrays and strings (which really are just a specialized array).

@bill-poole

I've now tested structuring the UUID message as two 64-bit fixed integers. The proto spec is below.

// A UUID, encoded in accordance with section 4.1.2 of RFC 4122.
message Uuid {
	// The high 64 bits of the UUID - MSB -> LSB: time_low (32 bits) | time_mid (16 bits) | time_hi_and_version (16 bits).
	fixed64 high64 = 1;

	// The low 64 bits of the UUID - MSB -> LSB: clock_seq_hi_and_reserved (8 bits) | clock_seq_low (8 bits) | node (48 bits).
	fixed64 low64 = 2;
}

This is encoded from a System.Guid in .NET as follows.

Span<byte> bytes = stackalloc byte[16];
guid.TryWriteBytes(bytes);

// MSB -> LSB: time_low (32 bits) | time_mid (16 bits) | time_hi_and_version (16 bits).
High64 = ((ulong)BinaryPrimitives.ReadUInt32LittleEndian(bytes.Slice(0, 4)) << 32) // time_low
	| ((ulong)BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(4, 2)) << 16) // time_mid
	| BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(6, 2)); // time_hi_and_version

// MSB -> LSB: clock_seq_hi_and_reserved (8 bits) | clock_seq_low (8 bits) | node (48 bits).
Low64 = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8, 8));

It is converted back to a System.Guid as follows.

Span<byte> bytes = stackalloc byte[16];
BinaryPrimitives.WriteUInt32LittleEndian(bytes.Slice(0, 4), (uint)(High64 >> 32));
BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(4, 2), (ushort)((High64 >> 16) & 0xFFFF));
BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(6, 2), (ushort)(High64 & 0xFFFF));
BinaryPrimitives.WriteUInt64BigEndian(bytes.Slice(8, 8), Low64);
return new Guid(bytes);

The Uuid message size is 18 bytes (as opposed to 27 bytes when defining the Uuid message with the 6 individual fields defined by RFC 4122).
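
As a cross-check on the layout and the 18-byte figure, here is a minimal Python sketch that packs the two halves and hand-encodes the message. It assumes the proto above (fields 1 and 2, both `fixed64`) and ignores the proto3 rule that a zero-valued field is omitted; the helper names are hypothetical, and a real project would use generated protobuf code instead.

```python
import struct
import uuid


def to_high_low(u: uuid.UUID):
    """Pack the six RFC 4122 fields into (high64, low64), MSB first."""
    time_low, time_mid, time_hi, seq_hi, seq_low, node = u.fields
    high64 = (time_low << 32) | (time_mid << 16) | time_hi
    low64 = (seq_hi << 56) | (seq_low << 48) | node
    return high64, low64


def from_high_low(high64: int, low64: int) -> uuid.UUID:
    return uuid.UUID(int=(high64 << 64) | low64)


def encode_uuid_message(high64: int, low64: int) -> bytes:
    """Hand-rolled wire encoding of the two fixed64 fields.

    tag = (field_number << 3) | wire_type, and wire type 1 means a
    64-bit little-endian payload, so the tags are 0x09 and 0x11.
    """
    return struct.pack("<BQBQ", 0x09, high64, 0x11, low64)


def decode_uuid_message(wire: bytes):
    tag1, high64, tag2, low64 = struct.unpack("<BQBQ", wire)
    if (tag1, tag2) != (0x09, 0x11):
        raise ValueError("unexpected field tags")
    return high64, low64


u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
high64, low64 = to_high_low(u)
wire = encode_uuid_message(high64, low64)
```

Note that `high64` equals the top 64 bits of the UUID read as one big-endian 128-bit integer, so no per-field handling is needed on platforms that already store UUIDs in RFC 4122 order.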

The conversion/serialisation/deserialisation benchmarks are below.

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ConvertToUuid | 8.825 ns | 0.0665 ns | 0.0622 ns |
| SerialiseUuid | 20.942 ns | 0.0891 ns | 0.0790 ns |
| DeserialiseUuid | 96.178 ns | 0.3735 ns | 0.3494 ns |
| ConvertFromUuid | 9.520 ns | 0.0578 ns | 0.0541 ns |

Convert & serialise is 29.767 ns and deserialise & convert is 105.698 ns.

So this approach is much faster and more efficient on the wire than defining the Uuid message with the 6 fields defined by RFC 4122.

@singhbaljit

For UUIDv4, MSB is both positive and negative, while the LSB is always negative. So, shouldn't it be sfixed64?

@bill-poole

fixed64 and sfixed64 are the same on the wire. The only difference is how their bits are interpreted by the sending/receiving endpoints.

In this case, the bits are interpreted by breaking the 64 bits into the components defined by section 4.1.2 of RFC 4122.

That is, the 64 bits are never interpreted as a positive 64-bit integer nor a negative 64-bit integer. Therefore, it is fine to encode it either way.

However, since the sign is semantically irrelevant, I think it’s better to encode as fixed64. It also makes the code that writes the UUID to the message simpler in .NET.
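
A short Python sketch of that point, using the standard `struct` module to stand in for the wire encoding (both types are 8 little-endian bytes on the wire). The bit pattern is an arbitrary example with its top bit set.

```python
import struct

bits = 0xF234567812345678  # arbitrary pattern with the top bit set

# fixed64: the bits read as an unsigned integer.
wire_unsigned = struct.pack("<Q", bits)

# sfixed64: the same 8 bytes read as a signed two's-complement integer.
signed_value = struct.unpack("<q", wire_unsigned)[0]
wire_signed = struct.pack("<q", signed_value)

print(signed_value < 0)              # prints "True"
print(wire_signed == wire_unsigned)  # prints "True"
```

Only the interpretation differs, so the choice matters for how convenient the generated accessors are, not for the bytes sent.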

@AtosNicoS

Are there any plans for implementing this? I guess we have a lot of good examples and speed tests available, so it could be easily integrated.

@perezd
Contributor

perezd commented Jul 29, 2021

We have no plans at this time to integrate this.

@fowles
Member

fowles commented Apr 18, 2022

Most folks I have seen simply use a string for this.

I don't really see a path forward for this. The cost of adding this as a specific well known type is quite high as compared to having a third_party simply package their preferred proto with some helper functions.

@gmabey

gmabey commented Apr 18, 2022

I certainly just use a string (without curly braces) for this myself. The benefits that seem to be implied were such an effort to be undertaken are: speed, security (I guess), and interoperability. The argument that a third_party function could implement this could be used against most of the data types currently supported by WellKnownTypes -- since you could serialize a date time to string and have a third_party function deserialize it.

Please do reply to this message if you dispute any of these points:

  1. UUID is well known.
  2. UUID is well defined.
  3. UUID is very common. (in my world haha)
  4. Protobuf messages would be "more well defined" (sigh, I wish there was a better way to say that) if a first-class (hrmm, I guess WellKnownTypes are second-class citizens) data type existed that standardized serialization/deserialization and conversion to/from platform specific classes (like in python).
  5. Point (4.) would make including UUIDs (as a member of a message) less error prone.

@fowles
Member

fowles commented Apr 18, 2022

I would dispute points (3) and (4).

Point 3: Some quick searches for code search indicates that absl::Time is about 100x more common than our UUID class within google's C++ codebase.

Point 4: Getting something cross language tends to nail down a bunch of painful corners in ways that are not helpful. The WellKnownType for time actually causes frequent impedance mismatches with language bindings that have slightly different concepts of time.

@gmabey

gmabey commented Apr 18, 2022

@fowles I don't dispute your rebuttal to Point 3, but I didn't define "very" -- wahoo! :-D

Regarding your Point 4 rebuttal -- do you see any "painful corners" associated with UUIDs? Or, are you just complaining about corner cases of Google.Protobuf.WellKnownTypes.Timestamp? (if so, wrong thread ;-)

@fowles
Member

fowles commented Apr 18, 2022

I don't know UUIDs particularly well. It is possible they are simpler enough that they won't hit such impedance mismatches. Regardless, point (3) alone is enough for me to continue to feel confident that this doesn't rise to the bar where we want to add it to the core of protobuffer.

@lalomartins

Point 3: Some quick searches for code search indicates that absl::Time is about 100x more common than our UUID class within google's C++ codebase.

Is protobuf meant for Google's use only? 🤔

Quick check on the node.js world: protobufjs has 7,194,982 weekly downloads, UUID has 59,615,024 weekly downloads (and it's only one implementation). UUID is a standard, rfc4122, and its increasing adoption has been doing wonders to increase interoperability and reliability in various areas.

Realistically speaking, for a team starting a new project, the fact that protobuf has no UUID support is more likely to result in the team not using protobuf, than not using UUID.

As for point 4, there would be no impedance mismatches, since it is a standard. Yes, I've been to that place, having code that uses milliseconds since epoch talking to code that uses seconds since epoch, but UUID is UUID, that's the whole point of its existence.

@fowles
Member

fowles commented Apr 18, 2022

protobuf is intended for public use, but Google maintains full ownership of it and its evolution. The flip side is that Google also provides the vast majority of the maintenance cost of it.

As you note, UUIDs are a standard and libraries exist in most languages to parse them to and from strings. I would advise any group that wants to encode them in protobufs to use a bytes field. If you are starting a new project and are unwilling to accept that trade off, that is a totally reasonable choice for you to make.

@perezd
Contributor

perezd commented Apr 18, 2022

Realistically speaking, for a team starting a new project, the fact that protobuf has no UUID support is more likely to result in the team not using protobuf, than not using UUID.

This isn't mutually exclusive. UUIDs can be freely represented as bytes, or encoded as hex values and written to strings (which is where the majority of observed use cases show up).

adding an explicit UUID type I guess provides...validation as the primary feature request? I dunno what else you really need an explicit type here for. JSON doesn't have a UUID type (hell, at least protobuf has bytes which JSON does not) and nobody stops using JSON for lack of "support".

Furthermore, consider the JSON/protobuf interop requirements....for JSON, they'll just end up as a string again, so what have we really done here?

@kibblewhite

Furthermore, consider the JSON/protobuf interop requirements....for JSON, they'll just end up as a string again, so what have we really done here?

Not using JSON in our project, but also you are right in saying that JSON provides no UUID support, but I would like to add that JSON doesn't add any support for data types like date/times/etc...?

From the sounds of things from fowles comment, it seems like a cost thing as Google provides the vast majority of the maintenance costs? Could it just need a financial push in that direction?

Anyways, looking forwards to seeing how this might (or not) resolve in the future. I'll continue to use a string or bytes field with the variable name prepended with Guid for now.
Thanks to everyone for the input, it's been insightful.

@bill-poole

JSON doesn't have a UUID type (hell, at least protobuf has bytes which JSON does not) and nobody stops using JSON for lack of "support".

I don't think the lack of UUID support in JSON is reasonable justification for the lack of support in Protobuf. JSON is string (UTF-8) encoded, while Protobuf is binary-encoded. Therefore, the performance penalty of encoding UUIDs to/from their hex-encoded string representations in JSON is expected and therefore acceptable.

Conversely, Protobuf is binary encoded and therefore there is an expectation that the performance penalties/overheads of encoding/decoding through strings are avoided. For example, integers are sent/received in binary representation in Protobuf, rather than encoded/decoded as UTF-8 strings. Why is that? JSON encodes integers as strings, so why not Protobuf? The reason is performance and efficiency.

JSON and Protobuf are different encodings with different goals and performance characteristics. If that were not the case (i.e., if JSON and Protobuf were completely interchangeable), then why have Protobuf at all? Why doesn't everyone just use JSON instead of Protobuf?

There is an opportunity to encode/decode UUIDs in binary form. In fact UUIDs are really just 128-bit integers. Why should 64-bit integers be encoded in binary but 128-bit integers encoded as hex-encoded strings?

Byte strings and custom UUID message types are both heap-allocated in the code generated by protoc. Messages must be encoded/decoded as these intermediate heap-allocated objects, and then serialised/deserialised. If Protobuf had a well-known type for UUID, then these intermediate heap-allocated objects would no longer be required, and messages could use a "primitive" 128-bit type, which would save the heap allocation and the translation through the intermediate format.

i.e., support for a UUID well-known type would substantially increase serialisation/deserialisation performance of UUID fields. And isn't that a key reason for using Protobuf over JSON? Performance?

@mprimeaux

mprimeaux commented Apr 20, 2022

I do agree with @bill-poole in that validation and performance are the primary drivers for our teams in having native UUID type support in protobuf.

Offering a comparison to JSON as a reason for not providing native UUID support in protobuf, I feel, has conflated this conversation a bit.

@perezd
Contributor

perezd commented Apr 20, 2022

There is an opportunity to encode/decode UUIDs in binary form. In fact UUIDs are really just 128-bit integers. Why should 64-bit integers be encoded in binary but 128-bit integers encoded as hex-encoded strings?

FWIW, I just did a scan of Google's internal protos and all fields named "uuid" I've observed are encoded as string or bytes. If this has been good enough for all of Google, I am really wondering if the performance wins we're claiming here are a red herring?

Further, what's preventing folks from making message types that encode this as a pair of sfixed64 numbers? I think this would also mitigate the allocation concerns, no?

@mprimeaux

mprimeaux commented Apr 20, 2022

Again, I think this is conflating the discussion.

It’s not about “is this good enough for Google and therefore good enough for the broader community” but more a question of efficiency in terms of (IMHO) network and memory serialization/deserialization.

Think “durable storage technologies” and why they have native support for UUID types. Optimization.

While I am sure that Google is very conscious of optimization, their search infrastructure is less resource limited than many companies.

Contrast this to other domains where we are even more attentive to resource constraints. In particular edge IoT, AR/VR for telemedicine, public transportation, defense, etc.

My kind request is to not focus on what Google does but to focus on the broader scientific benefit.

@bill-poole

I agree with @mprimeaux.

FWIW, I just did a scan of Google's internal protos and all fields named "uuid" I've observed are encoded as string or bytes. If this has been good enough for all of Google, I am really wondering if the performance wins we're claiming here are a red herring?

I guess it depends on how much of a performance penalty would be deemed by Google to be sufficient to warrant doing something about it. How much slower is too much slower? I did benchmarking a while back for encoding UUIDs as strings and byte arrays in Protobuf and posted the results earlier in this thread (see #2224 (comment)).

Further, what's preventing folks from making message types that encode this as a pair of sfixed64 numbers? I think this would also mitigate the allocation concerns, no?

Nothing, which is what I did and turned out to be the fastest option available without defining a well-known type that can be serialised/deserialised directly from/to a "primitive" UUID type (see #2224 (comment)). The point is that it requires converting through an intermediate heap-allocated type (with the two fixed64 integers), which is much slower than it would otherwise be to serialise/deserialise directly from/to a "primitive" UUID type.

@ghost

ghost commented May 10, 2022

I code in .NET 4.8, where Guid does not have the guid.TryWriteBytes(bytes) method. What is the solution to this problem?

@bill-poole

bill-poole commented May 13, 2022

One way would be to use the Guid.ToByteArray method instead of the Guid.TryWriteBytes(Span<byte>) method. However, that will heap-allocate an array each time you invoke it, which will create more GC pressure.

You could instead try defining your own DecodedGuid struct, which is decorated with [StructLayout(LayoutKind.Explicit)] and has:

  • a Guid field decorated with [FieldOffset(0)]; and
  • two ulong fields Low64 and High64 decorated with [FieldOffset(0)] and [FieldOffset(8)] respectively.

When you instantiate a DecodedGuid struct, the Guid constructor parameter will then be written to the Guid field, and the low and high 64-bit unsigned integer components can be read from the Low64 and High64 fields.

If this works (and I haven't confirmed that it does/will), then it will be faster than using the Guid.ToByteArray method because the Guid contents will be copied into a stack-allocated DecodedGuid, rather than a heap-allocated array.

You should then be able to populate the High64 and Low64 fields of the Uuid Protobuf instance as per #2224 (comment) from the High64 and Low64 fields of the DecodedGuid struct.

@JamesOldfield

JamesOldfield commented Jul 14, 2022

One thing I haven’t seen spelt out here is the relative sizes of the different options, including the overhead of having the data in a child message which would be needed for a well-known type. For comparison (and fun) I also included a hypothetical native 128-bit fixed width type (which could be added as there are still 3 possible wire type numbers left!).

| What | Size (bytes) | Calculation |
|---|---|---|
| Native 128-bit field | 17 | 1 tag + 16 payload |
| Native 128-bit (in message) | 19 | 1 tag + 1 size + (1 tag + 16 payload) |
| Bytes field | 18 | 1 tag + 1 size + 16 payload |
| Bytes (in message) | 20 | 1 tag + 1 size + (1 tag + 1 size + 16 payload) |
| String field | 38 | 1 tag + 1 size + 36 payload |
| String (in message) | 40 | 1 tag + 1 size + (1 tag + 1 size + 36 payload) |
| Two fixed64 (in message) | 20 | 1 tag + 1 size + 2 × (1 tag + 8 payload) |

By the way, JSON was brought up almost as an argument against well known type for UUID. But actually I see it as the strongest reason in favour of UUID well-known type. The trade off is between:

  • string: nice JSON string representation but huge binary encoding (38 bytes) – given the great pains protobuf goes to for binary compactness (varints, zigzag encoding, and packed tags), it doesn't make sense to use something so wasteful
  • bytes: compact binary encoding (18 bytes) but bonkers JSON representation (e.g. UUID "12345678-1234-5678-1234-567812345678" becomes "EjRWeBI0VngSNFZ4EjRWeA==")
  • WKT based on bytes: decent length binary encoding (20 bytes) and nice JSON string representation
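
The JSON side of this trade-off is easy to reproduce. The protobuf JSON mapping renders a `bytes` field as standard base64, so a small Python sketch with the standard `base64` and `uuid` modules shows both renderings of the same identifier (the variable names are illustrative only):

```python
import base64
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")

# What a `string` field would show in JSON: the canonical text form.
json_for_string_field = str(u)

# What a `bytes` field would show in JSON: base64 of the 16 raw bytes.
json_for_bytes_field = base64.b64encode(u.bytes).decode("ascii")

print(json_for_string_field)  # prints "12345678-1234-5678-1234-567812345678"
print(json_for_bytes_field)   # prints "EjRWeBI0VngSNFZ4EjRWeA=="
```

A well-known type could keep the 16-byte binary payload while defining the hyphenated text form as its JSON representation, which is the combination the third bullet describes.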

I haven’t included the message with separate RFC 4122 fields as it’s misguided in my view (and, with all those varints, would be a nightmare for me to compute the size). I also dispute the snippet above where various bits of two 64-bits numbers are sliced up with different endianness. Both of those fail to recognise that a UUID is simply a sequence of 16 bytes and nothing more. There is no possible endianness issue with that. It could have been generated as multiple fields by RFC 4122, in which case care must be taken with endianness when converting those fields to or from the byte sequence, but that’s not the serialisation layer’s problem.

@bill-poole

I haven’t included the message with separate RFC 4122 fields as it’s misguided in my view

I agree and found there was a significant performance penalty for doing so.

Both of those fail to recognise that a UUID is simply a sequence of 16 bytes and nothing more. There is no possible endianness issue with that.

There can actually be endianness issues with UUIDs. Microsoft frameworks (e.g. .NET) tend to represent UUIDs in little endian format in memory, whereas RFC 4122 recommends big endian binary representation for network transmission. I think big endian representation is therefore the correct/best representation for Protobuf, but it means that Microsoft frameworks like .NET need to convert between little endian and big endian (which can be done with a SIMD shuffle instruction).
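
The two byte layouts can be seen side by side in Python, whose standard `uuid` module exposes both: `bytes` is the RFC 4122 big endian order, and `bytes_le` stores the first three fields little endian, which matches the layout the C# snippets earlier in this thread read from a Guid. The helper `swap_guid_layout` is a hypothetical name for the conversion between the two; it is a plain byte shuffle, not the SIMD version.

```python
import uuid


def swap_guid_layout(b: bytes) -> bytes:
    """Convert 16 bytes between RFC 4122 order and the mixed-endian layout.

    Reverses time_low (4 bytes), time_mid (2 bytes) and
    time_hi_and_version (2 bytes); the last 8 bytes are unchanged.
    The operation is its own inverse.
    """
    if len(b) != 16:
        raise ValueError("a UUID is exactly 16 bytes")
    return b[3::-1] + b[5:3:-1] + b[7:5:-1] + b[8:]


u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
big_endian = u.bytes       # RFC 4122 / network order
mixed_endian = u.bytes_le  # first three fields byte-swapped

print(big_endian.hex())    # prints "00112233445566778899aabbccddeeff"
print(mixed_endian.hex())  # prints "33221100554477668899aabbccddeeff"
```

Whichever layout a `bytes`-based message adopts, it has to be stated in the schema documentation, because both layouts are valid 16-byte strings and a receiver cannot tell them apart.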

Note that I posted the results of performance testing on .NET for bytes versus two-fixed64 representations earlier in this issue.

My performance testing for little endian bytes representation on .NET here:

  • Convert and serialise: 89.666 ns
  • Deserialise and convert: 219.624 ns

My performance testing for big endian two-fixed64 representation on .NET here:

  • Convert and serialise: 29.767 ns
  • Deserialise and convert: 105.698 ns

Based on the above results, the bytes representation is much slower than the two-fixed64 representation.

I imagine that a WKT based on a native 128-bit field would be the simplest and most performant representation. However in the absence of a native 128-bit type, I think a two-fixed64 WKT is best due to its performance advantage over the bytes representation.

@JamesOldfield

@bill-poole

But those tests are for C# / .Net, as you said. For UUID to be a WKT it has to make sense for all languages, and actually C# is one of the less used languages for protobuf (and certainly not why I'm here). People picking apart nanosecond-level performance are more likely to be using C++.

Also, those tests assume that every protobuf UUID field will be converted to the language's native UUID type when deserialised, but I think this would be a small minority usage. Much of the time it would just be used directly as a byte array, regardless of language. Obviously, having the data already in bytes format is most convenient for this. The two comments above about Google's code seem to support this. None of the comments here, except yours, have focused on conversion to C#'s GUID type, especially performance of it. I do agree conversion methods should exist (in all languages where they make sense), but they shouldn't be the focus of the discussion.

Using two int64 members would be super confusing - you've basically invented your own new representation for UUIDs, and the existing selection is already confusing enough!

It sounds like I'm backtracking in my support for a UUID well known type - why don't I just use bytes if that's what I want? But, like I said in my previous comment, a well known type is still useful because it allows you to effectively communicate that this field is a UUID (rather than just code comment saying so, or your own custom UUID message) and it gives you the standard JSON string representation.

@bill-poole

@JamesOldfield, I provided the performance testing results for .NET because those were the results I had previously posted on this topic that I thought were relevant to what you said. I would be very interested to see how the performance compares between the various options in C++ and other platforms. I very much encourage that testing to be done.

I hypothesise that a similar performance difference between the options will be seen across multiple platforms. If it turns out that bytes WKTs are really slow in .NET for some reason (compared to other platforms), then I expect that would provide strong motivation for the .NET implementation to improve its performance in this area.

At the very least, I don't think it would be prudent to assume that a bytes representation is faster or as fast as a two fixed64 field representation in C++ (or any other platform) without doing the requisite performance testing.

Using two int64 members would be super confusing - you've basically invented your own new representation for UUIDs

I don't think that's true. RFC 4122 specifies the 128-bit layout, and every 128-bit value comprises a high 64-bit and low 64-bit value. i.e., the only complexity a two fixed64 field representation introduces is the concept of a 128-bit value being decomposed into a high 64-bit value and a low 64-bit value.

harlem88 added a commit to harlem88/astarte-message-hub that referenced this issue Sep 21, 2022
At this time doesn't exist a native support to UUID type,
so we used a string type.
[Issue]: protocolbuffers/protobuf#2224

Signed-off-by: Antonio Gisondi <antonio.gisondi@secomind.com>
@minesworld

minesworld commented Feb 18, 2023

Thanks for this thread of discussion providing real-world solutions for people solving real-world problems in the here-and-now. At least I don't have to wait for the governance of protobuf to get down to earth... BTW: it looks like I will use the 2 x fixed64 solution ( #2224 ), whose principles I had tended towards before going deeper into the protobuf rabbit-hole of official definitions... Thanks bill-poole for providing that and showing its performance.

@AbdulRehman-z

What.... an open issue since 2k16

@mprimeaux

mprimeaux commented Oct 10, 2023

We use v4 UUIDs quite a bit in our AI / ML "workloads" and literally every bit counts "over the wire and on disk", which I think @bill-poole addressed in part in his experiments above.

There's been spirited conversation with a diversity of positions, which has made for a healthy set of discussions. My sincere hope is we find a way forward to have this supported as an intrinsic type.

@BoysheO

BoysheO commented Nov 13, 2023

What the fxxk? This issue continues in 2023?? Google must support UUID/GUID! We all want an official solution, not endless discussions!

@alikleit

ah open since 2016, hope 2024 gives this a kick somewhere...

@AbdulRehman-z

ah open since 2016, hope 2024 gives this a kick somewhere...

In our Dreams

@zs-dima

zs-dima commented Jan 13, 2024

ah open since 2016, hope 2024 gives this a kick somewhere...

Waiting for the anniversary
