Discussion on CloudEvent data and transcoding #1204

Open
jskeet opened this issue May 10, 2023 · 19 comments

Comments

@jskeet
Contributor

jskeet commented May 10, 2023

This "issue" is to record discussions/thoughts on the nature of CloudEvent data, in the hope that it will help us to resolve #1186.

Each example is numbered for ease of reference later. The terms "data" and "payload" are used interchangeably; having both terms available is sometimes helpful for disambiguation.

What is the data/payload of a CloudEvent?

The spec is deliberately hands-off about the nature of CloudEvent data:

As defined by the term Data, CloudEvents MAY include domain-specific information about the occurrence. When present, this information will be encapsulated within data.

  • Description: The event payload. This specification does not place any restriction on the type of this information. It is encoded into a media format which is specified by the datacontenttype attribute (e.g. application/json), and adheres to the dataschema format when those respective attributes are present.

(As an aside, "encapsulated within data" doesn't really mean much now that data isn't an attribute. We should do some clean-up at some point.)

So the payload of a CloudEvent is generally opaque. It must be representable as a sequence of bytes, in order to be represented in binary mode in HTTP at least. ("Binary mode" doesn't define what a "message body" is, but in HTTP we need to be able to encode it as bytes. "Message bodies" for other transports may require text, which presumably means that binary mode for those transports has to specify how non-text CloudEvent data would be represented in the body.)

While it may sound like a truism that "data must be representable as a sequence of bytes", it's not entirely straightforward, as it requires that a serialized representation be chosen. When a CloudEvent is created within an event producer, the data that's intended to be represented (for example "an object in memory") may well not have a single "natural" serialized form. (There may be multiple representations available, or one may need to be created just for the purpose of encoding the data as a CloudEvent.)
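
As a tiny illustration of that point (Python, purely hypothetical data), the same in-memory value can be serialized in more than one way, and neither form is more "natural" than the other:

import json

# An in-memory value inside an event producer...
order = {"id": 42, "total": 7.5}

# ...has no single "natural" byte representation: the producer has to pick one.
as_json = json.dumps(order).encode("utf-8")
as_xml = f'<order id="{order["id"]}" total="{order["total"]}"/>'.encode("utf-8")

# Whichever form is chosen becomes the CloudEvent data, and datacontenttype
# (application/json vs application/xml here) records that choice.
print(as_json)  # b'{"id": 42, "total": 7.5}'
print(as_xml)   # b'<order id="42" total="7.5"/>'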

There's also room for some interpretation when it comes to "It is encoded into a media format which is specified by the datacontenttype attribute". What guarantees/constraints are present in terms of the validity of that encoding?

CloudEvent validity

Let's consider the following structured-mode event in the JSON format, which is changed only very slightly from an example in the JSON format spec:

Example 1: Invalid XML in JSON-formatted event

{
    "specversion" : "1.0",
    "type" : "com.example.someevent",
    "source" : "/mycontext",
    "id" : "B234-1234-1234",
    "time" : "2018-04-05T17:31:00Z",
    "comexampleextension1" : "value",
    "comexampleothervalue" : 5,
    "unsetextension": null,
    "datacontenttype" : "application/xml",
    "data" : "<mixer manufacturer=\"Allen & Heath\" />"
}

The same event could certainly be represented in binary mode, e.g. in HTTP, where the message body would be the UTF-8-encoded bytes of:

<mixer manufacturer="Allen & Heath" />

Is this a valid CloudEvent? Should it be accepted or rejected by processors?

It's fine in every way except one: the data isn't valid for the declared content type, because the & isn't escaped.
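
Concretely, a strict XML parse of that payload fails on the unescaped ampersand; the snippet below is just an illustration, using Python's standard library:

import xml.etree.ElementTree as ET

payload = '<mixer manufacturer="Allen & Heath" />'

try:
    ET.fromstring(payload)   # the event declares datacontenttype application/xml
except ET.ParseError as e:
    # The unescaped '&' makes the document not well-formed, so a strict
    # data-processor could reasonably reject the whole CloudEvent here.
    print(f"data is not valid application/xml: {e}")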

Jon's opinion: this is valid at the "CloudEvents spec" level, but invalid at a data-processor level. It would be reasonable for a CloudEvent processor which tried to use the data to reject it.

Rationale:

  • We shouldn't require data validation in processors:
    • In many cases it would be prohibitively expensive
    • In some cases it may be infeasible:
      • The data content type may be some new content type that the processor isn't aware of. (It may even be private.)
      • The data may be encrypted, so that only after decryption could you validate the "real" data, and the processor may not be able to decrypt.
  • We shouldn't prohibit data validation in processors:
    • A processor may reasonably require valid data, such that it can't carry on in the face of invalid data. While a processor could just ignore the invalid CloudEvent, that's usually going to lead to a worse (and harder-to-diagnose) outcome than rejecting it.

There are many ways in which event data may be invalid in an application-specific way, beyond the content-type-level validity shown above. (Imagine database constraints being violated, for example.)

Note that this is entirely separate from spec-level-invalid or format-level-invalid CloudEvents.

Example 2: Invalid JSON-formatted event (empty id)

{
    "specversion" : "1.0",
    "type" : "com.example.someevent",
    "source" : "/mycontext",
    "id" : "",
    "time" : "2018-04-05T17:31:00Z",
    "comexampleextension1" : "value",
    "comexampleothervalue" : 5,
    "unsetextension": null,
    "datacontenttype" : "application/xml",
    "data" : "<mixer />"
}

Example 3: Invalid JSON-formatted event (invalid JSON)

{
    "specversion" : "1.0",
    "type" : "com.example.someevent",
    "source" : "/mycontext",
    "id" : "B234-1234-1234",
    "time" : "2018-04-05T17:31:00Z",
    "comexampleextension1" : "value",
    "comexampleothervalue" : 5,
    "unsetextension": null,
    "datacontenttype" : "application/xml",
    "data" : { ]
}

Example 4: Invalid JSON-formatted event (valid JSON, invalid data type for an extension attribute)

{
    "specversion" : "1.0",
    "type" : "com.example.someevent",
    "source" : "/mycontext",
    "id" : "B234-1234-1234",
    "time" : "2018-04-05T17:31:00Z",
    "comexampleextension1" : "value",
    "comexampleothervalue" : 5,
    "unsetextension": null,
    "datacontenttype" : "application/xml",
    "invalidextension": [ "Arrays aren't valid" ],
    "data" : "<mixer />"
}

What is being represented?

The upshot of all of the above is that in general, the CloudEvents spec does not have a firm opinion about what the data in a CloudEvent "means". However, in structured mode, event formats effectively become opinionated about the meaning of data in some cases.

For example:

  • The protobuf event format distinguishes between text data, binary data, and protobuf messages
  • The XML event format distinguishes between text data, binary data, and XML elements
  • The JSON event format distinguishes between binary data and "any other data, which must be representable as a JSON value even if the content-type isn't JSON"

Transcoding

It is desirable to be able to convert a CloudEvent from one representation to another, e.g. "structured JSON to structured XML",
"structured protobuf to binary" or "binary to structured JSON". This is where the difference in "opinion" causes problems.

The datacontenttype of the CloudEvent is basically the only common information that can be used to inform transcoding - and it feels reasonable for it to do so. Let's look at some examples to try to agree on correct behavior.

Note that this section is not intended to impose constraints on SDKs. Some SDKs may support explicit transcoding operations; others may effectively do so by "decode in one format, then encode the result in a different format" - though that may lead to issues. But until we agree on what the right result of transcoding is, it's hard to work out how it should appear in SDKs.

Example 5: Transcoding binary to structured JSON, no content type

HTTP request (the body after the blank line is UTF-8-encoded text, but nothing in the request headers says so - there is no content type header):

ce-id: example5
ce-type: someevent
ce-specversion: 1.0
ce-source: /example

{ "name": "test" }

Option 5a: no inference

{
    "specversion": "1.0",
    "type": "someevent",
    "source": "/example",
    "id": "example5",
    "data_base64": "eyAibmFtZSI6ICJ0ZXN0IiB9"
}

Option 5b: infer text

{
    "specversion": "1.0",
    "type": "someevent",
    "source": "/example",
    "id": "example5",
    "data": "{ \"name\": \"test\" }"
}

Option 5c: infer JSON

{
    "specversion": "1.0",
    "type": "someevent",
    "source": "/example",
    "id": "example5",
    "data": { "name": "test" }
}

Relevant parts of the JSON format spec:

If the implementation determines that the type of data is Binary, the value MUST be represented as a JSON string expression containing the Base64 encoded binary value

...

If the datacontenttype is unspecified, processing SHOULD proceed as if the datacontenttype had been specified explicitly as
application/json.

Without any content type at all, should the implementation determine that the type of data is Binary?

Jon's opinion: really unclear; either 5a or 5c seems reasonable.

Corollaries:

  • What would we expect to do if it's not UTF-8-encoded text at all, e.g. binary data with no content type?
  • What would we expect to do if it's UTF-8-encoded text, but not valid JSON?
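
As an illustration of how options 5a and 5c differ in code, here is a purely hypothetical transcoder sketch (not a proposal for any SDK's API; all names are made up):

import base64
import json

def binary_http_to_structured_json(headers: dict, body: bytes, infer_json: bool = False) -> str:
    """Hypothetical transcoder: binary-mode HTTP -> structured JSON (options 5a / 5c)."""
    event = {name[3:]: value for name, value in headers.items() if name.startswith("ce-")}

    if infer_json and "content-type" not in headers:
        # Option 5c: treat the missing datacontenttype as application/json.
        event["data"] = json.loads(body)
    else:
        # Option 5a: no inference; keep the payload as an opaque byte sequence.
        event["data_base64"] = base64.b64encode(body).decode("ascii")
    return json.dumps(event)

headers = {"ce-id": "example5", "ce-type": "someevent", "ce-specversion": "1.0", "ce-source": "/example"}
print(binary_http_to_structured_json(headers, b'{ "name": "test" }'))                   # option 5a
print(binary_http_to_structured_json(headers, b'{ "name": "test" }', infer_json=True))  # option 5c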

Example 6: Transcoding from structured protobuf to structured JSON, for JSON content

This is the example at the heart of #1186.

Note: this uses the JSON representation of the protobuf message. Don't get confused between the two!

Protobuf-format:

{
    "specVersion": "1.0",
    "type": "someevent",
    "source": "/example",
    "id": "example6",    
    "textData": "{ \"name\": \"test\" }",
    "attributes": {
      "datacontenttype": { "ceString": "application/json" }
    }
}

Transcoded JSON format:

Option 6a: (data is a JSON object)

{
    "specversion": "1.0",
    "type": "someevent",
    "source": "/example",
    "id": "example6",
    "data": { "name": "test" }
}

Option 6b: (data is a JSON string)

{
    "specversion": "1.0",
    "type": "someevent",
    "source": "/example",
    "id": "example6",
    "data": "{ \"name\": \"test\" }"
}

Jon's opinion: Here the data content type says it's JSON, so it's reasonable for the transcoding operation to end up with a JSON object as the result (so option 6a). (Note that option 6b is what at least some SDKs will come out with at the moment.)

  • Corollaries:
    • If we want to end up with "data": "hello" in the JSON format, we'd need "textData": "\"hello\"" in the protobuf format
    • If datacontenttype hadn't been specified, would the result be the same? (The "assume it's JSON" part would be separated from the original event creation, only occurring at transcoding time...)
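
A sketch of what option 6a requires in code: the JSON format implementation notices that datacontenttype declares JSON and parses the carried text instead of re-quoting it. This operates on the JSON representations shown above (not the protobuf wire form), and isn't any SDK's actual API:

import json

def proto_json_to_json_format(proto_event: dict) -> str:
    """Hypothetical transcoding of the protobuf format's JSON representation (option 6a)."""
    attributes = proto_event.get("attributes", {})
    datacontenttype = attributes.get("datacontenttype", {}).get("ceString")

    event = {
        "specversion": proto_event["specVersion"],
        "type": proto_event["type"],
        "source": proto_event["source"],
        "id": proto_event["id"],
    }
    text = proto_event.get("textData")
    if text is not None and datacontenttype == "application/json":
        event["data"] = json.loads(text)   # option 6a: a real JSON object, not a quoted string
    elif text is not None:
        event["data"] = text               # non-JSON content types: pass the text through
    return json.dumps(event)

print(proto_json_to_json_format({
    "specVersion": "1.0", "type": "someevent", "source": "/example", "id": "example6",
    "textData": "{ \"name\": \"test\" }",
    "attributes": {"datacontenttype": {"ceString": "application/json"}},
}))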

Example 7: Transcoding from structured JSON to binary (numeric data)

Initial JSON-formatted event:

{
    "specversion": "1.0",
    "type": "someevent",
    "source": "/example",
    "id": "example7",
    "datacontenttype": "application/xml",
    "data": 7.50
}

What should the binary mode encoding of this event be?

Option 7a: Encode as text

Treat the value as "it's a number, let's just encode it as text". This leads to further questions of:

  • Which text encoding should we use?
  • Which locale should we use? ("7.5" vs "7,5"?)
  • What precision should we use? ("7.5" vs "7.50" - or for integers, "7" or "7.0")

Option 7b: Fail to deserialize from JSON

The JSON format spec states:

If the datacontenttype does not declare JSON-formatted data content, then the data member SHOULD be treated as an encoded content string. An implementation MAY fail to deserialize the event if the data member is not a string, or if it is unable to interpret the data with the datacontenttype.

Jon's opinion: option 7b seems safe and reasonable here.

Example 8: Transcoding from structured JSON to binary (text data)

Initial JSON-formatted event:

{
    "specversion": "1.0",
    "type": "someevent",
    "source": "/example",
    "id": "example8",
    "datacontenttype": "application/xml",
    "data": "Not XML"
}

What should the binary mode encoding of this event be?

Option 8a: Encode as text

We've got text, we can encode it that way, only needing to choose the encoding. (It's probably reasonable to assume UTF-8, but we should document that.)

Option 8b: Fail to serialize as it's invalid XML

Either initial deserialization could fail, or serialization to binary mode (if that's a separate step) could fail, because "Not XML" is not a valid XML document.

Option 8c: Coerce into valid XML

An implementation could transcode the data into some made-up element name, e.g. <event>Not XML</event>.

Jon's opinion: option 8a seems appropriate here. The JSON format has no knowledge of XML, and it should only concern itself with content types it actually knows about. (As for option 8c... please no!)
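
Taking Examples 7 and 8 together, options 7b and 8a amount to one small rule in the JSON-format code: a string passes through as an encoded content string (assuming UTF-8), and anything else is rejected when the content type doesn't declare JSON. A hypothetical helper, not any SDK's actual API:

import json

def data_member_to_body(data, datacontenttype) -> bytes:
    """Sketch of options 7b and 8a for the JSON format's "data" member."""
    declares_json = datacontenttype is None or "json" in datacontenttype.lower()   # rough check
    if declares_json:
        return json.dumps(data).encode("utf-8")   # JSON content: re-serialize as JSON text
    if isinstance(data, str):
        return data.encode("utf-8")               # option 8a: encoded content string, assume UTF-8
    raise ValueError(                             # option 7b: MAY fail to deserialize
        f"cannot interpret {type(data).__name__} data as {datacontenttype}")

print(data_member_to_body("Not XML", "application/xml"))   # Example 8 -> b'Not XML'
try:
    data_member_to_body(7.50, "application/xml")           # Example 7 -> rejected
except ValueError as e:
    print(e)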

Example 9: Transcoding from structured protobuf to binary, protobuf message data

Protobuf-format:

{
    "specVersion": "1.0",
    "type": "someevent",
    "source": "/example",
    "id": "example6",    
    "protoData": {
      "@type": "type.googleapis.com/google.profile.Person",
      "firstName": "Jem",
      "lastName": "Day",
    },
    "attributes": {
      "datacontenttype": { "ceString": "application/protobuf" }
    }
}

Option 9a: serialize the Any

Just serialize the value of the proto_data field.

Option 9b: serialize the data

Just use the proto_data.value field (which is already a bytes value).

Jon's opinion: 9b is consistent with normal protobuf transports, where the message type is effectively part of a side-channel, e.g. implicit in the RPC being invoked via gRPC. On the other hand, losing data always feels odd.
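
The difference between 9a and 9b is just which bytes end up in the body: the serialized Any envelope (type_url plus value), or only the value field. A sketch with a stand-in message type, since google.profile.Person isn't generally available (requires the Python protobuf package):

from google.protobuf.any_pb2 import Any
from google.protobuf.timestamp_pb2 import Timestamp

person = Timestamp(seconds=981173106)      # stand-in for the google.profile.Person message
proto_data = Any()
proto_data.Pack(person)                    # sets proto_data.type_url and proto_data.value

body_9a = proto_data.SerializeToString()   # option 9a: the whole Any (type_url + value)
body_9b = proto_data.value                 # option 9b: just the packed message bytes

# 9b matches "normal" protobuf transports, where the type travels out of band;
# 9a keeps the type_url, so it carries more than a plain application/protobuf consumer expects.
assert body_9a != body_9b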

What's next?

After discussion of the right result of these examples of transcoding, we can work out the implications for specs and SDKs. Expected changes:

  • Some format specs may need to be clarified
  • Some coverage of "what does a format do" may need to be clarified
  • The SDK documentation in this repo probably needs to be clarified with expectations
  • SDKs may wish to provide explicit transcoding operations
  • SDKs may wish to change behavior or offer alternative behavior (depending on compatibility requirements)
@jskeet
Contributor Author

jskeet commented May 10, 2023

@duglin: If you could add this to the agenda for tomorrow's meeting, that would be useful. I hope this write-up helps...

@duglin
Collaborator

duglin commented May 10, 2023

re: Example 5 section:

Jon's opinion: really unclear; either 5a or 5c seems reasonable.

If you're ok with 5a then I think you have to be ok with 5b too since I believe they're basically the same - both have data as an array of bytes. To me it all depends on how the receiver of the binary CE interpreted the HTTP body - meaning, either as JSON (5c) or as an array of bytes (5a, or 5b). All are correct and valid choices w/o more info.

What would we expect to do if it's not UTF-8-encoded text at all, e.g. binary data with no content type?
What would we expect to do if it's UTF-8-encoded text, but not valid JSON?

I'm not sure this changes much for 5a and 5b since both treat it as bytes so whether we use data or data_base64 depends on whether there are binary chars in there, or not. 5c is still a valid choice to try but it might fail because it's not valid JSON. So if the transcoder demands valid JSON then it should fail. If it doesn't demand JSON then it needs to be smart and determine that invalid JSON would need to use 5a or 5b. IOW, it needs to pick the most appropriate serialization based on the data. Which isn't really that different from checking to see if we need to use data_base64 or not.
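
(For illustration, that "does it need data_base64?" check is roughly the following - a sketch only, not an SDK API:)

import base64

def json_data_member(body: bytes) -> dict:
    """Pick "data" for UTF-8 text payloads and "data_base64" for anything else."""
    try:
        return {"data": body.decode("utf-8")}
    except UnicodeDecodeError:
        return {"data_base64": base64.b64encode(body).decode("ascii")}

print(json_data_member(b'{ "name": "test" }'))   # text -> "data" (5b-style)
print(json_data_member(b"\x00\xff\xfe"))         # binary -> "data_base64" (5a-style)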

re: Example 6 section:

Jon's opinion: Here the data content type says it's JSON, so it's reasonable for the transcoding operation to end up with a JSON object as the result (so option 6a). (Note that option 6b is what at least some SDKs will come out with at the moment.)

I think this is mostly correct. I view it this way:

  • the datacontenttype is really only used when trying to examine (parse) data regardless of how the data is serialized or which attribute it actually appears in (data vs data_base64)
  • so the input on 6 is saying that data is an array of bytes that happens to look like JSON, and if the receiver wants to parse/validate it then it should do so with a JSON parser because of the "datacontenttype" value of "app/json".
  • therefore, 6b is not correct IFF the receiver interpreted "data" as JSON - which it should due to "datacontenttype" being there - because the JSON spec says JSON objects should not be serialized as strings. However, if the receiver ignores "datacontenttype" (which it can do since it's optional), and instead treats it as an array of bytes then 6b becomes valid, but not expected.
  • Net: it's almost an implementation choice because we can't force someone to examine the "datacontenttype" attribute and use it as part of its transcoding.
  • However, we could choose to be smart about this and say: if the transcoder supports a format (JSON in this case, and we know this because it's going to output JSON), then it MUST therefore understand a "datacontenttype" of that same format, so it knows it's not just an array of bytes, it's a serialized JSON object and therefore it needs to convert it into a JSON object and serialize it as such - not as an array of bytes.

re: Example 7 section:

  • I agree we should probably fail it due to 7b IFF the transcoder examines the datacontenttype. Else, it should result in 7.50 in the HTTP body to try to match the inputs as closely as possible.

re: Example 8 section:

  • I'm leaning towards 8a as well, which would be consistent with results for section 7 for the case where the datacontenttype is ignored but passed along.

re: Example 9 section:

  • I don't grok it - where is the proto_data.value field?

Net of all of this:

  • I'm not seeing anything wrong with the specs yet
  • I think a transcoder needs to decide how smart it wants to be, but CE can't mandate it
  • maybe the SDKs need to have the notion of "rawData" (just the bytes, please) and "parsedData" (a language-specific object representing the data, when the SDK understands the datacontenttype). This could then be used for both input and output processing - see the sketch below.
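
A hypothetical shape for that rawData/parsedData idea (made-up names, not any existing SDK's API), just to make the suggestion concrete:

import json

class EventData:
    """Sketch of the "rawData" / "parsedData" split: bytes always, an object when possible."""

    def __init__(self, raw_data: bytes, datacontenttype=None):
        self.raw_data = raw_data               # "just the bytes please"
        self.datacontenttype = datacontenttype

    def parsed_data(self):
        """A language-specific object when the datacontenttype is understood, otherwise None."""
        if self.datacontenttype is not None and self.datacontenttype.endswith("json"):   # rough check
            return json.loads(self.raw_data)
        return None

data = EventData(b'{ "name": "test" }', "application/json")
print(data.raw_data)       # input/output processing can always fall back to the bytes
print(data.parsed_data())  # {'name': 'test'} because the content type is understood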

@jskeet
Contributor Author

jskeet commented May 10, 2023

I don't grok it - where is the proto_data.value field?

proto_data is of type Any, which has a value field. Happy to go into more detail in the meeting :)

@duglin
Collaborator

duglin commented May 25, 2023

on the 5/25 call @jskeet agreed to write up a summary/proposed next steps, or perhaps a PR... to try to focus the discussion

@jskeet
Contributor Author

jskeet commented Jun 1, 2023

Okay, having read through this again to try to summarize it:

  • I actually think the detail is really important; I don't think anyone who tries to come to a conclusion based just on this comment without reading the rest of what's above is going to be able to contribute usefully. Sorry :(
  • We should work out whether transcoding "is a thing" - importantly, when Doug mentioned "transcoder" above I've inferred that he expects that to be an SDK-level type; I'm expecting transcoding to be more along the lines of "deserialize, maybe change some things, serialize with a different format" without any part of the SDK knowing the "bigger picture". And it's that lack of extra context that causes some of the issues.

My personal "grand conclusion" is that data formats probably should be able to insist that implementations are aware of some data content types and handle them in a particular way, e.g. requiring that if a JSON format is presented with text data and told it's JSON, it should parse it as JSON. That will be a breaking change for many SDKs, I suspect, so we need to really think about it carefully.

There is one thing that I think we can discuss concretely:

However, if the receiver ignores "datacontenttype" (which it can do since it's optional)

I disagree with this. Just because an attribute is optional doesn't mean a receiver should be able to ignore it if it's provided. The JSON format explicitly talks about what should happen if the content type is indicated to be JSON - so I'd argue that any SDK receiver which implements the JSON format but ignores the content type is violating the spec.

Maybe that's a good place to start, in order to chip away at this...

@duglin
Collaborator

duglin commented Jun 1, 2023

From our specs:


For clarity, when a feature is marked as "OPTIONAL" this means that it is OPTIONAL for both the Producer and Consumer of a message to support that feature. In other words, a producer can choose to include that feature in a message if it wants, and a consumer can choose to support that feature if it wants. A consumer that does not support that feature is free to take any action it wishes, including no action or generating an error, as long as doing so does not violate other requirements defined by this specification. However, the RECOMMENDED action is to ignore it. The producer SHOULD be prepared for the situation where a consumer ignores that feature. An Intermediary SHOULD forward OPTIONAL attributes.


Rereading some of the previous comments, I'm still thinking part of the solution might be what I said:


However, we could choose to be smart about this and say: if the transcoder supports a format (JSON in this case, and we know this because it's going to output JSON), then it MUST therefore understand a "datacontenttype" of that same format, so it knows it's not just an array of bytes, it's a serialized JSON object and therefore it needs to convert it into a JSON object and serialize it as such - not as an array of bytes.


and we could write this guidance in a generic way to handle any format.

@jskeet
Contributor Author

jskeet commented Jun 1, 2023

Eek... I hadn't noticed that aspect of "optional" before. I think that was a mistake :( Optionality of understanding/processing and optionality of including should be entirely separated IMO. Too late now...

@duglin
Collaborator

duglin commented Jun 1, 2023

https://datatracker.ietf.org/doc/html/rfc2119#section-5
5. MAY This word, or the adjective "OPTIONAL", mean that an item is
truly optional. One vendor may choose to include the item because a
particular marketplace requires it or because the vendor feels that
it enhances the product while another vendor may omit the same item.
An implementation which does not include a particular option MUST be
prepared to interoperate with another implementation which does
include the option, though perhaps with reduced functionality. In the
same vein an implementation which does include a particular option
MUST be prepared to interoperate with another implementation which
does not include the option (except, of course, for the feature the
option provides.)

I think it's the:
an implementation which does include a particular option MUST be prepared to interoperate with another implementation which does not include the option
part that made us feel like we need to include that clarifying text

@jskeet
Contributor Author

jskeet commented Jun 1, 2023

Hmm... I view "optional within data" as very, very different from "this is an optional feature which may or may not be implemented in a conformant platform".

For example, time is optional but I would still absolutely expect every SDK to reject a CloudEvent with a time value of "yesterday". Would like to discuss this at some point, but it's at least somewhat separable from the data issue. We could always discuss just optionality to start with, as a way of punting the tricky bits of data handling while still feeling productive ;)

@duglin
Collaborator

duglin commented Jun 1, 2023

I think part of the reason we landed where we did is that CE (for the most part) is a format-spec, unlike other specs that control semantics. For example, in the xRegistry spec you'll see statements about what a receiver MUST do when an OPTIONAL attribute appears in a message. At that point it's optional to appear, but the semantics of it (when present) are not. We don't have a lot of words in CE around what a receiver does with a CE once it receives it.

One thing we could consider is adding more normative language to SDK.md to ensure they behave the way we want them to. But then we're working on an "SDK spec" and not the "CE spec".

As for this issue... perhaps what we're looking for is something closer to guidance right now, and maybe that'll turn into RFC2119 language at some point. For example, maybe we start with describing what a CE receiver should be doing if it wants to understand data beyond "it's an array of bytes"... like what a transcoder would do.

For example:

  • if the incoming CE has a datacontenttype you understand then you should interpret the data like this...
  • if you don't understand it, then interpret the data like this ....
  • if there is no datacontenttype then ....

Once we agree on that, and hopefully it's generic and not format specific, we can then decide if any of that should be normative in a spec(s). Maybe?
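
That guidance might be sketched as a decision procedure along these lines (generic, deliberately not normative, names made up):

import json

def interpret_data(body: bytes, datacontenttype):
    """Rough shape of the proposed receiver guidance; a sketch, not spec text."""
    parsers = {"application/json": json.loads}   # content types this receiver understands

    if datacontenttype in parsers:
        # "if the incoming CE has a datacontenttype you understand, then interpret the data like this"
        return parsers[datacontenttype](body)
    if datacontenttype is not None:
        # "if you don't understand it" - keep the data as opaque bytes and pass it along untouched
        return body
    # "if there is no datacontenttype" - fall back to a format-specific default
    # (the JSON format, for example, says to proceed as if it were application/json)
    return json.loads(body)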

@github-actions

github-actions bot commented Jul 2, 2023

This issue is stale because it has been open for 30 days with no
activity. Mark as fresh by updating e.g., adding the comment /remove-lifecycle stale.

@jskeet
Contributor Author

jskeet commented Jul 2, 2023

/remove-lifecycle stale

Still hoping to find time to write up next steps.

@github-actions

github-actions bot commented Aug 3, 2023

This issue is stale because it has been open for 30 days with no
activity. Mark as fresh by updating e.g., adding the comment /remove-lifecycle stale.

@jskeet
Contributor Author

jskeet commented Aug 3, 2023

/remove-lifecycle stale

I'll get back to this some day...

@github-actions

github-actions bot commented Sep 4, 2023

This issue is stale because it has been open for 30 days with no
activity. Mark as fresh by updating e.g., adding the comment /remove-lifecycle stale.

@jskeet
Contributor Author

jskeet commented Sep 4, 2023

/remove-lifecycle stale

@github-actions

github-actions bot commented Oct 5, 2023

This issue is stale because it has been open for 30 days with no
activity. Mark as fresh by updating e.g., adding the comment /remove-lifecycle stale.

@jskeet
Contributor Author

jskeet commented Oct 5, 2023

/remove-lifecycle stale

@github-actions

github-actions bot commented Nov 5, 2023

This issue is stale because it has been open for 30 days with no
activity. Mark as fresh by updating e.g., adding the comment /remove-lifecycle stale.
