[Schema] Converting values that cause conflicts #3497

Open
MovieStoreGuy opened this issue May 11, 2023 · 34 comments
Labels
spec:miscellaneous For issues that don't match any other spec label triage:deciding:community-feedback


@MovieStoreGuy

MovieStoreGuy commented May 11, 2023

What are you trying to achieve?
When building the schema processor for the collector, it was noted that there is no defined behaviour for what should happen if a conversion would override an existing value.

What did you expect to see?

I've currently made the assumption that schema conversions take priority over existing values, and that an error is returned that the collector can consume to emit a metric (or log) reporting how many conflicts were found.

Additional context.

Context: open-telemetry/opentelemetry-collector-contrib#17020 (comment)

@jsuereth
Contributor

jsuereth commented May 15, 2023

We discussed this in the Semantic Convention WG.

Suggestions from there:

  • We think this may happen when folks try to emit "multiple" versions of Schema to alleviate migration friction. We don't have hooks in place to deal with this.
  • We suggest leaving the existing value and NOT overwriting with the change when this scenario occurs.

Specifically, if you have the following attributes:

namespace.attribute1=foo
namespace.attribute2=bar

If you have a rename_attributes of namespace.attribute1: namespace.attribute2, then you would wind up with:

namespace.attribute2=bar

That is, attribute1 does NOT overwrite attribute2, regardless of how the attributes are ordered.

This is specifically in reference to the OTLP data generated.

Additionally, you could leave attribute1 in place as well.
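
For illustration, a minimal sketch of these suggested semantics in Go, using a plain string map in place of the collector's pdata attribute maps (the Rename type and function names here are hypothetical, not an existing API):

package main

import "fmt"

// Rename is one rename_attributes entry (hypothetical helper type).
type Rename struct{ From, To string }

// applySkipConflict applies renames but never overwrites an existing
// target attribute, per the Semantic Convention WG suggestion above.
func applySkipConflict(attrs map[string]string, renames []Rename) {
	for _, r := range renames {
		v, ok := attrs[r.From]
		if !ok {
			continue
		}
		if _, exists := attrs[r.To]; !exists {
			attrs[r.To] = v
		}
		// The source attribute is dropped either way; per the comment
		// above, it could alternatively be left in place.
		delete(attrs, r.From)
	}
}

func main() {
	attrs := map[string]string{
		"namespace.attribute1": "foo",
		"namespace.attribute2": "bar",
	}
	applySkipConflict(attrs, []Rename{{From: "namespace.attribute1", To: "namespace.attribute2"}})
	fmt.Println(attrs) // map[namespace.attribute2:bar]
}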

tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue May 15, 2023
Resolves: open-telemetry#3497

Alternatives Considered
=======================

Instead of requiring overwriting the input data, we could say that the
transformation SHOULD be aborted if a conflict is detected, or that the
transformation SHOULD be ignored if a conflict is detected.

The chosen approach (overwriting) seems to be the best balance between a
reasonable outcome (keeping the more important data, discarding the less
important data) and minimal implementation complexity (aborting and
reporting errors is more complicated and more costly).

Why "SHOULD" and not "MUST"?
============================

There are significant tradeoffs involved in what happens in the schema
transform, and we believe there may be valid use cases where, after careful
consideration, a different approach may be taken (e.g. reject and surface
the problem to the end user). We do not want to prohibit these other
approaches. The goal of this change is merely to have a reasonable default
interpretation of the schema transformation rules.
@tigrannajaryan
Member

We suggest leaving the existing value and NOT overwriting with the change when this scenario occurs.

What exactly does this mean? Do you suggest that the input attributes are preserved and the SchemaURL is not updated? What happens with transforms that were already applied? Do you have to roll them back? This is very expensive to implement.

I don't think this is the best approach.

I submitted a PR where I recommend overwriting. IMO this is the best default approach.

Rejecting is complex and expensive. It requires either performing the transformation on a copy of the data, since it may be aborted at any time, or being able to roll back already-applied changes. This is very expensive unless you have some very efficient copy-on-write data structure or something like that. I don't think we should do it.

Overwriting is much simpler. You just apply the changes to the original data, knowing that there is never a need to abort them. IMO, this should be the philosophy of all schema transforms (existing or future proposals).

I left the door open for implementations that want to carry the burden of much more complex approaches by using a SHOULD clause instead of MUST.
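
To contrast with the "skip" sketch above, the overwrite default is a single in-place mutation; again a hypothetical sketch over a plain map, not the actual pdata API:

// renameOverwrite applies one rename directly to the original data.
// Any existing value at `to` is overwritten, so there is never a need
// to abort or roll anything back.
func renameOverwrite(attrs map[string]string, from, to string) {
	if v, ok := attrs[from]; ok {
		attrs[to] = v
		delete(attrs, from)
	}
}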

@tigrannajaryan
Member

If you have a rename_attributes of namespace.attribute1: namespace.attribute2, then you would wind up with:

namespace.attribute2=bar

That is, attribute1 does NOT overwrite attribute2, regardless of how the attributes are ordered.

I think this is a bad outcome. The resulting data complies with neither the old schema nor the new one. What do we set the SchemaURL to in this case?

@lmolkova
Contributor

lmolkova commented May 15, 2023

What exactly does this mean? Do you suggest that the input attributes are preserved and the SchemaURL is not updated? What happens with transforms that were already applied? Do you have to roll them back? This is very expensive to implement.

We can remove the schema URL from invalid telemetry, or add a way to signal that we could not apply the transformation to given data. I think having a way to signal this is essential for this and other similar problems, and it should be a prerequisite to removing the moratorium on transformations.

@tigrannajaryan
Member

tigrannajaryan commented May 15, 2023

To summarize some options (there are probably more):

Overwrite

Pros

  • Likely simplest and most efficient to implement.

Cons

  • Can lose (overwrite) data if it is not Otel compliant (custom attributes incorrectly using Otel namespaces)

Ignore, Skip

Pros

  • About as simple to implement as Overwrite (perhaps a bit more complex, since existing attributes must be checked)

Cons

  • Can result in output data that is not compliant with any Otel schema, with some attributes complying with the old schema and some with the new.

Preserve and Overwrite

Before overwriting, preserve the value of the conflicting attribute in some other attribute (this could be a dedicated bag of such attributes, or follow some renaming rules).

Pros

  • Also reasonably simple to implement (a bit more complex than Overwrite).
  • No data loss.

Cons

  • It may not be immediately visible where the non-compliant attribute was moved (custom attributes incorrectly using Otel namespaces)

[UPDATE 2023-05-17] A possible way to preserve the conflicting attribute is to rename it and append a suffix equal to the source schema version number. For example when converting the data from 1.0.0 to 1.1.0 we would append a suffix 1.0.0 to the attribute name. Assuming 1.0.0->1.1.0 renames host to host.name and we have the following input data:

{
  "attributes": { "host":  "spool", "host.name": "spool.example.com" }
}

The output data will become:

{
  "attributes": { "host.name":  "spool", "host.name_1.0.0": "spool.example.com" }
}

This approach allows applying multiple version transformations while preserving all conflicting attributes. It also captures, in a human-readable form, which "previous" schema version the original data had when the conflict was detected, which can be a useful hint for understanding the supposed semantics of the original data.

Another possible way to preserve the conflicting attributes would be to put them all as sub-keys under some well-known attribute; however, nested attributes are currently not allowed (see #376).
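
A minimal sketch of the version-suffix variant described above, under the same hypothetical plain-map assumptions as the earlier sketches:

// renamePreserving renames one attribute; a conflicting existing value
// at the target is first preserved under a version-suffixed key.
func renamePreserving(attrs map[string]string, from, to, fromVersion string) {
	v, ok := attrs[from]
	if !ok {
		return
	}
	if old, exists := attrs[to]; exists {
		// e.g. "host.name" becomes "host.name_1.0.0"
		attrs[to+"_"+fromVersion] = old
	}
	attrs[to] = v
	delete(attrs, from)
}

With the input above, renamePreserving(attrs, "host", "host.name", "1.0.0") produces exactly the output shown.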

Detect and Reject/Rollback

Pros

  • Guarantees that the output data is either untouched or fully converted to the new schema without any data loss

Cons

  • Hard and/or inefficient to implement. Needs a data structure that can be rolled back, or needs to work on a copy, etc.

@lmolkova
Contributor

The Ignore, Skip option with a way to notify about broken data could be a first step; it does not seem to block supporting Detect and Reject/Rollback in the future.

@tigrannajaryan
Member

We can remove schema URL from invalid telemetry or add a way to signal that we could not apply transformation on a given data.

I think that's worse than overwriting a non-compliant attribute. To me the offender who incorrectly uses a custom attribute name should be punished; instead we choose collateral damage and lose the schema information :-)

@lmolkova
Contributor

To me the offender who incorrectly uses a custom attribute name should be punished

An application developer (e.g. an early adopter) who has added an attribute that is now used in their alerts, dashboards and business reports should have it all broken because now some automation in their vendor ingestion pipeline decided the schema is more important?

@tigrannajaryan
Member

The Ignore, Skip option with a way to notify about broken data could be a first step; it does not seem to block supporting Detect and Reject/Rollback in the future.

A combination of Overwrite+Detect is also possible. Detect that a conflicting attribute exists, rename it to something that preserves it and then overwrite. This way there is no loss of data, but we still forcefully make the data compliant with the target schema version.

@lmolkova
Contributor

Renaming an attribute set by the application for the sake of compliance with the spec, and breaking their tooling in the process, does not look reasonable.

@lmolkova
Contributor

If some important UX depends on schema compliance, such users will suffer from not getting that experience and will go and fix it themselves.

@tigrannajaryan
Member

To me the offender who incorrectly uses a custom attribute name should be punished

An application developer (e.g. an early adopter) who has added an attribute that is now used in their alerts, dashboards and business reports should have it all broken because now some automation in their vendor ingestion pipeline decided the schema is more important?

Yes, I think so. The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen (I don't see what it has to do with being an early adopter). If they don't follow the rules, there are consequences.

Obviously, this is just one opinion, if we don't find one good approach that we all like then perhaps this needs to be a configurable option.

@tigrannajaryan
Member

tigrannajaryan commented May 15, 2023

Also, a reminder to make sure we are aware of the context of the discussion: the data loss is only applicable to use-cases where the data is actually transformed before storing, e.g. in a Collector transformer processor or equivalent "before-storage" transformer.

Schemas also can be used for query-time transformations where there is no data loss. The "overwrite" rule merely defines how the queries are rewritten.

@lmolkova
Contributor

The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen

Can you point me to this section? I don't remember any limitations on custom attributes, reserving namespaces or anything else.

@tigrannajaryan
Member

The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen

Can you point me to this section? I don't remember any limitations on custom attributes, reserving namespaces or anything else.

See Recommendations for Application Developers.

@lmolkova
Contributor

lmolkova commented May 15, 2023

The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen

Can you point me to this section? I don't remember any limitations on custom attributes, reserving namespaces or anything else.

See Recommendations for Application Developers.

It does not tell me I cannot add an http.foo.bar attribute in my application, and it does not contain any normative language.

@lmolkova
Contributor

And as an early adopter I'm free to follow the advice here: "The name may be generally applicable to applications in the industry. In that case consider submitting a proposal to this specification to add a new name to the semantic conventions, and if necessary also to add a new namespace." I can create http.foo.bar, send the PR to the spec, and eventually it might make it in. I should be able to use it before that officially happens.

@tigrannajaryan
Member

The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen

Can you point me to this section? I don't remember any limitations on custom attributes, reserving namespaces or anything else.

See Recommendations for Application Developers.

It does not tell me I cannot add an http.foo.bar attribute in my application, and it does not contain any normative language.

I think you are right; that page is not well written. I know the intent was precisely to avoid conflicts between custom attributes and Otel names. We need to fix it to make it clearer and use normative language.

@pyohannes
Contributor

Yes, I think so. The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen (I don't see what does it have to do with being an early adopter). If they don't follow the rules there are consequences.

For the Overwrite approach, the consequence of not following the rules is data loss. For the Skip/Ignore approach, the consequence of not following the rules is data that is only partly schema-compliant.

In my opinion, the latter is more user-friendly and less invasive. Many of the OTel users I work with would be very critical of their telemetry data being dropped because they don't follow certain "recommendations".

@tigrannajaryan
Member

In my opinion, the latter is more user-friendly and less invasive. Many of the OTel users I work with would be very critical of their telemetry data being dropped because they don't follow certain "recommendations".

We drop data elsewhere too if it does not fit the requirements, e.g. drop attributes if the count exceeds the configured limit. We record the fact of dropping, but don't keep a copy of dropped data. In that case too the consequence of not following the rules is data-loss. If it is acceptable for setting the attributes in the API then I think it is similarly acceptable for transformations in the Collector.

That said, I don't think that's the best we can do. It seems to me "Preserve and Overwrite" may be a better option. We would preserve the data (moved to a different place, but readily accessible) and also end up with a consistent shape of the output data, complying with the target schema. What does everyone think about this approach?

@tigrannajaryan
Member

@open-telemetry/specs-approvers any thoughts on the possible options outlined above?

@mx-psi
Member

mx-psi commented May 17, 2023

"Preserve and Overwrite" is the option that makes the most sense to me, either with an explicit bag of attributes or just with a count of the overwritten attributes. Producing non-compliant data as "Ignore" would do seems like a no-go to me, and it is unclear to me how we would implement "Detect and Reject/Rollback".

@tigrannajaryan
Member

I updated "Preserve and Overwrite" option above to show how we can possibly preserve the conflicting data. I am inclining towards this solution, it seem the best available from the 4 listed options so far.

@lmolkova
Contributor

lmolkova commented May 17, 2023

"Preserve and Override" changes the name of an attribute, which can break user dashboards, alerts, and queries. Given that schema transformation may be part of the vendor ingestion pipeline, users won't even control it. I.e. this change would be breaking for existing user infra build on top of this data.

When the conversion happens inside the user's pipeline, investigating why the data was changed and why everything is broken will be difficult.

So it seems we're ready to break user tooling for the sake of compliance with a later version of the OTel schema. To me, the benefit does not outweigh the risk, and I'm still for "Ignore, Skip" + warn.

@lmolkova
Contributor

Thinking more about it made me realize that schema transformation should, by default, preserve the original attributes to avoid breaking existing custom tooling built on top of them - #3510

@mx-psi
Member

mx-psi commented May 17, 2023

Thinking more about it made me realize that schema transformation should, by default, preserve the original attributes to avoid breaking existing custom tooling built on top of them

I don't have a strong opinion on this (one could also argue that vendors are the ones that need to handle that), but I feel like this is independent of this issue; we should discuss this issue assuming the current state of the spec, which AIUI is that original attributes are removed.

@lmolkova
Contributor

lmolkova commented May 17, 2023

We drop data elsewhere too if it does not fit the requirements, e.g. drop attributes if the count exceeds the configured limit. We record the fact of dropping, but don't keep a copy of dropped data. In that case too the consequence of not following the rules is data-loss. If it is acceptable for setting the attributes in the API then I think it is similarly acceptable for transformations in the Collector.

The difference is that the current behavior happens in-process; it's stable and easy to discover with tests. We also give users the ability to control this behavior by configuring limits.

In the case of schema transformation, this behavior becomes unstable: someone updates to the latest version (whether in a processor inside the user's collector pipeline or in the vendor's pipeline) and suddenly your data is dropped or moved. This is never discovered by tests, because it happens much later and out-of-process.

I suggest the following in #3510

  • by default schema transformation preserves all original attributes as they are. It does not rename but adds new ones.
    • if there is a collision it preserves the old value.
  • it has a mode in which it's allowed to break stuff; there it can drop, rename, or do anything else. This is opt-in. Vendors are free to opt their users in via distros.
  • In any case, we need a warning/notification mechanism for schema transformation so it warns about collisions and unmet expectations (with throttling).

@tigrannajaryan
Member

by default schema transformation preserves all original attributes as they are. It does not rename but adds new ones.

  • if there is a collision it preserves the old value.

@lmolkova From what I understand this is the same as the "Ignore, Skip" approach described above. Can you clarify what you suggest should happen with the SchemaURL if there is a collision? Do we update the SchemaURL or keep the old value? Also, how do we address the fact that, after such a partially applied transformation, the data as a whole does not comply with any Otel semconv version, neither the old one nor the new one?

@lmolkova
Contributor

lmolkova commented May 17, 2023

@lmolkova From what I understand this is the same as the "Ignore, Skip" approach described above.

Correct.

I meant preserving the old value on the same attribute. I.e.

namespace.attribute1=foo
namespace.attribute2=bar

If you have a rename_attributes of namespace.attribute1: namespace.attribute2, then you would wind up with:

namespace.attribute1=foo
namespace.attribute2=bar

Can you clarify what you suggest should happen with the SchemaURL if there is a collision?

A few options:

  1. Remove the schema URL or register an invalid value, e.g. otel.io/schemas/inconsistent, otel.io/schemas/unknown, otel.io/schemas/undetermined.
  2. Roll back.
    • With the default mode that does not remove old attributes, this should be easier.
    • With the opt-in "break things" mode, there is no problem: put a new schema URL and modify the telemetry as needed.

If we cannot provide reasonable defaults, stability guarantees, performance, and observability into edge cases for schema transformation, then it should just stay experimental and not be used for stable conventions.
I assume the bar for attribute renames will be very high after stability is declared, and I don't quite understand why schema transformation will be necessary.

@tigrannajaryan
Member

I spent some time to get a better feel for what it would take to implement the "Rollback" option and what the impact will be.

I wrote a toy implementation and ran some benchmarks with rollback-ability enabled and disabled.

Toy source code here: https://github.com/tigrannajaryan/telemetry-schema/blob/main/schema/compiled/common_actions.go

The schema converter creates a changelog that can be used for rolling the changes back (for whatever reason, including if conflicts are detected).
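
To give a feel for the approach, a rough sketch of such a changelog in Go (hypothetical types and names; the linked toy source is the actual implementation):

// attrChange records one applied mutation so it can be undone.
type attrChange struct {
	key     string
	oldVal  string
	existed bool
}

// changeLog wraps an attribute map and records every mutation in
// application order so the whole conversion can be rolled back.
type changeLog struct {
	attrs   map[string]string
	changes []attrChange
}

func (c *changeLog) set(key, val string) {
	old, existed := c.attrs[key]
	c.changes = append(c.changes, attrChange{key, old, existed})
	c.attrs[key] = val
}

func (c *changeLog) remove(key string) {
	if old, existed := c.attrs[key]; existed {
		c.changes = append(c.changes, attrChange{key, old, true})
		delete(c.attrs, key)
	}
}

// rollback undoes all recorded changes in reverse order, e.g. when a
// conflict is detected midway through a conversion.
func (c *changeLog) rollback() {
	for i := len(c.changes) - 1; i >= 0; i-- {
		ch := c.changes[i]
		if ch.existed {
			c.attrs[ch.key] = ch.oldVal
		} else {
			delete(c.attrs, ch.key)
		}
	}
	c.changes = nil
}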

Here are the benchmarking results:

name                                    time/op
ConvertSchema/Spans/Non-rollbackable-8   101µs ± 4%
ConvertSchema/Spans/Rollbackable-8       143µs ±12%
Unmarshal/Spans-8                        518µs ± 4%
Marshal/Spans-8                          227µs ± 7%

The first two lines are schema conversion of 100 spans, with and without rollback capability enabled (but without doing any rollback; this just creates the changelog). It tests only the success case, when there are no conflicts and there is no need to roll back. I believe this is the most important case to care about from a performance perspective.

As you can see, enabling the changelog adds about 40µs (per 100 spans), slowing down the conversion process by roughly 1.4×. As a reference point I also included benchmarks for marshalling/unmarshalling the exact same data, which is roughly what a Collector with an otlp receiver, a schema processor and an otlp exporter would do.

As a rough number, enabling the changelog would add about 5% of overall CPU time, which is not insignificant, but I don't think it is a dealbreaker.

Furthermore, I think there is room to improve performance of the changelog if needed (at the cost of complexity and maintainability).

Also take the benchmark with a huge grain of salt. It is just one scenario, with one particular schema conversion. We can validate this further, but for now it is enough for me to get a very approximate sense of expected performance.

So, given all the above, in my opinion performance is not going to be the decisive factor. At the moment I am more concerned with the complexity of implementing the changelog and the opportunity to introduce bugs. IMO it is not terribly complicated, but it is still relatively easy to make a mistake, and especially since it is in code that is exercised rarely (rollbacks will be rare), the bugs introduced may be hard to notice.

To be clear: I would prefer the "Rollback" option since it results in the best functional outcome IMO: no corrupt data belonging to no schema, clear ability to see the errors if needed, etc.

I would like an opinion of @open-telemetry/collector-approvers @open-telemetry/collector-contrib-approvers @open-telemetry/collector-maintainers @open-telemetry/collector-contrib-maintainer on this.

@MovieStoreGuy
Author

To be clear: I would prefer the "Rollback" option since it results in the best functional outcome IMO: no corrupt data belonging to no schema, clear ability to see the errors if needed, etc.

I worry about rollbacks because they feel like silent failure with more steps. I also worry about the cognitive burden of developing schemas, since it could be possible to add a conflict without realising it.

If this has been deployed on a collector gateway so that data mutations happen on egress to the vendor, and a conflict causes a rollback of the data, then my visualisations and alerts break if I expect the data to be of a fixed version.

I'm not sure of the likelihood of this scenario, but I can see the chance of it happening increasing during the transition from manual instrumentation to a hybrid with auto-instrumentation.

@MovieStoreGuy
Author

@tigrannajaryan , let me bring this to the sig.

@tigrannajaryan
Member

@tigrannajaryan , let me bring this to the sig.

@MovieStoreGuy thanks, please do.

@dmitryax
Member

The level of complexity required to support the rollback mechanism seems excessive to me, considering that such conflicts should be infrequent occurrences. Additionally, if I configure the schema transform processor to generate a specific schema version, I would expect the processor to make its best attempt to translate the data to that version. Otherwise, I would opt to use a different processor and modify the data accordingly.

I believe we can produce the data as close to the required state as possible. Even if we encounter a conflict, we can utilize the value set by the user, which may not precisely adhere to the specification but will, at least, have the correct attribute name. We might encounter this situation regardless in the following scenario:

  • We have a rename_attributes of namespace.attribute1: namespace.attribute2 for the version 1 -> 1.1 translation
  • User sends namespace.attribute2=bar with the schema incorrectly identified as version 1

In that case, translation isn't triggered, and we simply update the schema from version 1 to 1.1. Therefore, I believe the option originally suggested by @lmolkova and @jsuereth (Ignore, Skip in the list) is reasonable, but I'm not sure if it should be the default behavior. It can be a boolean configuration option like override_existing.
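
A sketch of how such an option could look, with an override_existing-style flag selecting between the Overwrite and Ignore/Skip behaviors discussed above (hypothetical name and signature, not the actual processor configuration):

// renameAttr applies one rename; overrideExisting selects between the
// Overwrite and Ignore/Skip conflict behaviors.
func renameAttr(attrs map[string]string, from, to string, overrideExisting bool) {
	v, ok := attrs[from]
	if !ok {
		return
	}
	if _, exists := attrs[to]; exists && !overrideExisting {
		return // keep the user's existing value (and the source attribute)
	}
	attrs[to] = v
	delete(attrs, from)
}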
