Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 support in validation, and some parsers and formatters #537

Merged
merged 5 commits into from Jan 23, 2024

Conversation

ywwg
Copy link
Contributor

@ywwg ywwg commented Nov 29, 2023

Adds UTF8 metric and label name validation mode and adds support for parsing and formatting the UTF8 text format.

The expfmt package will use the new quoting syntax iff the metric name does not conform to the legacy name format. Old metrics will still be presented as foo{} whereas utf8 metrics will be presented as {"foo"}. (Formatters already assume that names are already validated and will not throw an error if names are invalid.)

The model package has a new library flag, NameValidationScheme, which determines how validation will be done on all calls. If a caller sometimes needs to validate utf8 but sometimes also needs to validate against the legacy scheme, they should use IsValidLegacyMetricName where needed. Changing NameValidationScheme after startup, especially if there are multiple goroutines active, is not recommended and can result in undefined behavior.

NameValidationScheme is Legacy by default, so clients of this library should see no changes in behavior unless they change that value.

There is some minor style cleanup of redundant type names in tests.

Works towards #527

Future work will be needed on the content negotiation code.

@ywwg ywwg changed the title testing for quoting things WIP: UTF-8 support Nov 29, 2023
@ywwg ywwg force-pushed the owilliams/quoted-metric-name branch 2 times, most recently from 7ea4b4c to e2602ac Compare December 4, 2023 18:25
expfmt/text_create.go Outdated Show resolved Hide resolved
expfmt/text_create.go Outdated Show resolved Hide resolved
@ywwg ywwg marked this pull request as ready for review December 4, 2023 19:56
@ywwg ywwg requested a review from beorn7 December 4, 2023 19:59
@beorn7
Copy link
Member

beorn7 commented Dec 5, 2023

I'll try to get to this ASAP.

@ywwg I assume this is ready for review and will change the title accordingly.

@roidelapluie I plan to do a code level review, but eventually, we'll need your approval, too.

@beorn7 beorn7 changed the title WIP: UTF-8 support UTF-8 support Dec 5, 2023
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this.

As discussed, the package-level "config variable" might be the least bad way of "flipping the switch". However, I'm not sure if just one switch is enough for our use case. I see three modes:

  1. Use this library in exactly the old way, i.e. this PR shouldn't change the behavior at all.
  2. Use this library in the new way, i.e. all UTF-8 characters are allowed in names, and new exposition formats are used accordingly.
  3. Allow the new way but still accommodate legacy clients, i.e. special UTF-8 characters need to be escaped with the schema from the design doc.

It seems this PR fulfills (1) and (2), by setting NameValidationScheme accordingly. However, the expfmt package also implements content negotiation, and this has to be modified to allow mode (3).

In different news, text_parse.go eventually needs an update to parse the new classic text format. In Prometheus, we use a different parser implementation, so it's not in the critical path to make Prometheus work, but there should be at least a TODO somewhere.

expfmt/openmetrics_create.go Outdated Show resolved Hide resolved
expfmt/openmetrics_create.go Outdated Show resolved Hide resolved
expfmt/openmetrics_create.go Outdated Show resolved Hide resolved
expfmt/openmetrics_create_test.go Outdated Show resolved Hide resolved
expfmt/text_create.go Outdated Show resolved Hide resolved
model/labels.go Outdated Show resolved Hide resolved
model/metric.go Outdated Show resolved Hide resolved
@beorn7
Copy link
Member

beorn7 commented Dec 13, 2023

About the strconv.Quote thing: text_create.go contains writeEscapedString, which should do the trick.

@ywwg ywwg changed the title UTF-8 support UTF-8 support in parsers and formatters Dec 14, 2023
@ywwg
Copy link
Contributor Author

ywwg commented Dec 14, 2023

However, the expfmt package also implements content negotiation, and this has to be modified to allow mode (3).

I believe this is covered with the existence of the IsValidLegacyMetricName function -- the library can be enabled for mode 2 in general, and then specific calls can apply that function when needed.

Ah, I missed text_parse.go! I will put that on the list

@beorn7
Copy link
Member

beorn7 commented Dec 14, 2023

However, the expfmt package also implements content negotiation, and this has to be modified to allow mode (3).

I believe this is covered with the existence of the IsValidLegacyMetricName function -- the library can be enabled for mode 2 in general, and then specific calls can apply that function when needed.

Ah, I missed text_parse.go! I will put that on the list

I was more thinking about the code in expfmt/encode.go and expfmt/expfmt.go.

We need new Content-Type strings for "format XYZ but with arbitrary UTF-8 characters allowed in names" (see design doc).

And then the Negotiate and NegotiateIncludingOpenMetrics functions need to return them appropriately, and then NewEncoder has to return a correspondingly configured Encoder.

@ywwg
Copy link
Contributor Author

ywwg commented Dec 15, 2023

ah ok. I'll take a look

@ywwg
Copy link
Contributor Author

ywwg commented Dec 15, 2023

OK so the question is how to select the escaping mode for old recipients, and that mode might be different for different situations. I think we'll use a similar pattern to the parsing situation: a library-wide setting for default escaping, and then a new function where users can specify the scaping.

So the current MetricFamilyToOpenMetrics and MetricFamilyToText functions will use the library-default escaping, and then I'll make a couple new functions that take an extra parameter to specify an escaping on a call-by-call basis.

@ywwg
Copy link
Contributor Author

ywwg commented Dec 15, 2023

It might make more sense to shrink this PR down so it just handles the parsing, and then do another one to handing the exposition format creation, what do you think? Otherwise it might start to get large

@beorn7
Copy link
Member

beorn7 commented Dec 15, 2023

You can divide things as you see fit.

Note that parsing isn't really urgent because the parser here in prometheus/common isn't even used by prometheus/prometheus. (I'm not even sure where it is used at all.)

In contrast, the code path in expfmt/encode.go is critical for prometheus/client_golang because this is how the format is selected.

@ywwg
Copy link
Contributor Author

ywwg commented Dec 15, 2023

open question: who is in charge of which escaping mechanism gets used? should that be part of the content negotiation, like the scraper says "I wish to have things escaped this way" or is that a setting on the metrics producer?

@ywwg ywwg marked this pull request as draft December 15, 2023 21:31
@ywwg
Copy link
Contributor Author

ywwg commented Dec 15, 2023

converting to draft while I do some work

@beorn7
Copy link
Member

beorn7 commented Dec 19, 2023

open question: who is in charge of which escaping mechanism gets used? should that be part of the content negotiation, like the scraper says "I wish to have things escaped this way" or is that a setting on the metrics producer?

The MVP here is that any UTF-8-enabled binary instrumented with prometheus/client_golang should still be able to be scraped by a non-UTF-8-enabled scraper. I would assume that means to use the U__ escaping schema described in the design doc. This should be done via content negotiation, as also described in the design doc.

Allowing other escaping schemas would IMHO be "nice to have" (so that users can pick them if they need them for compatibility with legacy systems). Those other escaping schemas already exist, but they have no support in client_golang. Therefore, I see no pressing needs to add support for them now. (One might even argue that we actually want to discourage the other schemas ASAP, and providing UTF-8 support plus a sophisticated escaping schema for old backends has been invented precisely to get rid of those other, problematic escaping schemas.) But if we want to support them, I guess we should use the same knob to select them, i.e. content negotiation. It ultimately depends on the use case, but I'm not sure yet what the use case for those other escaping schemas will be.

@ywwg
Copy link
Contributor Author

ywwg commented Dec 28, 2023

After doing a bunch of work, indeed I think it's best to put content negotiation in a separate PR. I believe this PR still does what it says and should not negatively affect people using this library (unless they turn on the UTF8 validation mode, which is not really supported anywhere else and will result in errors)

@ywwg ywwg marked this pull request as ready for review December 28, 2023 19:40
@ywwg
Copy link
Contributor Author

ywwg commented Dec 28, 2023

(I will squash this and sign the commit when review is done to solve the DCO failure)

@beorn7
Copy link
Member

beorn7 commented Jan 9, 2024

After doing a bunch of work, indeed I think it's best to put content negotiation in a separate PR. I believe this PR still does what it says and should not negatively affect people using this library (unless they turn on the UTF8 validation mode, which is not really supported anywhere else and will result in errors)

Fair enough. But I think we need to document a bit better what the precise behavior after this PR is and what still has to be done. I'm mostly worried that it could now happen very easily that somebody uses the code to produce metrics rendered in the new UTF-8 capable style after negotiating the existing old formats. (Once content negotiation is fully implemented, the encoders would use the underscore encoding described in the design doc in that case.)

I guess the doc comment for func MetricFamilyToText and func MetricFamilyToOpenMetrics should include how the output changes if metric names or label names are using characters excluded in the legacy format, and that it is the callers responsibility to not include names with those characters if valid legacy format is desired.

@beorn7
Copy link
Member

beorn7 commented Jan 9, 2024

To clarify: Despite the title, this PR doesn't really change the parsing side, does it? I mean, the configurable validation implicitly adds support to the protobuf decoder. But the other parsers don't change. Maybe that could be added as a TODO.

Also, a TODO in the content negotiation part might help to avoid confusion.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codewise LGTM. Just need the hints in doc comments as discussed in the comments.

model/metric.go Outdated
"unicode/utf8"
)

type validationSchemeId int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go pseudo enums keep confusing me. Maybe it's better to make this type exported? And here or below add doc comments about how the two options change the behavior (which might just be a reference to IsValidMetricName below, like: "See IsValidMetricName for details how the validation changes.")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ywwg ywwg force-pushed the owilliams/quoted-metric-name branch from 3078b7f to cbd00f5 Compare January 11, 2024 14:15
@ywwg ywwg changed the title UTF-8 support in parsers and formatters UTF-8 support in validation, and some parsers and formatters Jan 11, 2024
@ywwg
Copy link
Contributor Author

ywwg commented Jan 11, 2024

I guess the doc comment for func MetricFamilyToText and func MetricFamilyToOpenMetrics should include how the output changes if metric names or label names are using characters excluded in the legacy format

the comment already says "this function assumes the input is already sanitized and does not perform any
// sanity checks. If the input contains duplicate metrics or invalid metric or
// label names, the conversion will result in invalid text format output."

Do you want more than that?

@ywwg
Copy link
Contributor Author

ywwg commented Jan 11, 2024

oh I get it now, adding...

Signed-off-by: Owen Williams <owen.williams@grafana.com>
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much. And sorry for requesting a few more doc comment tweaks.

About the TODO I would like to see in the content negotiation part: In https://github.com/prometheus/common/blob/main/expfmt/expfmt.go , there should be a TODO that reads something like this: "The content-type strings here are all for the legacy exposition formats, where valid characters for metric names and label names are limited. The support for arbitrary UTF-8 characters in those names is already partially implemented in this module (see model.ValidationScheme), but to actually use it on the wire, new content-type strings have to be agreed upon and added here."

expfmt/openmetrics_create.go Show resolved Hide resolved
expfmt/text_create.go Show resolved Hide resolved
@@ -330,32 +324,62 @@ func writeSample(
return written, nil
}

// writeLabelPairs converts a slice of LabelPair proto messages plus the
// writeNameAndLabelPairs converts a slice of LabelPair proto messages plus the
// explicitly given additional label pair into text formatted as required by the
// text format and writes it to 'w'. An empty slice in combination with an empty
// string 'additionalLabelName' results in nothing being written. Otherwise, the
// label pairs are written, escaped as required by the text format, and enclosed
// in '{...}'. The function returns the number of bytes written and any error
// encountered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment here would benefit from an update for the new features (additionally passed name, special syntax/quoting for name and label names containing formerly invalid characters).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much.

But what do you think about my earlier comment? This one:

About the TODO I would like to see in the content negotiation part: In https://github.com/prometheus/common/blob/main/expfmt/expfmt.go , there should be a TODO that reads something like this: "The content-type strings here are all for the legacy exposition formats, where valid characters for metric names and label names are limited. The support for arbitrary UTF-8 characters in those names is already partially implemented in this module (see model.ValidationScheme), but to actually use it on the wire, new content-type strings have to be agreed upon and added here."

Maybe you just missed it?

Signed-off-by: Owen Williams <owen.williams@grafana.com>
@ywwg
Copy link
Contributor Author

ywwg commented Jan 22, 2024

ah, sorry about that! done now

@beorn7 beorn7 merged commit bd0376d into prometheus:main Jan 23, 2024
6 checks passed
@beorn7
Copy link
Member

beorn7 commented Jan 23, 2024

Thank you very much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

2 participants