UTF-8 support in validation, and some parsers and formatters #537

ywwg · 2023-11-29T17:23:10Z

Adds UTF8 metric and label name validation mode and adds support for parsing and formatting the UTF8 text format.

The expfmt package will use the new quoting syntax iff the metric name does not conform to the legacy name format. Old metrics will still be presented as foo{} whereas utf8 metrics will be presented as {"foo"}. (Formatters already assume that names are already validated and will not throw an error if names are invalid.)

The model package has a new library flag, NameValidationScheme, which determines how validation will be done on all calls. If a caller sometimes needs to validate utf8 but sometimes also needs to validate against the legacy scheme, they should use IsValidLegacyMetricName where needed. Changing NameValidationScheme after startup, especially if there are multiple goroutines active, is not recommended and can result in undefined behavior.

NameValidationScheme is Legacy by default, so clients of this library should see no changes in behavior unless they change that value.

There is some minor style cleanup of redundant type names in tests.

Works towards #527

Future work will be needed on the content negotiation code.

expfmt/text_create.go

beorn7 · 2023-12-05T15:23:13Z

I'll try to get to this ASAP.

@ywwg I assume this is ready for review and will change the title accordingly.

@roidelapluie I plan to do a code level review, but eventually, we'll need your approval, too.

beorn7

Thanks for doing this.

As discussed, the package-level "config variable" might be the least bad way of "flipping the switch". However, I'm not sure if just one switch is enough for our use case. I see three modes:

Use this library in exactly the old way, i.e. this PR shouldn't change the behavior at all.
Use this library in the new way, i.e. all UTF-8 characters are allowed in names, and new exposition formats are used accordingly.
Allow the new way but still accommodate legacy clients, i.e. special UTF-8 characters need to be escaped with the schema from the design doc.

It seems this PR fulfills (1) and (2), by setting NameValidationScheme accordingly. However, the expfmt package also implements content negotiation, and this has to be modified to allow mode (3).

In different news, text_parse.go eventually needs an update to parse the new classic text format. In Prometheus, we use a different parser implementation, so it's not in the critical path to make Prometheus work, but there should be at least a TODO somewhere.

expfmt/openmetrics_create.go

expfmt/openmetrics_create_test.go

expfmt/text_create.go

model/labels.go

model/metric.go

beorn7 · 2023-12-13T15:30:17Z

About the strconv.Quote thing: text_create.go contains writeEscapedString, which should do the trick.

ywwg · 2023-12-14T15:41:20Z

However, the expfmt package also implements content negotiation, and this has to be modified to allow mode (3).

I believe this is covered with the existence of the IsValidLegacyMetricName function -- the library can be enabled for mode 2 in general, and then specific calls can apply that function when needed.

Ah, I missed text_parse.go! I will put that on the list

beorn7 · 2023-12-14T17:32:08Z

However, the expfmt package also implements content negotiation, and this has to be modified to allow mode (3).

I believe this is covered with the existence of the IsValidLegacyMetricName function -- the library can be enabled for mode 2 in general, and then specific calls can apply that function when needed.

Ah, I missed text_parse.go! I will put that on the list

I was more thinking about the code in expfmt/encode.go and expfmt/expfmt.go.

We need new Content-Type strings for "format XYZ but with arbitrary UTF-8 characters allowed in names" (see design doc).

And then the Negotiate and NegotiateIncludingOpenMetrics functions need to return them appropriately, and then NewEncoder has to return a correspondingly configured Encoder.

ywwg · 2023-12-15T13:59:34Z

ah ok. I'll take a look

ywwg · 2023-12-15T14:16:50Z

OK so the question is how to select the escaping mode for old recipients, and that mode might be different for different situations. I think we'll use a similar pattern to the parsing situation: a library-wide setting for default escaping, and then a new function where users can specify the scaping.

So the current MetricFamilyToOpenMetrics and MetricFamilyToText functions will use the library-default escaping, and then I'll make a couple new functions that take an extra parameter to specify an escaping on a call-by-call basis.

ywwg · 2023-12-15T14:28:12Z

It might make more sense to shrink this PR down so it just handles the parsing, and then do another one to handing the exposition format creation, what do you think? Otherwise it might start to get large

beorn7 · 2023-12-15T14:59:38Z

You can divide things as you see fit.

Note that parsing isn't really urgent because the parser here in prometheus/common isn't even used by prometheus/prometheus. (I'm not even sure where it is used at all.)

In contrast, the code path in expfmt/encode.go is critical for prometheus/client_golang because this is how the format is selected.

ywwg · 2023-12-15T21:31:07Z

open question: who is in charge of which escaping mechanism gets used? should that be part of the content negotiation, like the scraper says "I wish to have things escaped this way" or is that a setting on the metrics producer?

ywwg · 2023-12-15T21:31:42Z

converting to draft while I do some work

beorn7 · 2023-12-19T20:38:23Z

open question: who is in charge of which escaping mechanism gets used? should that be part of the content negotiation, like the scraper says "I wish to have things escaped this way" or is that a setting on the metrics producer?

The MVP here is that any UTF-8-enabled binary instrumented with prometheus/client_golang should still be able to be scraped by a non-UTF-8-enabled scraper. I would assume that means to use the U__ escaping schema described in the design doc. This should be done via content negotiation, as also described in the design doc.

Allowing other escaping schemas would IMHO be "nice to have" (so that users can pick them if they need them for compatibility with legacy systems). Those other escaping schemas already exist, but they have no support in client_golang. Therefore, I see no pressing needs to add support for them now. (One might even argue that we actually want to discourage the other schemas ASAP, and providing UTF-8 support plus a sophisticated escaping schema for old backends has been invented precisely to get rid of those other, problematic escaping schemas.) But if we want to support them, I guess we should use the same knob to select them, i.e. content negotiation. It ultimately depends on the use case, but I'm not sure yet what the use case for those other escaping schemas will be.

ywwg · 2023-12-28T19:40:45Z

After doing a bunch of work, indeed I think it's best to put content negotiation in a separate PR. I believe this PR still does what it says and should not negatively affect people using this library (unless they turn on the UTF8 validation mode, which is not really supported anywhere else and will result in errors)

ywwg · 2023-12-28T19:44:48Z

(I will squash this and sign the commit when review is done to solve the DCO failure)

beorn7 · 2024-01-09T14:57:27Z

After doing a bunch of work, indeed I think it's best to put content negotiation in a separate PR. I believe this PR still does what it says and should not negatively affect people using this library (unless they turn on the UTF8 validation mode, which is not really supported anywhere else and will result in errors)

Fair enough. But I think we need to document a bit better what the precise behavior after this PR is and what still has to be done. I'm mostly worried that it could now happen very easily that somebody uses the code to produce metrics rendered in the new UTF-8 capable style after negotiating the existing old formats. (Once content negotiation is fully implemented, the encoders would use the underscore encoding described in the design doc in that case.)

I guess the doc comment for func MetricFamilyToText and func MetricFamilyToOpenMetrics should include how the output changes if metric names or label names are using characters excluded in the legacy format, and that it is the callers responsibility to not include names with those characters if valid legacy format is desired.

beorn7 · 2024-01-09T15:02:16Z

To clarify: Despite the title, this PR doesn't really change the parsing side, does it? I mean, the configurable validation implicitly adds support to the protobuf decoder. But the other parsers don't change. Maybe that could be added as a TODO.

Also, a TODO in the content negotiation part might help to avoid confusion.

beorn7

Codewise LGTM. Just need the hints in doc comments as discussed in the comments.

beorn7 · 2024-01-09T14:46:25Z

model/metric.go

+	"unicode/utf8"
+)
+
+type validationSchemeId int


Go pseudo enums keep confusing me. Maybe it's better to make this type exported? And here or below add doc comments about how the two options change the behavior (which might just be a reference to IsValidMetricName below, like: "See IsValidMetricName for details how the validation changes.")

ywwg · 2024-01-11T14:21:06Z

I guess the doc comment for func MetricFamilyToText and func MetricFamilyToOpenMetrics should include how the output changes if metric names or label names are using characters excluded in the legacy format

the comment already says "this function assumes the input is already sanitized and does not perform any
// sanity checks. If the input contains duplicate metrics or invalid metric or
// label names, the conversion will result in invalid text format output."

Do you want more than that?

ywwg · 2024-01-11T14:22:25Z

oh I get it now, adding...

Signed-off-by: Owen Williams <owen.williams@grafana.com>

beorn7

Thank you very much. And sorry for requesting a few more doc comment tweaks.

About the TODO I would like to see in the content negotiation part: In https://github.com/prometheus/common/blob/main/expfmt/expfmt.go , there should be a TODO that reads something like this: "The content-type strings here are all for the legacy exposition formats, where valid characters for metric names and label names are limited. The support for arbitrary UTF-8 characters in those names is already partially implemented in this module (see model.ValidationScheme), but to actually use it on the wire, new content-type strings have to be agreed upon and added here."

expfmt/openmetrics_create.go

expfmt/text_create.go

beorn7 · 2024-01-12T15:33:29Z

expfmt/text_create.go

@@ -330,32 +324,62 @@ func writeSample(
 	return written, nil
 }

-// writeLabelPairs converts a slice of LabelPair proto messages plus the
+// writeNameAndLabelPairs converts a slice of LabelPair proto messages plus the
 // explicitly given additional label pair into text formatted as required by the
 // text format and writes it to 'w'. An empty slice in combination with an empty
 // string 'additionalLabelName' results in nothing being written. Otherwise, the
 // label pairs are written, escaped as required by the text format, and enclosed
 // in '{...}'. The function returns the number of bytes written and any error
 // encountered.


The doc comment here would benefit from an update for the new features (additionally passed name, special syntax/quoting for name and label names containing formerly invalid characters).

Signed-off-by: Owen Williams <owen.williams@grafana.com>

…ic-name

Signed-off-by: Owen Williams <owen.williams@grafana.com>

beorn7

Thank you very much.

But what do you think about my earlier comment? This one:

About the TODO I would like to see in the content negotiation part: In https://github.com/prometheus/common/blob/main/expfmt/expfmt.go , there should be a TODO that reads something like this: "The content-type strings here are all for the legacy exposition formats, where valid characters for metric names and label names are limited. The support for arbitrary UTF-8 characters in those names is already partially implemented in this module (see model.ValidationScheme), but to actually use it on the wire, new content-type strings have to be agreed upon and added here."

Maybe you just missed it?

Signed-off-by: Owen Williams <owen.williams@grafana.com>

ywwg · 2024-01-22T14:17:00Z

ah, sorry about that! done now

beorn7 · 2024-01-23T15:24:40Z

Thank you very much.

ywwg changed the title ~~testing for quoting things~~ WIP: UTF-8 support Nov 29, 2023

ywwg force-pushed the owilliams/quoted-metric-name branch 2 times, most recently from 7ea4b4c to e2602ac Compare December 4, 2023 18:25

ywwg commented Dec 4, 2023

View reviewed changes

expfmt/text_create.go Outdated Show resolved Hide resolved

ywwg commented Dec 4, 2023

View reviewed changes

expfmt/text_create.go Outdated Show resolved Hide resolved

ywwg marked this pull request as ready for review December 4, 2023 19:56

ywwg requested a review from beorn7 December 4, 2023 19:59

beorn7 requested a review from roidelapluie December 5, 2023 15:18

beorn7 changed the title ~~WIP: UTF-8 support~~ UTF-8 support Dec 5, 2023

ywwg mentioned this pull request Dec 8, 2023

UTF-8: Add partial support for parsing UTF-8 metric and label names prometheus/prometheus#13271

Merged

beorn7 requested changes Dec 13, 2023

View reviewed changes

ywwg changed the title ~~UTF-8 support~~ UTF-8 support in parsers and formatters Dec 14, 2023

ywwg mentioned this pull request Dec 14, 2023

Implement changes needed to support UTF-8 metric and label names in Prometheus #527

Open

4 tasks

ywwg marked this pull request as draft December 15, 2023 21:31

ywwg marked this pull request as ready for review December 28, 2023 19:40

ywwg mentioned this pull request Dec 28, 2023

UTF8: Content negotiation (encoding side) ywwg/common#2

Closed

beorn7 requested changes Jan 9, 2024

View reviewed changes

ywwg force-pushed the owilliams/quoted-metric-name branch from 3078b7f to cbd00f5 Compare January 11, 2024 14:15

ywwg changed the title ~~UTF-8 support in parsers and formatters~~ UTF-8 support in validation, and some parsers and formatters Jan 11, 2024

UTF-8 support in metric and label names

0940b71

Signed-off-by: Owen Williams <owen.williams@grafana.com>

ywwg force-pushed the owilliams/quoted-metric-name branch from cbd00f5 to 0940b71 Compare January 11, 2024 14:24

ywwg mentioned this pull request Jan 11, 2024

Epic: Implement UTF-8 character support for label and metric names prometheus/prometheus#13095

Open

24 tasks

beorn7 requested changes Jan 12, 2024

View reviewed changes

ywwg added 3 commits January 18, 2024 09:44

Address notes

93343e3

Signed-off-by: Owen Williams <owen.williams@grafana.com>

Merge remote-tracking branch 'origin/main' into owilliams/quoted-metr…

98bf5a7

…ic-name

lint

ca48097

Signed-off-by: Owen Williams <owen.williams@grafana.com>

ywwg mentioned this pull request Jan 18, 2024

UTF8: Content negotiation (encoding side) #570

Merged

beorn7 requested changes Jan 18, 2024

View reviewed changes

Add comment about negotiation

d6767c6

Signed-off-by: Owen Williams <owen.williams@grafana.com>

beorn7 approved these changes Jan 23, 2024

View reviewed changes

beorn7 merged commit bd0376d into prometheus:main Jan 23, 2024
6 checks passed

fedetorres93 mentioned this pull request Feb 7, 2024

UTF-8 support in metric and label names prometheus/client_java#922

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-8 support in validation, and some parsers and formatters #537

UTF-8 support in validation, and some parsers and formatters #537

ywwg commented Nov 29, 2023 •

edited

beorn7 commented Dec 5, 2023

beorn7 left a comment

beorn7 commented Dec 13, 2023

ywwg commented Dec 14, 2023

beorn7 commented Dec 14, 2023

ywwg commented Dec 15, 2023

ywwg commented Dec 15, 2023

ywwg commented Dec 15, 2023

beorn7 commented Dec 15, 2023

ywwg commented Dec 15, 2023

ywwg commented Dec 15, 2023

beorn7 commented Dec 19, 2023

ywwg commented Dec 28, 2023

ywwg commented Dec 28, 2023

beorn7 commented Jan 9, 2024

beorn7 commented Jan 9, 2024

beorn7 left a comment

beorn7 Jan 9, 2024

ywwg Jan 11, 2024

ywwg commented Jan 11, 2024

ywwg commented Jan 11, 2024

beorn7 left a comment

beorn7 Jan 12, 2024

ywwg Jan 18, 2024

beorn7 left a comment

ywwg commented Jan 22, 2024

beorn7 commented Jan 23, 2024

UTF-8 support in validation, and some parsers and formatters #537

UTF-8 support in validation, and some parsers and formatters #537

Conversation

ywwg commented Nov 29, 2023 • edited

beorn7 commented Dec 5, 2023

beorn7 left a comment

Choose a reason for hiding this comment

beorn7 commented Dec 13, 2023

ywwg commented Dec 14, 2023

beorn7 commented Dec 14, 2023

ywwg commented Dec 15, 2023

ywwg commented Dec 15, 2023

ywwg commented Dec 15, 2023

beorn7 commented Dec 15, 2023

ywwg commented Dec 15, 2023

ywwg commented Dec 15, 2023

beorn7 commented Dec 19, 2023

ywwg commented Dec 28, 2023

ywwg commented Dec 28, 2023

beorn7 commented Jan 9, 2024

beorn7 commented Jan 9, 2024

beorn7 left a comment

Choose a reason for hiding this comment

beorn7 Jan 9, 2024

Choose a reason for hiding this comment

ywwg Jan 11, 2024

Choose a reason for hiding this comment

ywwg commented Jan 11, 2024

ywwg commented Jan 11, 2024

beorn7 left a comment

Choose a reason for hiding this comment

beorn7 Jan 12, 2024

Choose a reason for hiding this comment

ywwg Jan 18, 2024

Choose a reason for hiding this comment

beorn7 left a comment

Choose a reason for hiding this comment

ywwg commented Jan 22, 2024

beorn7 commented Jan 23, 2024

ywwg commented Nov 29, 2023 •

edited