Support handling underscore as a Locale separator on the input #777

zbraniecki · 2023-04-20T18:41:16Z

We currently reject _ as a subtag separator when parsing a locale.

This has come up in unicode-org/icu4x#3336.

I'm questioning the value of rejecting _ as a subtag separator on the input in favor of following the Robustness Principle.

We already do not follow Unicode BCP 47 locale identifiers strictly: we accept non-canonical locales. For example, this works:

new Intl.Locale("EN-fr");
new Intl.DateTimeFormat("aR-Pl");

Both of those work despite not being canonical; we canonicalize them on output.
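A minimal sketch of what "canonicalize on output" looks like in practice, assuming an ECMA-402-conformant engine. The helper name `canonicalize` is hypothetical and exists only for illustration:

```javascript
// Hypothetical helper: parse a tag with Intl.Locale and return its
// canonical string form (language lowercased, region uppercased).
function canonicalize(tag) {
  return new Intl.Locale(tag).toString();
}

console.log(canonicalize("EN-fr")); // "en-FR"

// Intl.getCanonicalLocales applies the same canonicalization and
// returns an array of canonical tags.
console.log(Intl.getCanonicalLocales("aR-Pl")[0]); // "ar-PL"
```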

I suggest we extend support to:

new Intl.Locale("EN_fr");
new Intl.DateTimeFormat("aR_Pl");
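For comparison, a small sketch of the current behavior for these inputs, assuming an ECMA-402-conformant engine. `rejectionFor` is a hypothetical helper, not part of any API:

```javascript
// Hypothetical helper: returns the name of the error thrown when a
// constructor is called with the tag, or null if the tag is accepted.
function rejectionFor(Ctor, tag) {
  try {
    new Ctor(tag);
    return null;
  } catch (err) {
    return err.name;
  }
}

console.log(rejectionFor(Intl.Locale, "EN-fr")); // null (accepted)
console.log(rejectionFor(Intl.Locale, "EN_fr")); // "RangeError"
console.log(rejectionFor(Intl.DateTimeFormat, "aR_Pl")); // "RangeError"
```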


aphillips · 2023-04-20T19:43:30Z

Note that BCP47 is case insensitive, so the call out about that is incorrect: those are all valid tags.

I tend to agree with canonicalizing underscore to hyphen, since it can be confusing.

gibson042 · 2023-04-20T20:31:34Z

I am in general much more inclined towards making decisions based on Hyrum's Law than the Postel Principle, because the latter frequently leads to regret such as probably-permanent commitment to the bizarre behavior of Date parsing (ask me how I know). ECMA-402 has made it this far without BCP-47-incompatible Unicode locale identifiers, and I would see little value in pursuing a backwards incompatible syntax at this late stage.

ptomato · 2023-04-20T22:45:33Z

In GNOME's JS environment, we have a use case for accepting _. Information about the current locale comes from a platform API as there is no navigator object. The platform API comes from C and uses _ as a separator (e.g., en_CA) since that's the format accepted by libc's LC_ALL, LC_TIME, etc. environment variables. If you want to use this locale in Intl APIs, you have to transform the _ to a - yourself, which is easy to do incorrectly if you're not familiar with locales. (e.g., assuming the underscore only occurs at position 2)
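To illustrate the pitfall, here is a rough sketch comparing the "underscore at position 2" assumption with replacing every underscore. Both function names and the sample inputs are hypothetical and only cover identifiers that differ from BCP 47 in the separator alone:

```javascript
// Naive approach: assumes the separator is always the third character.
function naiveFix(tag) {
  return tag[2] === "_" ? tag.slice(0, 2) + "-" + tag.slice(3) : tag;
}

// Replaces every underscore, then lets Intl validate and canonicalize.
// Throws a RangeError if the result is still not a valid tag.
function underscoreToHyphen(tag) {
  return Intl.getCanonicalLocales(tag.replace(/_/g, "-"))[0];
}

console.log(naiveFix("en_CA"));      // "en-CA"
console.log(naiveFix("fil_PH"));     // "fil_PH" (3-letter language, unchanged)
console.log(naiveFix("zh_Hant_TW")); // "zh-Hant_TW" (second underscore missed)

console.log(underscoreToHyphen("fil_PH"));     // "fil-PH"
console.log(underscoreToHyphen("zh_Hant_TW")); // "zh-Hant-TW"
```

Inputs that carry more than a separator difference, such as a codeset suffix like `en_CA.UTF-8`, would still be rejected by `underscoreToHyphen`.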

sffc · 2023-04-20T23:22:57Z

My personal position started closer to where @gibson042 is coming from, but I found compelling the argument that handling _ is a well-defined operation that does not need to complicate a grammar. The only risk is that _ would start being a valid token in a future edition of BCP-47, which I see as unlikely; for example, something like en-GB_ENG where GB_ENG would be a region with subdivision. I therefore do not see risk in accepting _ as part of the grammar upon input in these strings.

sffc · 2023-04-20T23:25:12Z

CC @macchiati who (along with @aphillips) is editor of the BCP-47 standard and may have thoughts here. Should Intl accept _ in place of - in locale identifiers that are otherwise interpreted in BCP-47?

aphillips · 2023-04-21T09:09:37Z

@sffc There is no chance that underscore will ever be a valid anything in BCP47. The grammar is purposefully fixed, with pathways available for future expansion via extensions and a few reserved bits in the "normal" grammar.

I do think that the canonical form should never include underscore, but accepting underscores would potentially reduce some tripping hazard for users who, for some reason, expect underscore to work or who have a non-BCP47 interface that produces underscore. (@zbraniecki check with Prithvi and Abhijeet for implementation experience and details)

gibson042 · 2023-04-21T17:34:49Z

> handling _ is a well defined operation that does not need to complicate a grammar

I disagree that it would not complicate any grammar, because UTS 35 deviates from BCP 47 not just in allowing _ as a substitute for -, but also in allowing root as a special standalone language identifier (which would otherwise be syntactically invalid) and optionally allowing language identifiers to start with a script rather than a language (see BCP 47 Conformance). We'd also need to correct all of ECMA-402 to reference "Unicode locale identifiers" rather than "Unicode BCP 47 locale identifiers" and "language tags", because the latter two explicitly exclude identifiers using BCP 47-incompatible syntax such as underscores.
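As a rough illustration of the other two UTS 35 backwards-compatibility forms mentioned here, the sketch below checks how `Intl.Locale` treats them today, assuming an ECMA-402-conformant engine. `isAccepted` is a hypothetical helper and the sample tags are chosen only for illustration:

```javascript
// Hypothetical helper: true if Intl.Locale accepts the tag,
// false if it rejects it with a RangeError.
function isAccepted(tag) {
  try {
    new Intl.Locale(tag);
    return true;
  } catch (err) {
    if (err instanceof RangeError) return false;
    throw err;
  }
}

console.log(isAccepted("und"));     // true  (BCP 47 "undetermined" language)
console.log(isAccepted("root"));    // false (standalone "root")
console.log(isAccepted("Latn-US")); // false (identifier starting with a script)
console.log(isAccepted("en_US"));   // false (underscore separator)
```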

> accepting underscores would potentially reduce some tripping hazard for users who, for some reason, expect underscore to work or who have a non-BCP47 interface that produces underscore

Isn't that a good thing to encounter, since such identifiers are not valid for general interchange? This is in fact the biggest issue with the robustness principle—it turns deviations into undocumented or poorly documented shadow requirements that spread infectiously but unpredictably throughout an ecosystem.

sffc · 2023-11-16T22:48:50Z

TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2023-11-16.md#support-handling-underscore-as-a-locale-separator-on-the-input-777

Conclusion: the committee didn't feel that there was strong enough motivation to make this change at this time. @FrankYFTang also pointed out that it can be done in userland. If there is more evidence to back up making this change we are happy to reconsider it.

srl295 · 2023-11-17T15:17:03Z

> > handling _ is a well defined operation that does not need to complicate a grammar
>
> I disagree that it would not complicate any grammar, because UTS 35 deviates from BCP 47 not just in allowing _ as a substitute for -, but also in allowing root as a special standalone language identifier (which would otherwise be syntactically invalid) and optionally allowing language identifiers to start with a script rather than a language (see BCP 47 Conformance). We'd also need to correct all of ECMA-402 to reference "Unicode locale identifiers" rather than "Unicode BCP 47 locale identifiers" and "language tags", because the latter two explicitly exclude identifiers using BCP 47-incompatible syntax such as underscores.

I agree that underscores should NOT be supported. But I would disagree with the unqualified statement that UTS 35 deviates from BCP 47. "Unicode Language and Locale Identifiers" deviate from BCP 47. But UTS 35 does not propose that other processors of BCP 47 (such as ecma402) should also deviate. The introduction to the major section you link to states that "Unicode LDML uses stable identifiers based on BCP47", and the end of the section you link to states that:

> There are thus two subtypes of Unicode locale identifiers:
>
> - the term Unicode CLDR locale identifier applies where the backwards compatibility syntax is used.
> - the term Unicode BCP 47 locale identifier applies otherwise. A Unicode BCP 47 locale identifier is also a valid BCP 47 language tag.

Also note that CLDR has a ticket CLDR-15012 Move to BCP47 - CLDR considers the current identifiers to be based on BCP47.

> > accepting underscores would potentially reduce some tripping hazard for users who, for some reason, expect underscore to work or who have a non-BCP47 interface that produces underscore
>
> Isn't that a good thing to encounter, since such identifiers are not valid for general interchange? This is in fact the biggest issue with the robustness principle—it turns deviations into undocumented or poorly documented shadow requirements that spread infectiously but unpredictably throughout an ecosystem.

100% this. No BCP47 deviations. That will just hurt users.

@ptomato wrote:

> In GNOME's JS environment, we have a use case for accepting _. Information about the current locale comes from a platform API as there is no navigator object. The platform API comes from C and uses _ as a separator (e.g., en_CA) since that's the format accepted by libc's LC_ALL, LC_TIME, etc. environment variables. If you want to use this locale in Intl APIs, you have to transform the _ to a - yourself, which is easy to do incorrectly if you're not familiar with locales. (e.g., assuming the underscore only occurs at position 2)

Except it's actually much more complex than that. Saying you can transform _ to - yourself really hurts users here, because the POSIX locale IDs actually require a bit of processing to attempt to get right into ICU locales / Unicode locale identifiers, or into BCP 47. (Ask me how I know!) — you actually should be using something like the ICU code I linked to get an ICU locale, and then convert that into bcp47 using ICU in GNOME's JS environment. Recommending _ to - works in trivial cases, but does other users a disservice.
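To give a sense of the extra processing involved, here is a deliberately incomplete sketch of turning a POSIX-style locale ID (`language[_territory][.codeset][@modifier]`) into an `Intl.Locale`. It is not the ICU algorithm referred to above. The function name `posixToLocale`, the modifier table, and the choice of `und` as the fallback for `C`/`POSIX` are all assumptions made for illustration:

```javascript
// Assumption: a tiny, illustrative mapping from POSIX modifiers to
// ISO 15924 script codes. Real-world data is larger and messier.
const MODIFIER_TO_SCRIPT = new Map([
  ["latin", "Latn"],
  ["cyrillic", "Cyrl"],
  ["devanagari", "Deva"],
]);

// Hypothetical helper: best-effort conversion of a POSIX locale ID.
// Unknown modifiers are silently dropped, which loses information.
function posixToLocale(posixId, fallback = "und") {
  const match = /^([^.@]+)(?:\.[^@]*)?(?:@(.+))?$/.exec(posixId);
  if (!match) return new Intl.Locale(fallback);

  const [, base, modifier] = match;
  // "C" and "POSIX" are not language tags; the fallback is an assumption.
  if (base === "C" || base === "POSIX") return new Intl.Locale(fallback);

  const script = modifier
    ? MODIFIER_TO_SCRIPT.get(modifier.toLowerCase())
    : undefined;
  const tag = base.replace(/_/g, "-");
  // Throws a RangeError if the remaining tag is still not valid.
  return new Intl.Locale(tag, script ? { script } : undefined);
}

console.log(posixToLocale("en_CA.UTF-8").toString()); // "en-CA"
console.log(posixToLocale("sr_RS@latin").toString()); // "sr-Latn-RS"
console.log(posixToLocale("C").toString());           // "und"
```

Even this small sketch has to make judgment calls (what `C` means, which modifiers matter), which is why a plain `_` to `-` replacement only covers the trivial cases.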

zbraniecki added s: discuss Status: TG2 must discuss to move forward c: locale Component: locale identifiers labels Apr 20, 2023

sffc added this to Priority Issues in ECMA-402 Meeting Topics Apr 20, 2023

sffc moved this from Priority Issues to Previously Discussed in ECMA-402 Meeting Topics Nov 16, 2023

sffc added s: comment Status: more info is needed to move forward and removed s: discuss Status: TG2 must discuss to move forward labels Nov 16, 2023

sffc mentioned this issue Nov 16, 2023

Add support for Unicode BCP 47 locale identifiers unicode-org/icu4x#3336

Open
