LocaleID's with Cyrl/Latn script designators are designed not in accordance with documentation #1

shengchl · 2020-07-09T15:40:02Z

Hey there!

First of all, thanks for such a useful project. These tiny tools make work with old APIs much less frustrating!

I must admit I haven't managed to use this package in my project yet, so this issue might be irrelevant.

I checked the use of Cyrl/Latn script designators in LocaleID.swift and noticed that an underscore is used instead of a hyphen to separate the language designator from the script designator:

case sr_Cyrl
case sr_Cyrl_BA
case sr_Latn
case sr_Latn_BA

while in Internationalization and Localization Guide - Appendix B: Language and Locale IDs - Locale IDs there is the following schema for IDs featuring a script designator:

[language designator]-[script designator] //az-Arab, zh-Hans
[language designator]-[script designator]_[region designator] //zh-Hans_HK

Notice the use of hyphen to separate language and script designators.
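To make the schema concrete, here is a minimal sketch that assembles an identifier from its designators. The function name makeLocaleID is hypothetical and is not part of LocaleID.swift; it only follows the separator rules quoted from the guide above.

```swift
// Hypothetical helper, not part of LocaleID.swift: joins the designators
// following the quoted schema (hyphen before script, underscore before region).
func makeLocaleID(language: String, script: String? = nil, region: String? = nil) -> String {
    var id = language
    if let script = script {
        id += "-" + script
    }
    if let region = region {
        id += "_" + region
    }
    return id
}

print(makeLocaleID(language: "zh", script: "Hans"))               // zh-Hans
print(makeLocaleID(language: "zh", script: "Hans", region: "HK")) // zh-Hans_HK
print(makeLocaleID(language: "sr", script: "Cyrl", region: "BA")) // sr-Cyrl_BA
```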

I believe it is possible that, for the specific case of a locale with a script designator, this package will not produce a valid instance.

I may test this case in the future and provide a fix, but I cannot promise that for now.

Again, thanks for this tiny yet very useful package!


vincentneo · 2020-07-11T05:25:41Z

Hi!
This is very interesting and new to me actually, so apologies for any mistakes.
I will take an in-depth look at this when I have more time.

shengchl · 2020-07-14T11:45:40Z

Ok, so this little designator turned out to be a rabbit hole, and at this point I have spent too much time on it to continue the research. I don't really know if this is useful for day-to-day development, but knowing the standards may be useful if you need to tailor the localization.

The short answer is that it does not matter which separator precedes the script designator, as Locale will parse the identifier anyway and the resulting Locale instance will have a proper scriptCode. However, the scriptCode will be suppressed in accordance with the BCP 47 recommendations for languages that have an unambiguous script convention (e.g. even if we pass "en-Latn" to the initializer, Latn is excluded from the instance). The proper separator is indeed the hyphen, as can be seen in Language tags in HTML and XML, which may be useful to know for contexts besides Apple platforms.
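One way to check this on your own machine is a small playground-style sketch like the one below. It uses Foundation's Locale with the languageCode, scriptCode and regionCode properties discussed in this thread (newer SDKs mark them as deprecated). The helper name components(of:) is made up for illustration, and no expected script values are listed because they may differ between OS versions.

```swift
import Foundation

// Hypothetical helper: collects the components Locale reports for an identifier.
// No expected script values are asserted here; results may vary by OS version.
func components(of identifier: String) -> (language: String?, script: String?, region: String?) {
    let locale = Locale(identifier: identifier)
    return (locale.languageCode, locale.scriptCode, locale.regionCode)
}

// Compare underscore and hyphen forms, plus the "en-Latn" case mentioned above.
for id in ["sr_Latn_BA", "sr-Latn_BA", "en-Latn", "en"] {
    let parts = components(of: id)
    print(id, "->", parts.language ?? "nil", parts.script ?? "nil", parts.region ?? "nil")
}
```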

vincentneo · 2020-07-14T14:55:09Z

This sounds rather interesting.
Does that mean that, in essence, "en-Latn" is more or less redundant, and that everyone should just use "en"?

shengchl · 2020-07-15T13:00:48Z

In most cases, yes, and you may notice that in any of the "en_..." locale IDs such as en_US. They don't use a script subtag, and it will not be assigned to the scriptCode property even if it's present in the ID (e.g. "en-Latn-US"). The script subtag is not suppressed if it's not the standard script of the language. Continuing the 'en' example, you can define 'en-Latg' or 'en-Latf' locales, which use the Gaelic and Fraktur variants of the Latin script. I don't think those locales are used in a digital context, so this is a synthetic example. A more real-world one is the Braille script, which can be used in certain contexts with specialized displays. The locale for that script in English would be "en-Brai". You can see the full list of ISO 15924 scripts on this Wiki page.

You can check all the definitions of the languages and scripts in Language subtag Registry. For example, these are the entries for English language and Latin script:

Type: language
Subtag: en
Description: English
Added: 2005-10-16
Suppress-Script: Latn

Type: script
Subtag: Latn
Description: Latin
Added: 2005-10-16

You may notice in the document that for languages that need precise specification of the script there is no Suppress-Script field, e.g. for Chinese you will often specify the Hanzi Traditional or Simplified scripts (Hant/Hans):

Type: language
Subtag: zh
Description: Chinese
Added: 2005-10-16
Scope: macrolanguage
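As a rough illustration of how these records can be read programmatically, the sketch below parses the "Field: value" lines of a single record and looks for a Suppress-Script field. The function names parseRecord and suppressedScript are hypothetical, and the record text is copied from the entries quoted above rather than fetched from the registry.

```swift
// Parses one registry record ("Field: value" per line) into a list of pairs.
// A list is used instead of a dictionary because fields such as Description may repeat.
func parseRecord(_ text: String) -> [(field: String, value: String)] {
    var pairs: [(field: String, value: String)] = []
    for line in text.split(separator: "\n") {
        let parts = line.split(separator: ":", maxSplits: 1)
        guard parts.count == 2 else { continue }
        let value = parts[1].drop(while: { $0 == " " })
        pairs.append((field: String(parts[0]), value: String(value)))
    }
    return pairs
}

// Returns the Suppress-Script value of a record, or nil when the field is absent.
func suppressedScript(in record: String) -> String? {
    return parseRecord(record).first(where: { $0.field == "Suppress-Script" })?.value
}

let english = """
Type: language
Subtag: en
Description: English
Added: 2005-10-16
Suppress-Script: Latn
"""

let chinese = """
Type: language
Subtag: zh
Description: Chinese
Added: 2005-10-16
Scope: macrolanguage
"""

print(suppressedScript(in: english) ?? "none") // Latn
print(suppressedScript(in: chinese) ?? "none") // none
```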

The 'macrolanguage' field defines Chinese as the 'main' language of a broad family, which allows for further specification, such as with Cantonese, which can be used as a sub-language (e.g. "zh-yue"):

Type: extlang
Subtag: yue
Description: Yue Chinese
Description: Cantonese
Added: 2009-07-29
Preferred-Value: yue
Prefix: zh
Macrolanguage: zh

or as a language by itself (which is the 'preferred' way of using it):

Type: language
Subtag: yue
Description: Yue Chinese
Description: Cantonese
Added: 2009-07-29
Macrolanguage: zh
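The relationship between the two records can be sketched as a small rewrite step: when a tag starts with the prefix followed by the extlang subtag, the preferred form drops the prefix. The table and the function name preferredForm below are hypothetical and hand-written; only the "yue" entry comes from the record quoted above, and the sketch assumes lowercase language subtags.

```swift
// Hypothetical, hand-written table: extlang subtag -> required prefix.
// Only the "yue" entry comes from the registry record quoted above.
let extlangPrefixes: [String: String] = ["yue": "zh"]

// Rewrites "prefix-extlang[-rest]" to "extlang[-rest]" when the pair is in the table;
// any other tag is returned unchanged.
func preferredForm(of tag: String) -> String {
    var subtags = tag.split(separator: "-").map { String($0) }
    guard subtags.count >= 2,
          let prefix = extlangPrefixes[subtags[1]],
          prefix == subtags[0] else {
        return tag
    }
    subtags.removeFirst()
    return subtags.joined(separator: "-")
}

print(preferredForm(of: "zh-yue"))    // yue
print(preferredForm(of: "zh-yue-HK")) // yue-HK
print(preferredForm(of: "zh-Hans"))   // zh-Hans
```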

What in my opinion is weird (and most likely a bug or a bad implementation) is that Apple's Locale implementation does not initialize the scriptCode property in those 'default script' cases. For example, Hans is the default for Mandarin, so if we explicitly initialize a locale with "zh-Hans_CN" the scriptCode is nil, while it is not nil for the same locale when it is inferred from device preferences. And anyway, this goes somewhat against the standard, as the standard does not define default scripts for regions, while Apple assumes precisely that. In my opinion there is no rationale for not initializing scriptCode: it does not take much space, and it can be useful in some cases.

vincentneo · 2020-07-15T14:39:30Z

You can check all the definitions of the languages and scripts in Language subtag Registry

I kinda think that it would be nice if some of the data there is added here as documentation.

this goes somewhat against the standard, as it does not define default scripts for regions, while apple assumes precisely that

I assume you would be referring to this effect, where:
Locale(identifier: "sr-Latn_BA") would have scriptCode as "Latn" and regionCode as "BA"
Locale(identifier: "sr-Cyrl_BA") would have scriptCode as nil and regionCode as "BA", because it assumes "Cyrl" as a default script?

This whole Locale thing really surprised me with how intricate some of these standards are...

Regardless, I think I will still change the underscores, whether or not the current form works, as hyphens better represent the standards.

Thanks for spending the time to type out such a long response. Would not have learnt so much about locales and stuff without these message exchanges.

shengchl · 2020-07-16T16:35:04Z

I kinda think that it would be nice if some of the data there is added here as documentation.

I believe it's possible to write a short explanation of the basic idea behind BCP 47 based on my post above and the linked documents. There is one problem, though: I will strive for a detailed review and will probably not succeed in keeping the message actually short. We can continue this discussion here and later refine it into something suitable for the readme. Maybe at some point we will have enough information for a blog post. At this point I am very keen to write a humorous post about translating an Xcode project into Sumerian with cuneiform script. Such a locale can easily be constructed from the sux ISO 639-2 code and the Xsux ISO 15924 script code, giving the "sux-Xsux" ID, which is recognised by Xcode and can be explicitly assigned to the app's current locale to fetch localization strings. The little problem is that I need some Sumerian text to replace the "Hello, World!" example, as I am not sure I will be able to find the word for "world" in Sumerian (or whether they even had such a concept). Maybe "Hail the Emperor" would be easier to find.

I believe such a blog post can be quite an entertaining way of explaining the mechanism of providing custom locales, which may be useful even these days. For example, it seems like the suppression of Cantonese as a separate language is becoming even more pronounced nowadays (e.g. see this thread): there is no Cantonese option in the iOS language settings, and zh_HK (Chinese with traditional Hanzi script) is used as the locale in Hong Kong even though yue_HK is definitely more suitable there. Developers can resist this trend by providing a custom locale selector with a Cantonese locale as an option. Even if this makes little difference in a written context, it could be a nice gesture towards customers, and even a political signal. LocaleComplete can be used to make supporting custom locales easier.

Also, speaking of how deep this rabbit hole is, we have only discussed BCP 47 language tags so far. Things become more interesting when we take Unicode into account. The language tags and Unicode interplay in a delicate manner, as described in 3.3 BCP 47 Conformance of UNICODE LOCALE DATA MARKUP LANGUAGE (LDML): there are two BCP 47 extensions (-u and -t) managed by Unicode, and the two standards are designed to be compatible to some degree, e.g. a string can be both a valid BCP 47 language tag and a Unicode BCP 47 locale identifier. But I don't understand that stuff well enough yet. One thing is that I believe Xcode can understand any BCP 47 language tag, while CFLocale is designed to canonicalize the ID, so there can be a situation where you add a valid .lproj folder to Xcode and it infers the locale, but when you pass the same ID to the Locale initializer it transforms it into a different ID, so that the .lproj is not picked up. That said, the topic is really broad.

shengchl · 2020-07-16T17:10:15Z

I assume you would be referring to this effect

Yes, precisely that. Even if the standard calls for eliminating the script code from the ID, I don't think there is a reason not to assign it to the scriptCode property.

Regardless, I think I will still change the underscores, whether or not the current form works, as hyphens better represent the standards.

Great! I think this is about the good practice of following standards, which may play an important role in the future in unrelated contexts.

Thanks for spending the time to type out such a long response. Would not have learnt so much about locales and stuff without these message exchanges.

Thanks for spending time on this project as well!

P.S. I believe there is a major point of confusion regarding the region code: in the context of Apple platforms it can be part of the language ID (e.g. zh-TW), and it can be the region part of the locale ID, separated by an underscore (e.g. zh-TW_TW). It seems Locale is smart about this and drops one in favour of the other, but this just adds confusion... (or maybe I misunderstand something)

vincentneo · 2020-07-19T09:11:29Z

Xcode project to a Sumerian language with cuneiform script

Sounds like fun.

Great! I think this is about following a good practice of following standards which may play important role in the future in unrelated contexts.

Do note that while I can change the string of each language tag, I cannot change the LocaleID enum case names, as Swift does not allow '-' in an identifier, it seems. (I actually did not know about that until now)
I actually got the codes generated from some playground code that I've written. Time to do some modifications... The code is very messy and highly inefficient, which is the reason why I did not open-source it.
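A minimal sketch of that split between case names and tag strings is shown below. The enum name ExampleLocaleID and its raw values are assumptions made for illustration; they are not the package's actual code, only one way the case names could keep underscores while the strings follow the documented schema.

```swift
// Illustrative only: the case names mirror those quoted earlier in this thread,
// but the raw values are an assumption about how the change could look.
enum ExampleLocaleID: String, CaseIterable {
    case sr_Cyrl = "sr-Cyrl"
    case sr_Cyrl_BA = "sr-Cyrl_BA"
    case sr_Latn = "sr-Latn"
    case sr_Latn_BA = "sr-Latn_BA"
}

// The case name stays a valid Swift identifier; the raw value carries the hyphen.
for id in ExampleLocaleID.allCases {
    print("\(id) -> \(id.rawValue)")
}
```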

The language tags and unicode interplay in some delicate manner as described in 3.3 BCP 47 Conformance of UNICODE LOCALE DATA MARKUP LANGUAGE (LDML),

I had a read of the link that you gave, and I noticed that, in contrast to the BCP 47 identifiers, the so-called Unicode CLDR locale identifier seems to replace all '-' with '_'. That sounds like the whole problem again... :(
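If only the separator differs, switching between the two spellings is a one-line mapping. The sketch below uses the hypothetical function names hyphenated and underscored, and it swaps separators only; a complete conversion between Unicode locale identifiers and BCP 47 tags involves more rules than this.

```swift
// Swaps separators only; a full conversion between Unicode locale identifiers
// and BCP 47 tags involves more rules than shown here.
func hyphenated(_ identifier: String) -> String {
    return String(identifier.map { (c: Character) -> Character in c == "_" ? "-" : c })
}

func underscored(_ identifier: String) -> String {
    return String(identifier.map { (c: Character) -> Character in c == "-" ? "_" : c })
}

print(hyphenated("sr_Latn_BA"))  // sr-Latn-BA
print(underscored("zh-Hans-HK")) // zh_Hans_HK
```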
