LocaleID's with Cyrl/Latn script designators are designed not in accordance with documentation #1

Open
shengchl opened this issue Jul 9, 2020 · 8 comments

Comments

@shengchl

shengchl commented Jul 9, 2020

Hey there!

First of all, thanks for such a useful project. These tiny tools make work with old APIs much less frustrating!

I must admit I haven't managed to use this package in my project yet, so this issue might be irrelevant.

I checked the use of Cyrl/Latn script designators in LocaleID.swift and noticed that an underscore is used instead of a hyphen to separate language designator and script designator:

case sr_Cyrl
case sr_Cyrl_BA
case sr_Latn
case sr_Latn_BA

while in Internationalization and Localization Guide - Appendix B: Language and Locale IDs - Locale IDs there is the following schema for IDs featuring a script designator:

[language designator]-[script designator] //az-Arab, zh-Hans
[language designator]-[script designator]_[region designator] //zh-Hans_HK

Notice the use of hyphen to separate language and script designators.
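
A minimal sketch of the schema quoted above, assuming a hypothetical helper named `makeLocaleID` (it is not part of the package or of Foundation). It only assembles the string: a hyphen before the script designator, an underscore before the region designator.

```swift
// Hypothetical helper, not part of the package: assembles an identifier
// following the schema quoted from Apple's guide.
func makeLocaleID(language: String, script: String? = nil, region: String? = nil) -> String {
    var id = language
    if let script = script {
        id += "-" + script   // hyphen between language and script
    }
    if let region = region {
        id += "_" + region   // underscore before the region
    }
    return id
}

print(makeLocaleID(language: "az", script: "Arab"))               // az-Arab
print(makeLocaleID(language: "zh", script: "Hans", region: "HK")) // zh-Hans_HK
```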

I believe it is possible that, for the specific case of a locale with a script designator, this package will not produce a valid instance.

I may test this case in the future and provide a fix but cannot promise as of now.

Again, thanks for this tiny yet very useful package!

@vincentneo
Owner

Hi!
This is very interesting and new to me actually, so apologies for any mistakes.
I will take an in-depth look at this when I have more free time.

@shengchl
Author

Ok, so this little designator turned out to be a rabbit hole, and at this point I have spent too much time to continue this research. I don't really know if this is of any use for day-to-day development, but knowing the standards may be useful if you need to tailor the localization.

The short answer is that it does not matter which separator is used, as Locale will parse the identifier anyway and the resulting Locale instance will have a proper scriptCode. However, the scriptCode will be suppressed in accordance with the BCP47 recommendations for languages that have an unambiguous script convention (e.g. even if we pass "en-Latn" to the initializer, Latn is excluded from the instance). The proper separator is indeed the hyphen, as can be seen in Language tags in HTML and XML, which may be useful to know for contexts besides Apple platforms.
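
To illustrate why the choice of separator need not matter to a parser, here is a simplified sketch. `LocaleParts` and `parseLocaleID` are hypothetical names, and this is not how Foundation's Locale works internally: it only classifies subtags by shape (four letters for a script, two letters for a region) and does no script suppression.

```swift
// Hypothetical, simplified splitter: NOT Foundation's parsing logic, only an
// illustration that "-" and "_" can be treated alike.
struct LocaleParts: Equatable {
    var language: String
    var script: String?
    var region: String?
}

func parseLocaleID(_ id: String) -> LocaleParts? {
    let subtags = id.split(whereSeparator: { $0 == "-" || $0 == "_" }).map(String.init)
    guard let language = subtags.first else { return nil }
    var parts = LocaleParts(language: language.lowercased(), script: nil, region: nil)
    for subtag in subtags.dropFirst() {
        if subtag.count == 4, subtag.allSatisfy({ $0.isLetter }) {
            // Four letters: script subtag, conventionally written in title case.
            parts.script = subtag.prefix(1).uppercased() + subtag.dropFirst().lowercased()
        } else if subtag.count == 2, subtag.allSatisfy({ $0.isLetter }) {
            // Two letters: region subtag, conventionally upper case.
            parts.region = subtag.uppercased()
        }
    }
    return parts
}

// Both spellings yield the same parts.
print(parseLocaleID("sr_Cyrl_BA") == parseLocaleID("sr-Cyrl-BA")) // true
```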

@vincentneo
Owner

This sounds rather interesting.
Does that mean that in essence, "en-Latn" is more or less redundant, and that all should just use "en"?

@shengchl
Author

In most cases – yes, and you may notice that in any of the "en_..." locale ids such as en_US. They don't use a script subtag, and it will not be assigned to the scriptCode var even if it's present in the id (e.g. "en-Latn-US"). The script subtag is not suppressed if it's not the standard script of the language. Continuing the 'en' example, you can define 'en-Latg' or 'en-Latf' locales, which use sub-families of the Latin script. I don't think those locales are used in a digital context, so this is a synthetic example. A more real-world one is the Braille script, which can be used in certain contexts with specialized displays. The locale for that script in English would be "en-Brai". You can see the full list of ISO_15924 scripts on this Wiki page.
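
The suppression rule described above could be sketched roughly as follows. `suppressedScripts` and `languageTag` are hypothetical names, and the table holds only the `en` entry mentioned in this thread; a real table would come from the full registry.

```swift
// Illustrative subset of Suppress-Script data; only the "en" entry is taken
// from this thread. This is an assumption-laden sketch, not Foundation's logic.
let suppressedScripts: [String: String] = ["en": "Latn"]

// Hypothetical helper: drops the script subtag when it is the suppressed
// (default) script of the language, and keeps it otherwise.
func languageTag(language: String, script: String?) -> String {
    guard let script = script, suppressedScripts[language] != script else {
        return language
    }
    return language + "-" + script
}

print(languageTag(language: "en", script: "Latn")) // en
print(languageTag(language: "en", script: "Brai")) // en-Brai
print(languageTag(language: "zh", script: "Hans")) // zh-Hans
```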

You can check all the definitions of the languages and scripts in Language subtag Registry. For example, these are the entries for English language and Latin script:

Type: language
Subtag: en
Description: English
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: script
Subtag: Latn
Description: Latin
Added: 2005-10-16
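
Records like these are plain "Field: value" lines, so a small sketch can read one and check for Suppress-Script. `parseRecord` is a hypothetical helper; it handles a single record and ignores the folded continuation lines that the real registry file may contain.

```swift
// Hypothetical parser for one registry record ("Field: value" per line).
// Fields such as Description may repeat, so values are collected in arrays.
func parseRecord(_ text: String) -> [String: [String]] {
    var fields: [String: [String]] = [:]
    for line in text.split(separator: "\n") {
        guard let colon = line.firstIndex(of: ":") else { continue }
        let name = String(line[..<colon])
        // Everything after the first colon, minus leading spaces, is the value.
        let value = line[line.index(after: colon)...].drop(while: { $0 == " " })
        fields[name, default: []].append(String(value))
    }
    return fields
}

let english = parseRecord("""
Type: language
Subtag: en
Description: English
Added: 2005-10-16
Suppress-Script: Latn
""")
print(english["Suppress-Script"] ?? []) // ["Latn"]
```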

You may notice in the document that for languages that need precise specification of the script there is no Suppress-Script field, e.g. for Chinese you will often specify the Hanzi Traditional or Simplified scripts (Hant/Hans):

Type: language
Subtag: zh
Description: Chinese
Added: 2005-10-16
Scope: macrolanguage

The 'macrolanguage' field defines Chinese as the 'main' language of a broad family, which allows for further specification, such as with Cantonese, which can be used as a sub-language (e.g. "zh-yue"):

Type: extlang
Subtag: yue
Description: Yue Chinese
Description: Cantonese
Added: 2009-07-29
Preferred-Value: yue
Prefix: zh
Macrolanguage: zh

or as a language by itself (which is the 'preferred' way of using it):

Type: language
Subtag: yue
Description: Yue Chinese
Description: Cantonese
Added: 2009-07-29
Macrolanguage: zh

What in my opinion is weird (and most likely a bug or a bad implementation) is that Apple's Locale implementation does not initialize the scriptCode variable in those 'default script' cases. For example, Hans is the default for Mandarin, so if we explicitly initialize a locale with "zh-Hans_CN" the scriptCode is nil, while it is not nil for the same locale when it is inferred from device preferences. And anyway, this goes somewhat against the standard, as it does not define default scripts for regions, while Apple assumes precisely that. In my opinion there is no rationale for not initializing scriptCode: it does not take much space, and it can be useful in some cases.

@vincentneo
Owner

You can check all the definitions of the languages and scripts in Language subtag Registry

I kinda think that it would be nice if some of the data there is added here as documentation.

this goes somewhat against the standard, as it does not define default scripts for regions, while Apple assumes precisely that

I assume you would be referring to this effect, where:
Locale(identifier: "sr-Latn_BA") would have scriptCode as "Latn" and regionCode as "BA"
Locale(identifier: "sr-Cyrl_BA") would have scriptCode as nil and regionCode as "BA", because it assumes "Cyrl" as a default script?
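
One way to check this effect directly is to print what Locale reports for each identifier. The sketch below deliberately lists no expected values, since the results may differ between OS and Foundation versions; the `identifiers` list is just the set of examples mentioned in this thread.

```swift
import Foundation

// Prints what Locale reports for each identifier. Expected values are left
// out on purpose: they may vary by platform and Foundation version.
// scriptCode/regionCode may be flagged as deprecated on newer SDKs.
let identifiers = ["sr-Latn_BA", "sr-Cyrl_BA", "zh-Hans_CN", "en-Latn-US"]

let report = identifiers.map { id -> String in
    let locale = Locale(identifier: id)
    let script = locale.scriptCode ?? "nil"
    let region = locale.regionCode ?? "nil"
    return "\(id): scriptCode=\(script), regionCode=\(region)"
}
report.forEach { print($0) }
```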

This whole Locale thing really surprised me with how intricate some of these standards are...

Regardless, I think I would still change the underscores, regardless if it currently works or not, as it better represents the standards.

Thanks for spending the time to type out such a long response. Would not have learnt so much about locales and stuff without these message exchanges.

@shengchl
Author

I kinda think that it would be nice if some of the data there is added here as documentation.

I believe it's possible to write a short explanation of the basic idea behind BCP47 based on my post above and the linked documents. There is one problem though: I will strive for a detailed review and will probably not succeed in keeping the message actually short. We can continue this discussion here and later refine it into something suitable for the readme. Maybe at some point we will have enough information for a blog post. At this point I'm really eager to write a humorous post about translating an Xcode project to a Sumerian language with cuneiform script. Such a locale can easily be constructed from the sux ISO 639-2 code and the Xsux ISO 15924 script code, giving the id "sux-Xsux", which is recognised by Xcode and can be explicitly assigned to the app's current locale to fetch localization strings. The little problem is that I need some Sumerian text to replace the "Hello, World!" example, as I am not sure I will be able to find the word for world in Sumerian (or whether they even had such a concept). Maybe "Hail the Emperor" would be easier to find.
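
As a rough sketch of fetching strings for such a custom locale, one option is to load the matching .lproj folder as its own bundle. This is a different technique from assigning the app's current locale, and the helper name `localizedString(_:localeID:fallback:in:)` and the key "greeting" are made up for illustration.

```swift
import Foundation

// Hypothetical helper: looks up `key` in the .lproj folder named after
// `localeID` (for example "sux-Xsux.lproj") and returns `fallback` when the
// bundle has no such folder or the key is missing.
func localizedString(_ key: String, localeID: String, fallback: String,
                     in bundle: Bundle = .main) -> String {
    guard let path = bundle.path(forResource: localeID, ofType: "lproj"),
          let localeBundle = Bundle(path: path) else {
        return fallback
    }
    return localeBundle.localizedString(forKey: key, value: fallback, table: nil)
}

// "greeting" is a made-up key; without a sux-Xsux.lproj in the bundle
// this prints the fallback text.
print(localizedString("greeting", localeID: "sux-Xsux", fallback: "Hello, World!"))
```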

I believe such a blog post could be quite an entertaining way of explaining the mechanism of providing custom locales, which may be useful even these days. For example, it seems the suppression of Cantonese as a separate language is becoming even more pronounced nowadays (e.g. see this thread): there is no Cantonese option in the iOS language settings, and zh_HK (Chinese with traditional Hanzi script) is used as the locale in Hong Kong even though yue_HK is definitely more suitable there. Developers can resist this trend by providing a custom locale selector with a Cantonese locale as an option. Even if this makes little difference in a written context, it could be a nice gesture towards customers, and even a political signal. LocaleComplete can be used to make supporting custom locales easier.

Also, speaking of how deep this rabbit hole is, we have only discussed BCP language tags as of now. Things become more interesting when we take Unicode into account. The language tags and Unicode interplay in some delicate manner as described in 3.3 BCP 47 Conformance of UNICODE LOCALE DATA MARKUP LANGUAGE (LDML): there are two BCP 47 extensions (-u and -t) managed by Unicode, and the two standards are designed to be compatible to some degree, e.g. a string can be both a valid BCP47 language tag and a Unicode BCP 47 locale identifier. But I don't understand that stuff well enough yet. One thing is, I believe Xcode can understand any BCP47 language tag, while CFLocale is designed to canonicalize the id, so there can be a situation where you add a valid .lproj folder to Xcode and it infers the locale, but when you pass the same id to the Locale initializer it is transformed into a different id, so that .lproj is not picked up. That said, the topic is really broad.
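
A quick way to look for such mismatches is to compare a folder name with the identifier Foundation canonicalizes it to. The sketch below uses `Locale.canonicalIdentifier(from:)`; the `folderNames` list is only a set of examples from this thread, and no outputs are shown because they depend on the platform's locale data.

```swift
import Foundation

// Compares each identifier with its canonicalized form. Outputs are
// intentionally not listed: they depend on the platform's locale data.
let folderNames = ["sr-Cyrl_BA", "zh-Hans_CN", "zh-yue", "sux-Xsux"]

let comparisons = folderNames.map { name -> (name: String, canonical: String, matches: Bool) in
    let canonical = Locale.canonicalIdentifier(from: name)
    return (name, canonical, canonical == name)
}

for entry in comparisons {
    print("\(entry.name) -> \(entry.canonical) (unchanged: \(entry.matches))")
}
```

An entry reported as changed is a candidate for the situation described above, where the .lproj name and the Locale identifier no longer agree.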

@shengchl
Author

I assume you would be referring to this effect

Yes, precisely that. Even if the standard calls for eliminating the script code from the id, I don't think there is a reason not to assign it to the scriptCode property.

Regardless, I think I would still change the underscores, regardless if it currently works or not, as it better represents the standards.

Great! I think this is about following a good practice of following standards which may play important role in the future in unrelated contexts.

Thanks for spending the time to type out such a long response. Would not have learnt so much about locales and stuff without these message exchanges.

Thanks for spending time on this project as well!

P.S. I believe there is a major point of confusion regarding the region code: in the context of Apple platforms it can be part of the language ID (e.g. zh-TW), and it can also be the region part of the locale ID, separated by an underscore (e.g. zh-TW_TW). It seems Locale is smart about this and drops one in favour of the other, but this just adds confusion... (or maybe I misunderstand something)

@vincentneo
Owner

Xcode project to a Sumerian language with cuneiform script

Sounds like fun.

Great! I think this is about following a good practice of following standards which may play important role in the future in unrelated contexts.

Do note that while I can change the string of each language tag, I cannot change the LocaleID enum cases, as Swift as a language does not allow '-' in a variable name, it seems. (I actually did not know about that until now)
I actually got the codes generated from some playground code that I've written. Time to do some modifications... The code is very messy and highly inefficient, which is the reason why I did not open-source it.
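
A sketch of how that split could look: the case names keep the underscore, while the raw string values carry the documented separators. `LocaleIDSketch` is a hypothetical type for illustration, not the package's actual LocaleID definition.

```swift
// Hypothetical sketch, not the package's actual LocaleID enum:
// case names keep "_", raw values use "-" before the script and "_" before the region.
enum LocaleIDSketch: String, CaseIterable {
    case sr_Cyrl = "sr-Cyrl"
    case sr_Cyrl_BA = "sr-Cyrl_BA"
    case sr_Latn = "sr-Latn"
    case sr_Latn_BA = "sr-Latn_BA"
}

for id in LocaleIDSketch.allCases {
    print("\(id) -> \(id.rawValue)")
}
```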

The language tags and Unicode interplay in some delicate manner as described in 3.3 BCP 47 Conformance of UNICODE LOCALE DATA MARKUP LANGUAGE (LDML)

I had a read of the link that you gave, and I noticed that, in contrast to the BCP identifiers, the so-called Unicode CLDR locale identifier seems to replace all '-' with '_'. That sounds like the whole problem again... :(
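
If only the separator differs, switching between the two spellings is mechanical. The helpers below, `underscoreForm` and `hyphenForm`, are hypothetical and only swap separators; a full conversion between BCP 47 tags and CLDR identifiers may involve more than that.

```swift
// Hypothetical helpers that only swap separators between the two spellings.
func underscoreForm(_ tag: String) -> String {
    return String(tag.map { (c: Character) -> Character in c == "-" ? "_" : c })
}

func hyphenForm(_ id: String) -> String {
    return String(id.map { (c: Character) -> Character in c == "_" ? "-" : c })
}

print(underscoreForm("sr-Cyrl-BA")) // sr_Cyrl_BA
print(hyphenForm("zh_Hans_HK"))     // zh-Hans-HK
```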
