[articles/idn-and-iri/index] Information in the multilingual web addresses article needs to be updated #564

Open
xfq opened this issue Dec 4, 2023 · 4 comments

Comments

@xfq
Member

xfq commented Dec 4, 2023

[source] (https://www.w3.org/International/articles/idn-and-iri/) [en]

Some information in this article needs to be updated, like:

  1. Only URI (RFC3986) and IRI (RFC3987) are mentioned in the article. We might want to add information about the WHATWG URL Standard.
  2. We should update the HTML 4.0 example to "HTML".
  3. We should update the links to the RFC specifications to point to https://www.rfc-editor.org/
  4. "top level domains" should be "top-level domains"

In this case, if we were to use percent-escaping to transform the (same) characters in the address so that they to conform to the URI requirements, we would base the escapes on the bytes that represent 引き割り.html in Shift-JIS.

  1. "they to conform to" above should be "they conform to".
  2. mod_fileiri looks unmaintained. Should we keep the reference to it?
  3. The reference to Internet Explorer and Netscape should probably be removed.
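
To illustrate the quoted passage, here is a minimal sketch (the variable names are only illustrative) that percent-escapes the same file name from its Shift-JIS bytes and from its UTF-8 bytes, using Python's standard `urllib.parse`. The two escaped strings differ, so the receiver has to know which encoding the escapes were based on.

```python
from urllib.parse import quote, unquote

filename = "引き割り.html"

# Percent-escape the bytes that represent the name in two different encodings.
escaped_sjis = quote(filename, encoding="shift_jis")
escaped_utf8 = quote(filename, encoding="utf-8")

print("Shift-JIS based:", escaped_sjis)
print("UTF-8 based:    ", escaped_utf8)

# Each form only decodes back to the original name if the same
# encoding is assumed on the receiving side.
roundtrip_sjis = unquote(escaped_sjis, encoding="shift_jis")
roundtrip_utf8 = unquote(escaped_utf8, encoding="utf-8")
```

This is probably worth keeping in mind when updating the example, since an IRI converted to a URI is escaped from UTF-8 bytes rather than from a legacy encoding.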

You can run a basic check to see whether IDNs work on your system using this simple test.

  1. ^ There should be a more up-to-date test.

Different browsers use different strategies to determine whether the URI should be shown in Unicode or punycode.

  1. Should "URI" in the sentence above be "IRI" instead?
  2. The handling of IDNs by different browsers is mentioned, but we should link to some more updated resources like https://chromium.googlesource.com/chromium/src/+/main/docs/idn.md and https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm
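
For reference, the two display forms being discussed can be sketched with Python's built-in `idna` codec. Note the assumptions: that codec implements IDNA2003 (RFC 3490), and it says nothing about the per-browser display policies described in the linked documents. The domain is just a well-known illustrative example.

```python
# The ASCII-compatible ("punycode") form that goes over the wire.
ascii_form = b"xn--bcher-kva.de"

# The Unicode form a browser may choose to display instead.
unicode_form = ascii_form.decode("idna")
print(unicode_form)

# Converting back yields the original ASCII-compatible form.
back_to_ascii = unicode_form.encode("idna")
print(back_to_ascii)
```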

There is a similar issue with the use of simplified vs. traditional characters in the Chinese Han script.

  1. This isn't a huge problem, because if a character isn't unified, most people who know Simplified Chinese or Traditional Chinese can easily see the difference. The bigger problems are things like Kangxi radicals (such as U+2F04 ⼄ and U+4E59 乙) and duplicate encoded characters (such as 㘽 U+363D and 㦳 U+39B3), because the glyphs are often the same. Also, some registries solve this by making the simplified and traditional characters equivalent (see 1 and 2).
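
A small sketch of the Kangxi radical case, using Python's standard `unicodedata` module (variable names are illustrative). The two code points are distinct characters that usually render identically, and compatibility normalization (NFKC) folds the radical into the unified ideograph. This only covers the radical example; it does not say anything about the duplicate encoded characters.

```python
import unicodedata

radical = "\u2F04"    # Kangxi radical
ideograph = "\u4E59"  # CJK unified ideograph

print(unicodedata.name(radical))
print(unicodedata.name(ideograph))

# The raw code points are different...
distinct = radical != ideograph

# ...but NFKC maps the radical to the unified ideograph.
folded_equal = unicodedata.normalize("NFKC", radical) == ideograph
print(distinct, folded_equal)
```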

There are some improvements needed to the specifications for IDN and IRIs, and these are currently being discussed. For example, there is a need to extend the range of Unicode characters that can be used in domain names to cover later versions of Unicode, and to allow combining characters at the end of labels in right to left scripts.

  1. What's the status of this? Is this essentially IDNA2008? ^
  2. "ICANN Guidelines for the Implementation of Internationalized Domain Names Version 2.1" should be updated. There is now a newer version.
  3. The link to "IDN and IRI test pages" has been moved.
  4. The link to "IDN-enabled TLDs supported by Mozilla.org" should be updated.
  5. It might be useful to add or link to related information about the differences between IDNA2003, IDNA2008, and UTS #46. For example, 2003 is locked to Unicode version 3.2, while 2008 supports code points that appear in new versions of Unicode; 2003 normalizes ß to ss while 2008 makes it a valid character.
  6. A link to UTS #46 should be added in the Further Reading section.
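
As a rough illustration of the ß difference mentioned above: Python's built-in `idna` codec follows IDNA2003 (RFC 3490), so it shows the older mapping behaviour. The domain name is made up for the example, and this sketch does not show IDNA2008 or UTS #46 processing, which the standard library does not implement.

```python
# IDNA2003 (as implemented by Python's built-in codec) maps ß to "ss"
# before encoding, so the result is plain ASCII with no "xn--" label.
name = "straße.de"
mapped = name.encode("idna")
print(mapped)

# The mapping is lossy: decoding does not bring the ß back.
decoded = mapped.decode("idna")
print(decoded)
```

Under IDNA2008, ß is a valid character in its own right, so the same name would instead be encoded as a distinct `xn--` label.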

Examples of registered IDNs

IDN and URI [PDF], Michel Suignard

Opera International Domain Name support

Safari International Domain Name support

  1. These four links are broken. ^

I can raise a PR to fix some of the issues above.

@r12a
Contributor

r12a commented Dec 5, 2023

And the article needs to be converted to the latest template, if I remember correctly.

@xfq
Member Author

xfq commented Dec 6, 2023

Looks like it already uses the latest template.

@xfq
Member Author

xfq commented Dec 6, 2023

https://www.w3.org/International/techniques/authoring-html#iris also needs to be updated, such as adding a link to the WHATWG URL Standard.

@xfq
Member Author

xfq commented Dec 8, 2023

Also, I wonder if it's useful to talk about the dot between labels.

For example, in .中国 domain names, U+002E FULL STOP [.], U+3002 IDEOGRAPHIC FULL STOP [。], and U+FF0E FULLWIDTH FULL STOP [．] are fully equivalent:

  • 互联网中心.中国
  • 互联网中心。中国
  • 互联网中心．中国

Entering all three URLs will take you to the same site.
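
A quick way to see the separator equivalence, sketched with Python's built-in `idna` codec (an IDNA2003 implementation, which treats these characters as label separators). The separators are written as escapes so the three strings cannot be confused visually. What a given browser or registry does is a separate question.

```python
names = [
    "互联网中心\u002E中国",  # FULL STOP
    "互联网中心\u3002中国",  # IDEOGRAPHIC FULL STOP
    "互联网中心\uFF0E中国",  # FULLWIDTH FULL STOP
]

# Each name is split into labels at the separator, and each label is
# converted to its ASCII-compatible form.
encoded = [n.encode("idna") for n in names]

for original, ascii_form in zip(names, encoded):
    print(repr(original), "->", ascii_form)

all_same = len(set(encoded)) == 1
print(all_same)
```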


2 participants