legacy grapheme clusters vs extended grapheme clusters #1

Open
frivoal opened this issue Oct 20, 2015 · 4 comments

@frivoal

frivoal commented Oct 20, 2015

"Grapheme cluster" is often the appropriate way to define "character" in specifications (such as CSS) that care about what readers visually identify as a character.

Maybe the spec should point that out, with a link to the relevant part of Unicode (http://unicode.org/reports/tr29/, I presume). There is already a mention of that in the "Indexing strings" section, but not in the "Choosing a definition of 'character'" section, where it would be particularly relevant.

Also, providing a specific definition requires picking between "legacy grapheme clusters" and "extended grapheme clusters", and I am not sure how to do that. Guidance on this topic would be appreciated.

@r12a
Contributor

r12a commented Oct 20, 2015

Good points, Florian. I'll look at adding that information.

We usually recommend extended grapheme clusters only.

@frivoal
Author

frivoal commented Oct 20, 2015

That's typically been what I've guessed should be the correct answer, but without really knowing why. And this specification looks like a great place to enlighten people in my situation.

@aphillips
Contributor

Is this addressed by the introduction to section 4?

aphillips added a commit that referenced this issue Sep 30, 2022
merge w3c changes to my branch
@xfq
Member

xfq commented Dec 19, 2023

There is no mention of legacy grapheme clusters in specdev at the moment and I think this paragraph in UAX #29 answers Florian's question:

An extended grapheme cluster is the same as a legacy grapheme cluster, with the addition of some other characters. The continuing characters are extended to include all spacing combining marks, such as the spacing (but dependent) vowel signs in Indic scripts. For example, this includes U+093F ( ि ) DEVANAGARI VOWEL SIGN I. The extended grapheme clusters should be used in implementations in preference to legacy grapheme clusters, because they provide better results for Indic scripts such as Tamil or Devanagari in which editing by orthographic syllable is typically preferred. For scripts such as Thai, Lao, and certain other Southeast Asian scripts, editing by visual unit is typically preferred, so for those scripts the behavior of extended grapheme clusters is similar to (but not identical to) the behavior of legacy grapheme clusters.

IMHO this kind of detail should be mentioned by charmod, not in specdev.
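
For readers who want to see the difference described in the UAX #29 paragraph quoted above, here is a minimal Python sketch. It is not the UAX #29 algorithm: as a loudly stated simplification, it treats general categories Mn and Me as a stand-in for Grapheme_Extend and Mc as a stand-in for SpacingMark, and it ignores Hangul, emoji, Prepend, and the other rules. The function name `simple_clusters` is hypothetical and exists only for this illustration.

```python
import unicodedata


def simple_clusters(text, extended=True):
    """Group code points into approximate grapheme clusters.

    Assumption: Mn/Me approximate Grapheme_Extend and Mc approximates
    SpacingMark. Real UAX #29 segmentation has many more rules.
    """
    continuing = {"Mn", "Me"}      # marks that continue a legacy cluster
    if extended:
        continuing.add("Mc")       # extended clusters also absorb spacing marks
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in continuing:
            clusters[-1] += ch     # attach the mark to the preceding cluster
        else:
            clusters.append(ch)    # start a new cluster
    return clusters


# DEVANAGARI LETTER NA followed by U+093F DEVANAGARI VOWEL SIGN I,
# the spacing combining mark named in the quoted paragraph.
word = "\u0928\u093F"
legacy = simple_clusters(word, extended=False)
ext = simple_clusters(word, extended=True)
print(len(legacy), len(ext))
```

Under these assumptions the legacy mode splits the syllable into two units, while the extended mode keeps it as one, which matches the "editing by orthographic syllable" behavior the quoted paragraph describes for Devanagari. A nonspacing mark such as U+0301 attaches to its base in both modes.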
