Editorial comments on character definitions #28

Open
r12a opened this issue Aug 9, 2018 · 3 comments
@r12a
Contributor

r12a commented Aug 9, 2018

  4. Characters
    https://w3c.github.io/bp-i18n-specdev/#characters

These are comments on the text recently added to the start of section 4.

[1] The first occurrence of 'character' is not highlighted in the same way as the other definitions.

[2] The first para, and probably all the rest, should be under "Choosing a definition of character" subsection.

[3] There's a conflation of 'glyph', 'grapheme', and 'user-perceived character' (UPC), which i think is incorrect. A given UPC can be represented by different glyphs, e.g. regular, italic, bold, alternative font, etc. Also, a single UPC can be represented by multiple glyphs.

[4] UPC is actually coterminous with the linguistic term 'grapheme', but graphemes are NOT 'visual units found in fonts and rendering software' - those are 'grapheme clusters' (an approximation to the concept of a grapheme expressed using rules defined by TUS).

[5] What's an 'individual rendering unit'?

[6] We should also mention the CSS term 'typographic character unit', see https://drafts.csswg.org/css-text-3/#characters.

[7] The following statement is incorrect:

    It shouldn't be possible to cursor into the "middle" of a grapheme or delete only a part of user-perceived character.

It is standard to backward-delete by code point, but to forward-delete by grapheme cluster.

[8] The text says:

    When referring to 'graphemes' in this document, we mean extended grapheme clusters (unless otherwise called out).

Please don't munge those two terms. A grapheme cluster is a mechanical approximation to a grapheme. (And note that they are defined separately in the Unicode glossary.)
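
The backward/forward deletion behaviour described in [7] can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the cluster detection is a deliberate simplification (a base code point plus any combining marks that follow it), not the full UAX #29 rules.

```python
import unicodedata
from typing import Tuple

def backward_delete(text: str, caret: int) -> Tuple[str, int]:
    """Backspace: remove the single code point before the caret."""
    if caret == 0:
        return text, caret
    return text[:caret - 1] + text[caret:], caret - 1

def forward_delete(text: str, caret: int) -> Tuple[str, int]:
    """Delete: remove the code point at the caret plus trailing combining marks.

    Simplified stand-in for a grapheme cluster (assumption, not UAX #29).
    """
    if caret >= len(text):
        return text, caret
    end = caret + 1
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:caret] + text[end:], caret

s = "xe\u0301y"  # x, e + U+0301 COMBINING ACUTE ACCENT, y
after_backspace, _ = backward_delete(s, 3)  # caret just after the accent
after_delete, _ = forward_delete(s, 1)      # caret just before the "e"
print(after_backspace)  # "xey": only the accent was removed
print(after_delete)     # "xy": base and accent removed together
```

A real editor would take its boundaries from a UAX #29 implementation rather than from the combining-class test used here.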

Looking at the section "Choosing a definition of character", i think it could do with some reordering. I'll submit a PR for that, because i think it will make it easier to integrate the text above.

@r12a
Contributor Author

r12a commented Aug 9, 2018

Here are definitions in the Unicode glossary. I usually find that these are pretty clear and reliable, and so worth relying on for our own needs.

Character https://www.unicode.org/glossary/#character

(1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

Character encoding form https://www.unicode.org/glossary/#character_encoding_form

Mapping from a character set definition to the actual code units used to represent the data.

Character set https://www.unicode.org/glossary/#character_set

A collection of elements used to represent textual information.

Code point https://www.unicode.org/glossary/#code_point

(1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.
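
As a small illustration of this definition, the Python sketch below checks whether an integer lies in the codespace and shows that a code point need not be an assigned character. The helper name `is_code_point` is hypothetical; `unicodedata.category` is from the standard library, and its results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode codespace

def is_code_point(value: int) -> bool:
    """True if value lies in the Unicode codespace 0..10FFFF (hex)."""
    return 0 <= value <= MAX_CODE_POINT

# A code point is not necessarily an assigned character:
assigned = unicodedata.category("A")        # "Lu" (uppercase letter)
noncharacter = unicodedata.category("\uFFFF")  # "Cn" (not assigned)
```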

Code unit https://www.unicode.org/glossary/#code_unit

The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
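
To show how the same text yields different code unit counts per encoding form, here is a minimal Python sketch. The function name `code_unit_counts` is made up for this example; the little-endian codec names are used only so that no byte order mark is counted.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and code units in each Unicode encoding form."""
    return {
        "code points": len(text),
        "UTF-8": len(text.encode("utf-8")),            # 8-bit units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit units
    }

# "a", U+00E9, U+20AC EURO SIGN, U+1F600 GRINNING FACE
counts = code_unit_counts("a\u00e9\u20ac\U0001F600")
print(counts)
# {'code points': 4, 'UTF-8': 10, 'UTF-16': 5, 'UTF-32': 4}
```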

Extended grapheme cluster https://www.unicode.org/glossary/#extended_grapheme_cluster

The text between extended grapheme cluster boundaries as specified by Unicode Standard Annex #29, "Unicode Text Segmentation." Abbreviated as EGC. (See definition D61 in Section 3.6, Combination.)

Glyph https://www.unicode.org/glossary/#glyph

(1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character. These glyphs are selected by a rendering engine during composition and layout processing. (See also character.)

Glyph image https://www.unicode.org/glossary/#glyph_image

The actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface.

Grapheme https://www.unicode.org/glossary/#grapheme

(1) A minimally distinctive unit of writing in the context of a particular writing system. For example, ‹b› and ‹d› are distinct graphemes in English writing systems because there exist distinct words like big and dig. Conversely, a lowercase italiform letter a and a lowercase Roman letter a are not distinct graphemes because no word is distinguished on the basis of these two different forms. (2) What a user thinks of as a character.

Grapheme cluster https://www.unicode.org/glossary/#grapheme_cluster

The text between grapheme cluster boundaries as specified by Unicode Standard Annex #29, "Unicode Text Segmentation." (See definition D60 in Section 3.6, Combination.) A grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.

User-perceived character https://www.unicode.org/glossary/#user_perceived_character

What everyone thinks of as a character in their script.
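
One way to see why a user-perceived character is not the same thing as a code point: the sketch below, using Python's standard `unicodedata.normalize`, shows the same "é" stored as one code point or as two, although a reader would presumably perceive a single character in both cases.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")   # e + combining acute
decomposed = unicodedata.normalize("NFD", "\u00e9")  # precomposed e-acute

print(len(composed))    # 1 code point (U+00E9)
print(len(decomposed))  # 2 code points (U+0065 U+0301)
```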

@r12a
Contributor Author

r12a commented Aug 9, 2018

CSS terms.

Typographic character unit https://drafts.csswg.org/css-text-3/#typographic-character-unit

For text layout, we will refer to the typographic character unit as the basic unit of text. Even within the realm of text layout, the relevant character unit depends on the operation. For example, line-breaking and letter-spacing will segment a sequence of Thai characters that include U+0E33 THAI CHARACTER SARA AM differently; or the behaviour of a conjunct consonant in a script such as Devanagari may depend on the font in use. So the typographic character represents a unit of the writing system— such as a Latin alphabetic letter (including its diacritics), Hangul syllable, Chinese ideographic character, Myanmar syllable cluster— that is indivisible with respect to a particular typographic operation (line-breaking, first-letter effects, tracking, justification, vertical arrangement, etc.).

Typographic letter unit (letter) https://drafts.csswg.org/css-text-3/#typographic-letter-unit

A typographic letter unit or letter for the purpose of this specification is a typographic character unit belonging to one of the Letter or Number general categories in Unicode. [UAX44] See Character Properties for how to determine the Unicode properties of a typographic character unit.
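
A rough Python sketch of this test is shown below. It assumes, as a simplification, that a unit's properties are taken from its first code point (the base); the CSS specification's Character Properties section is the authority on that step, and the function name `is_letter_unit` is hypothetical.

```python
import unicodedata

def is_letter_unit(unit: str) -> bool:
    """True if the unit's base is in a Letter (L*) or Number (N*) category.

    Assumption: the first code point stands in for the whole unit.
    """
    if not unit:
        return False
    return unicodedata.category(unit[0])[0] in ("L", "N")

print(is_letter_unit("e\u0301"))       # True: base "e" is Ll
print(is_letter_unit("7"))             # True: Nd
print(is_letter_unit("!"))             # False: Po
print(is_letter_unit("\u0e01\u0e33"))  # True: Thai KO KAI (Lo) + SARA AM
```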

@aphillips aphillips assigned r12a and unassigned r12a May 12, 2022
@aphillips
Contributor

I will review this in detail. I suspect this is done?
