RFC: Allow full unicode range
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow code points above 0xFFFF, up to 0x10FFFF.
2. Allow surrogate pairs within StringValue. Illegal pairs produce a parse error.
3. Introduce new syntax for a full-range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron committed Apr 15, 2021
1 parent d4777b4 commit 3b4110d
Showing 2 changed files with 35 additions and 7 deletions.
11 changes: 9 additions & 2 deletions spec/Appendix B -- Grammar Summary.md
@@ -6,7 +6,7 @@ SourceCharacter ::
- "U+0009"
- "U+000A"
- "U+000D"
- "U+0020–U+FFFF"
- "U+0020–U+10FFFF"


## Ignored Tokens
@@ -101,7 +101,14 @@ StringCharacter ::
- `\u` EscapedUnicode
- `\` EscapedCharacter

EscapedUnicode :: /[0-9A-Fa-f]{4}/
EscapedUnicode ::
- HexDigit HexDigit HexDigit HexDigit
- `{` HexDigit+ `}` "but only if <= 0x10FFFF"

HexDigit :: one of
- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
- `A` `B` `C` `D` `E` `F`
- `a` `b` `c` `d` `e` `f`

EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

31 changes: 26 additions & 5 deletions spec/Section 2 -- Language.md
@@ -50,7 +50,7 @@ SourceCharacter ::
- "U+0009"
- "U+000A"
- "U+000D"
- "U+0020–U+FFFF"
- "U+0020–U+10FFFF"

GraphQL documents are expressed as a sequence of
[Unicode](https://unicode.org/standard/standard.html) code points (informally
@@ -809,7 +809,14 @@ StringCharacter ::
- `\u` EscapedUnicode
- `\` EscapedCharacter

EscapedUnicode :: /[0-9A-Fa-f]{4}/
EscapedUnicode ::
- HexDigit HexDigit HexDigit HexDigit
- `{` HexDigit+ `}` "but only if <= 0x10FFFF"

HexDigit :: one of
- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
- `A` `B` `C` `D` `E` `F`
- `a` `b` `c` `d` `e` `f`

EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

@@ -893,16 +900,30 @@ StringValue :: `""`

StringValue :: `"` StringCharacter+ `"`

* Return the sequence of all {StringCharacter} code points.
* Let {string} be the sequence of all {StringCharacter} code points.
* For each {codePoint} at {index} in {string}:
* If {codePoint} is >= 0xD800 and <= 0xDBFF (a [*High Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
* Let {lowPoint} be the code point at {index} + {1} in {string}.
* Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
* Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} - 0xDC00) + 0x10000.
* Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
* Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
* Return {string}.

Note: {StringValue} should avoid encoding code points as surrogate pairs.
While services must interpret them accordingly, a braced escape (for example
`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).
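
For illustration only (not part of the spec text), the surrogate handling in the new {StringValue} algorithm can be sketched in Python. The function name `combine_surrogates` and the use of `ValueError` to stand in for a parse error are assumptions made for this sketch. It also treats a high surrogate at the very end of the string as an error, since no {lowPoint} exists to satisfy the assertion.

```python
from typing import List


def combine_surrogates(code_points: List[int]) -> List[int]:
    """Replace each surrogate pair with the code point it encodes.

    Hypothetical helper: raises ValueError where the spec algorithm's
    assertions would fail (an unpaired high or low surrogate).
    """
    result: List[int] = []
    index = 0
    while index < len(code_points):
        code_point = code_points[index]
        if 0xD800 <= code_point <= 0xDBFF:
            # High surrogate: the next code point must be a low surrogate.
            if index + 1 >= len(code_points):
                raise ValueError("high surrogate at end of string")
            low_point = code_points[index + 1]
            if not 0xDC00 <= low_point <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            decoded = (code_point - 0xD800) * 0x400 + (low_point - 0xDC00) + 0x10000
            result.append(decoded)
            index += 2  # both halves of the pair are consumed
        elif 0xDC00 <= code_point <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            raise ValueError("unpaired low surrogate")
        else:
            result.append(code_point)
            index += 1
    return result
```

With this sketch, the pair 0xD83D 0xDCA9 decodes to the single code point 0x1F4A9, the same value the braced escape `"\u{1F4A9}"` expresses directly.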

StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator

* Return the code point {SourceCharacter}.

StringCharacter :: `\u` EscapedUnicode

* Let {value} be the 16-bit hexadecimal value represented by the sequence of
hexadecimal digits within {EscapedUnicode}.
* Let {value} be the 21-bit hexadecimal value represented by the sequence of
{HexDigit} within {EscapedUnicode}.
* Assert {value} <= 0x10FFFF.
* Return the code point {value}.
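
As a rough sketch of how a lexer might read both forms of {EscapedUnicode} (not part of the spec text), the following Python assumes a hypothetical function `read_escaped_unicode` that is called with the index just past `\u` and returns the value together with the index of the next unread character. `ValueError` stands in for a parse error.

```python
from typing import Tuple

HEX_DIGITS = set("0123456789ABCDEFabcdef")


def read_escaped_unicode(text: str, start: int) -> Tuple[int, int]:
    """Read EscapedUnicode beginning at text[start] (just after the escape prefix).

    Hypothetical helper: returns (value, next_index) and raises ValueError
    on malformed or out-of-range escapes.
    """
    if start < len(text) and text[start] == "{":
        # Braced form: `{` HexDigit+ `}` but only if <= 0x10FFFF.
        end = text.find("}", start)
        digits = text[start + 1:end] if end != -1 else ""
        if not digits or any(ch not in HEX_DIGITS for ch in digits):
            raise ValueError("malformed braced unicode escape")
        value = int(digits, 16)
        if value > 0x10FFFF:
            raise ValueError("unicode escape exceeds 0x10FFFF")
        return value, end + 1

    # Fixed-width form: exactly four HexDigit.
    digits = text[start:start + 4]
    if len(digits) != 4 or any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError("malformed four-digit unicode escape")
    return int(digits, 16), start + 4
```

The four-digit form can still yield a surrogate value (0xD800 to 0xDFFF), which is why the {StringValue} algorithm pairs or rejects surrogates as a separate step after each {StringCharacter} has been evaluated.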

StringCharacter :: `\` EscapedCharacter