From 3b4110da4da0b4c4c99ff6cede4cef2c2a548758 Mon Sep 17 00:00:00 2001
From: Lee Byron
Date: Tue, 13 Apr 2021 02:28:14 -0700
Subject: [PATCH] RFC: Allow full unicode range

This spec text implements #687 (full context and details there) and also
introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with
   a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax
   (`\u{1F37A}`) has been adopted by many other languages and I propose
   GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
---
 spec/Appendix B -- Grammar Summary.md | 11 ++++++++--
 spec/Section 2 -- Language.md         | 31 ++++++++++++++++++++++-----
 2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/spec/Appendix B -- Grammar Summary.md b/spec/Appendix B -- Grammar Summary.md
index 01c900a3f..b365c573f 100644
--- a/spec/Appendix B -- Grammar Summary.md
+++ b/spec/Appendix B -- Grammar Summary.md
@@ -6,7 +6,7 @@ SourceCharacter ::
   - "U+0009"
   - "U+000A"
   - "U+000D"
-  - "U+0020–U+FFFF"
+  - "U+0020–U+10FFFF"
 
 
 ## Ignored Tokens
@@ -101,7 +101,14 @@ StringCharacter ::
   - `\u` EscapedUnicode
   - `\` EscapedCharacter
 
-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+  - HexDigit HexDigit HexDigit HexDigit
+  - `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+  - `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+  - `A` `B` `C` `D` `E` `F`
+  - `a` `b` `c` `d` `e` `f`
 
 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`
 
diff --git a/spec/Section 2 -- Language.md b/spec/Section 2 -- Language.md
index 18993170a..3db9d320f 100644
--- a/spec/Section 2 -- Language.md
+++ b/spec/Section 2 -- Language.md
@@ -50,7 +50,7 @@ SourceCharacter ::
   - "U+0009"
   - "U+000A"
   - "U+000D"
-  - "U+0020–U+FFFF"
+  - "U+0020–U+10FFFF"
 
 GraphQL documents are expressed as a sequence of
 [Unicode](https://unicode.org/standard/standard.html) code points (informally
@@ -809,7 +809,14 @@ StringCharacter ::
   - `\u` EscapedUnicode
   - `\` EscapedCharacter
 
-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+  - HexDigit HexDigit HexDigit HexDigit
+  - `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+  - `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+  - `A` `B` `C` `D` `E` `F`
+  - `a` `b` `c` `d` `e` `f`
 
 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`
 
@@ -893,7 +900,20 @@ StringValue :: `""`
 
 StringValue :: `"` StringCharacter+ `"`
 
-  * Return the sequence of all {StringCharacter} code points.
+  * Let {string} be the sequence of all {StringCharacter} code points.
+  * For each {codePoint} at {index} in {string}:
+    * If {codePoint} is >= 0xD800 and <= 0xDBFF (a [*High Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
+      * Let {lowPoint} be the code point at {index} + {1} in {string}.
+      * Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+      * Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} - 0xDC00) + 0x10000.
+      * Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
+    * Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+  * Return {string}.
+
+Note: {StringValue} should avoid encoding code points as surrogate pairs.
+While services must interpret them accordingly, a braced escape (for example
+`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
+[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).
 
 StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator
 
@@ -901,8 +921,9 @@ StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator
 
 StringCharacter :: `\u` EscapedUnicode
 
-  * Let {value} be the 16-bit hexadecimal value represented by the sequence of
-    hexadecimal digits within {EscapedUnicode}.
+  * Let {value} be the 21-bit hexadecimal value represented by the sequence of
+    {HexDigit} within {EscapedUnicode}.
+  * Assert {value} <= 0x10FFFF.
   * Return the code point {value}.
 
 StringCharacter :: `\` EscapedCharacter
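
Reviewer note (non-normative): the two semantic changes in this patch, the widened `EscapedUnicode` rule and the surrogate-pair handling in `StringValue`, can be sketched roughly as below. This is only an illustration under stated assumptions, not the reference implementation: the names `parse_escaped_unicode`, `combine_surrogates`, and `GraphQLSyntaxError` are hypothetical and do not come from the spec or from any GraphQL library, and the spec's "Assert" steps are modelled here as raised errors, following the commit message's statement that illegal pairs produce a parse error.

```python
from typing import List

HEX_DIGITS = set("0123456789ABCDEFabcdef")


class GraphQLSyntaxError(ValueError):
    """Hypothetical error type standing in for a parse error."""


def parse_escaped_unicode(body: str) -> int:
    """Return the code point for the text following `\\u` (hypothetical helper).

    `body` is either exactly four hex digits ("00e9") or a braced run of
    one or more hex digits ("{1F4A9}"), mirroring the EscapedUnicode rule.
    """
    if body.startswith("{") and body.endswith("}"):
        digits = body[1:-1]
        if not digits:
            raise GraphQLSyntaxError("braced escape needs at least one hex digit")
    else:
        digits = body
        if len(digits) != 4:
            raise GraphQLSyntaxError("fixed-width escape needs four hex digits")
    if any(ch not in HEX_DIGITS for ch in digits):
        raise GraphQLSyntaxError(f"invalid hex digits: {digits!r}")
    value = int(digits, 16)
    # "Assert {value} <= 0x10FFFF."
    if value > 0x10FFFF:
        raise GraphQLSyntaxError(f"code point out of range: {digits!r}")
    return value


def combine_surrogates(code_points: List[int]) -> List[int]:
    """Replace each high/low surrogate pair with the code point it encodes."""
    result: List[int] = []
    index = 0
    while index < len(code_points):
        code_point = code_points[index]
        if 0xD800 <= code_point <= 0xDBFF:  # High Surrogate
            has_next = index + 1 < len(code_points)
            if not has_next or not 0xDC00 <= code_points[index + 1] <= 0xDFFF:
                raise GraphQLSyntaxError("high surrogate without a low surrogate")
            low_point = code_points[index + 1]
            decoded = (code_point - 0xD800) * 0x400 + (low_point - 0xDC00) + 0x10000
            result.append(decoded)
            index += 2
        elif 0xDC00 <= code_point <= 0xDFFF:  # lone Low Surrogate
            raise GraphQLSyntaxError("low surrogate without a high surrogate")
        else:
            result.append(code_point)
            index += 1
    return result
```

Under these assumptions, `"\uD83D\uDCA9"` and `"\u{1F4A9}"` yield the same single code point: `combine_surrogates([parse_escaped_unicode("D83D"), parse_escaped_unicode("DCA9")])` and `[parse_escaped_unicode("{1F4A9}")]` both evaluate to `[0x1F4A9]`.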