Limit custom character class ranges to single scalars #422

natecook1000 · 2022-05-18T23:43:17Z

As shown by issue #401, the standard lexicographic comparison of characters yields unexpected results when matching against ranges in custom character classes. This change resolves that by matching only single-scalar characters within ranges and by allowing only single-scalar characters as range endpoints.
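To illustrate the problem outside the regex engine, here is a minimal sketch using only standard-library `Character` comparison. The names `keycapOne`, `digits`, `naiveMatch`, and `singleScalarMatch` are illustrative and do not come from this PR; the single-scalar check approximates the new rule rather than reproducing the actual implementation.

```swift
// Character comparison is lexicographic over the underlying scalars,
// so the keycap emoji sorts between the plain digits "1" and "2".
let keycapOne: Character = "1\u{FE0F}\u{20E3}"   // 1️⃣ is U+0031 U+FE0F U+20E3
let digits: ClosedRange<Character> = "1"..."2"

// A naive range check accepts the keycap, which is the surprising result.
let naiveMatch = digits.contains(keycapOne)

// Assumption: approximate the new rule by requiring exactly one scalar
// before consulting the range at all.
let singleScalarMatch =
  keycapOne.unicodeScalars.count == 1 && digits.contains(keycapOne)

print(naiveMatch, singleScalarMatch)   // prints "true false"
```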

Multi-scalar characters are still allowed within custom character classes, and meta characters and built-in character classes continue to function in the same way as before. With this change, we have the following behavior:

```
try /[1-2]/.wholeMatch(in: "1")     // "1"
try /[12]/.wholeMatch(in: "1")      // "1"
try /[1-2]/.wholeMatch(in: "1️⃣")    // nil
try /[12]/.wholeMatch(in: "1️⃣")     // nil

try /\d/.wholeMatch(in: "1️⃣")                      // "1️⃣"
try /\d/.asciiOnlyDigits().wholeMatch(in: "1️⃣")    // nil
try /[\d]/.wholeMatch(in: "1️⃣")                    // "1️⃣"
try /[\d]/.asciiOnlyDigits().wholeMatch(in: "1️⃣")  // nil

try /[🇦🇫-🇿🇼]/.wholeMatch(in: "Flags! 🇬🇭🇰🇷")       // error: invalid character class range
```

Character class ranges don't work well with multi-scalar inputs, in either the range or the matched character. This change limits range endpoints to single-scalar characters and matches only characters that are themselves a single scalar.

Fixes issue apple#407, which now displays this behavior:

```
try /[1-2]/.wholeMatch(in: "1️⃣") // nil
try /[12]/.wholeMatch(in: "1️⃣") // nil
try /(?U)[\d]/.wholeMatch(in: "1️⃣") // nil
```

This applies the current matching semantics for character classes, matching either characters or Unicode scalars depending on the current options.

natecook1000 · 2022-05-18T23:43:40Z

@swift-ci Please test

The prior implementation didn't make a lot of sense, and couldn't handle cases like `/(?i)[X-c]/`. This new approach uses simple case matching to test if the character is within the range, then tests if the uppercase or lowercase mappings are within the range. Fixes apple#395
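A rough sketch of the approach this commit message describes, written against plain `Unicode.Scalar` values rather than the engine's internals. The function name `matchesIgnoringCase` is hypothetical. The commit mentions simple case matching, whereas this sketch assumes the standard library's `uppercaseMapping` and `lowercaseMapping` properties, which perform full case mapping, and it ignores any mapping that expands to more than one scalar.

```swift
// Hypothetical helper: does `scalar` fall in `range`, either directly or
// after mapping it to lowercase or uppercase?
func matchesIgnoringCase(
  _ scalar: Unicode.Scalar,
  in range: ClosedRange<Unicode.Scalar>
) -> Bool {
  // First test the scalar as written.
  if range.contains(scalar) { return true }

  // Then test its lowercase and uppercase mappings.
  let mappings = [
    scalar.properties.lowercaseMapping,
    scalar.properties.uppercaseMapping,
  ]
  return mappings.contains { mapping in
    let scalars = mapping.unicodeScalars
    // Assumption: skip mappings that expand to several scalars.
    guard scalars.count == 1, let only = scalars.first else { return false }
    return range.contains(only)
  }
}

// The range from `/(?i)[X-c]/` spans U+0058 through U+0063.
let xToC: ClosedRange<Unicode.Scalar> = "X"..."c"
print(matchesIgnoringCase("x", in: xToC))   // true, via uppercase "X"
print(matchesIgnoringCase("C", in: xToC))   // true, via lowercase "c"
print(matchesIgnoringCase("d", in: xToC))   // false, neither "d" nor "D" is in range
```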

natecook1000 · 2022-05-19T04:36:44Z

@swift-ci Please test

milseman · 2022-05-19T13:59:03Z

Sources/_RegexParser/Regex/AST/Atom.swift

@@ -771,7 +771,7 @@ extension AST.Atom {
  /// range.
  public var isValidCharacterClassRangeBound: Bool {
    // If we have a literal character value for this, it can be used as a bound.
-    if literalCharacterValue != nil { return true }
+    if literalCharacterValue?.hasExactlyOneScalar == true { return true }


Does normalization affect this?

It does, for sure. It would help if we normalized all characters as we store them in the custom character class model.

Which normalization?

NFC would allow us to permit as many characters as possible to act as endpoints.
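As a small illustration of why NFC helps here, the sketch below uses Foundation's `precomposedStringWithCanonicalMapping` (which produces NFC) to show a character whose scalar count depends on normalization form. This is only a demonstration of the concept; the PR does not say how or where normalization would be performed.

```swift
import Foundation

// "é" written as a base letter plus a combining acute accent: two scalars.
let decomposed = "e\u{301}"

// NFC composes it into the single scalar U+00E9.
let composed = decomposed.precomposedStringWithCanonicalMapping

print(decomposed.unicodeScalars.count)   // 2
print(composed.unicodeScalars.count)     // 1

// Both spellings are the same Character under canonical equivalence, yet
// only the NFC form would pass a single-scalar endpoint check.
print(Character(decomposed) == Character(composed))   // true
```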

milseman · 2022-05-19T13:59:25Z

Sources/_RegexParser/Utility/Misc.swift

@@ -19,6 +19,13 @@ extension Substring {
  var string: String { String(self) }
 }

+extension Character {
+  /// Whether this character is made up of exactly one Unicode scalar value.
+  public var hasExactlyOneScalar: Bool {


Why is this public?

This was previously in the _StringProcessing module; we need it in _RegexParser for the compile-time validation.
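The body of `hasExactlyOneScalar` is cut off in the diff above, so the following is only a guess at one way such a property could be written. It uses the deliberately different, hypothetical name `isSingleScalarSketch` to avoid implying that this is the PR's code.

```swift
extension Character {
  /// Hypothetical stand-in for the property under review: true when the
  /// character consists of exactly one Unicode scalar.
  var isSingleScalarSketch: Bool {
    let scalars = unicodeScalars
    // A Character is never empty, so advancing once from startIndex is safe.
    return scalars.index(after: scalars.startIndex) == scalars.endIndex
  }
}

print(Character("a").isSingleScalarSketch)           // true
print(Character("\u{E9}").isSingleScalarSketch)      // true  (precomposed é)
print(Character("e\u{301}").isSingleScalarSketch)    // false (e + combining accent)
print(Character("🇬🇭").isSingleScalarSketch)          // false (two regional indicators)
```

Comparing indices avoids walking the whole scalar view, which `unicodeScalars.count == 1` would do for long grapheme clusters.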

milseman · 2022-05-19T14:01:13Z