-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support non-ASCII characters inside inline i
clause
#80
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This PR looks good to me except the builtin case folding calls.
if (explicitCaseFold) { | ||
const char = String.fromCodePoint(codePoint); | ||
const upperCaseChar = char.toLowerCase().toUpperCase(); | ||
const lowerCaseChar = char.toUpperCase().toLowerCase(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling String#toUpperCase
will introduce Node.js version dependent behaviour: For example, when regenerate transpiles /(?:i\u{1e900})/u
, it will yield different result on different Node.js versions:
// Node.js 4
/\u{1e900}/u
// Node.js 12
/[\u{1e900}\u{1e922}]/u
because the Adlam Capital Letter Alif 𞤀 is introduced in Unicode 9.0, which is not supported by Node.js 4.
To properly support non-ASCII case folding we should create a simple case folding map as mentioned in
regexpu-core/scripts/iu-mappings.js
Lines 64 to 81 in d5f4abe
// From <http://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt>: | |
// | |
// The status field is: | |
// C: common case folding, common mappings shared by both simple and full | |
// mappings. | |
// F: full case folding, mappings that cause strings to grow in length. Multiple | |
// characters are separated by spaces. | |
// S: simple case folding, mappings to single characters where different from F. | |
// T: special case for uppercase I and dotted uppercase I | |
// - For non-Turkic languages, this mapping is normally not used. | |
// - For Turkic languages (tr, az), this mapping can be used instead of the | |
// normal mapping for these characters. Note that the Turkic mappings do | |
// not maintain canonical equivalence without additional processing. | |
// See the discussions of case mapping in the Unicode Standard for more | |
// information. | |
// | |
// Usage: | |
// A. To do a simple case folding, use the mappings with status C + S. |
Currently we only create iu-mappings
data because we don't really expand the i
case. This is no longer true when we have to support the modifiers.
if (!config.flags.unicode && !config.flags.unicodeSets) return false; | ||
if (!config.transform.unicodeFlag && !config.transform.modifiers) return false; | ||
return Boolean(config.modifiersData.i || config.flags.ignoreCase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!config.flags.unicode && !config.flags.unicodeSets) return false; | |
if (!config.transform.unicodeFlag && !config.transform.modifiers) return false; | |
return Boolean(config.modifiersData.i || config.flags.ignoreCase); | |
if ((config.flags.unicode || config.flags.unicodeSets) && config.modifiersData.i === true && config.transform.modifiers) return true; | |
if (!config.transform.unicodeFlag) return false; | |
return Boolean(config.flags.ignoreCase); |
I think it suffices to check the modifier when either u
or v
flag is enabled.
'pattern': '(?i:[Жщ])', | ||
'options': { modifiers: 'transform' }, | ||
'expected': '(?:[\\u0416\\u0429\\u0436\\u0449])', | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case for (?i:\u{10570})
? It should return (?:[\u{10570}\u{10597}])
.
The Node.js 6 compat test should catch the aforementioned issue, which we should avoid.
Okay, so after perusing the code a bit longer, I believe I've finally figured it out. This fixes #79
The
configNeedCaseFoldUnicode
function, which indicates whether to take into account the implicit case folding of a string happening while setting bothi
andu
flags, used to ignorei
modifier being removed in a rewritten expression.Inside the
caseFold
function, uppercase and lowercase versions of a symbol were only included if the symbol lied inside the ASCII range. Now, any symbol that case folds uniquely is processed properly.Inside the
computeCharacterClass
function,data
wasn't marked as transformed, despite having been augmented with case folded symbols, and therefore didn't alter the resulting expression.At last, I renamed
configNeedCaseFoldUnicode
toconfigNeedAccountForImplicitCaseFold
, as the function indicates if weneed to take the
iu
implicit case folding into account, andconfigNeedCaseFoldAscii
toconfigNeedExplicitCaseFold
, as the characters affected by it are no longer the ASCII ones.Also, I've written several simple tests.
What do you think?