Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-ASCII characters inside inline i clause #80

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

stulov
Copy link

@stulov stulov commented Jul 23, 2023

Okay, so after perusing the code a bit longer, I believe I've finally figured it out. This fixes #79

The configNeedCaseFoldUnicode function, which indicates whether to take into account the implicit case folding of a string happening while setting both i and u flags, used to ignore i modifier being removed in a rewritten expression.

Inside the caseFold function, uppercase and lowercase versions of a symbol were only included if the symbol lied inside the ASCII range. Now, any symbol that case folds uniquely is processed properly.

Inside the computeCharacterClass function, data wasn't marked as transformed, despite having been augmented with case folded symbols, and therefore didn't alter the resulting expression.

At last, I renamed configNeedCaseFoldUnicode to configNeedAccountForImplicitCaseFold, as the function indicates if we
need to take the iu implicit case folding into account, and configNeedCaseFoldAscii to configNeedExplicitCaseFold, as the characters affected by it are no longer the ASCII ones.

Also, I've written several simple tests.

What do you think?

Copy link
Collaborator

@JLHwung JLHwung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This PR looks good to me except the builtin case folding calls.

if (explicitCaseFold) {
const char = String.fromCodePoint(codePoint);
const upperCaseChar = char.toLowerCase().toUpperCase();
const lowerCaseChar = char.toUpperCase().toLowerCase();
Copy link
Collaborator

@JLHwung JLHwung Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling String#toUpperCase will introduce Node.js version dependent behaviour: For example, when regenerate transpiles /(?:i\u{1e900})/u, it will yield different result on different Node.js versions:

// Node.js 4
/\u{1e900}/u
// Node.js 12
/[\u{1e900}\u{1e922}]/u

because the Adlam Capital Letter Alif 𞤀 is introduced in Unicode 9.0, which is not supported by Node.js 4.

To properly support non-ASCII case folding we should create a simple case folding map as mentioned in

// From <http://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt>:
//
// The status field is:
// C: common case folding, common mappings shared by both simple and full
// mappings.
// F: full case folding, mappings that cause strings to grow in length. Multiple
// characters are separated by spaces.
// S: simple case folding, mappings to single characters where different from F.
// T: special case for uppercase I and dotted uppercase I
// - For non-Turkic languages, this mapping is normally not used.
// - For Turkic languages (tr, az), this mapping can be used instead of the
// normal mapping for these characters. Note that the Turkic mappings do
// not maintain canonical equivalence without additional processing.
// See the discussions of case mapping in the Unicode Standard for more
// information.
//
// Usage:
// A. To do a simple case folding, use the mappings with status C + S.

Currently we only create iu-mappings data because we don't really expand the i case. This is no longer true when we have to support the modifiers.

Comment on lines +140 to 142
if (!config.flags.unicode && !config.flags.unicodeSets) return false;
if (!config.transform.unicodeFlag && !config.transform.modifiers) return false;
return Boolean(config.modifiersData.i || config.flags.ignoreCase);
Copy link
Collaborator

@JLHwung JLHwung Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!config.flags.unicode && !config.flags.unicodeSets) return false;
if (!config.transform.unicodeFlag && !config.transform.modifiers) return false;
return Boolean(config.modifiersData.i || config.flags.ignoreCase);
if ((config.flags.unicode || config.flags.unicodeSets) && config.modifiersData.i === true && config.transform.modifiers) return true;
if (!config.transform.unicodeFlag) return false;
return Boolean(config.flags.ignoreCase);

I think it suffices to check the modifier when either u or v flag is enabled.

'pattern': '(?i:[Жщ])',
'options': { modifiers: 'transform' },
'expected': '(?:[\\u0416\\u0429\\u0436\\u0449])',
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test case for (?i:\u{10570})? It should return (?:[\u{10570}\u{10597}]).

The Node.js 6 compat test should catch the aforementioned issue, which we should avoid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inline IgnoreCase modifier does not work with non-ASCII
2 participants