Support non-ASCII characters inside inline `i` clause #80

stulov · 2023-07-23T19:00:53Z

Okay, so after perusing the code a bit longer, I believe I've finally figured it out. This fixes #79.

The configNeedCaseFoldUnicode function indicates whether to take into account the implicit case folding that happens when both the i and u flags are set. It used to ignore the fact that the i modifier is removed from the rewritten expression.

Inside the caseFold function, the uppercase and lowercase versions of a symbol were only included if the symbol lay inside the ASCII range. Now, any symbol that case folds uniquely is processed properly.

Inside the computeCharacterClass function, data wasn't marked as transformed, despite having been augmented with case folded symbols, and therefore didn't alter the resulting expression.

Finally, I renamed configNeedCaseFoldUnicode to configNeedAccountForImplicitCaseFold, as the function indicates whether we need to take the implicit iu case folding into account, and configNeedCaseFoldAscii to configNeedExplicitCaseFold, as the characters affected by it are no longer only the ASCII ones.

Also, I've written several simple tests.

What do you think?

JLHwung

Thanks! This PR looks good to me except for the built-in case folding calls.

JLHwung · 2023-09-19T15:18:21Z

rewrite-pattern.js

+	if (explicitCaseFold) {
+		const char = String.fromCodePoint(codePoint);
+		const upperCaseChar = char.toLowerCase().toUpperCase();
+		const lowerCaseChar = char.toUpperCase().toLowerCase();


Calling String#toUpperCase introduces Node.js-version-dependent behaviour. For example, when regenerate transpiles /(?i:\u{1e900})/u, it yields different results on different Node.js versions:

// Node.js 4
/\u{1e900}/u
// Node.js 12
/[\u{1e900}\u{1e922}]/u

because the Adlam Capital Letter Alif 𞤀 was introduced in Unicode 9.0, which is not supported by Node.js 4.
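
To make the runtime dependence concrete, here is a minimal sketch that isolates the round-trip from the diff above. The helper name `runtimeCaseVariants` is hypothetical; only the two `String` round-trips come from the PR.

```javascript
// Hypothetical helper wrapping the round-trip from the diff above.
// The result depends on the Unicode tables built into the running engine.
function runtimeCaseVariants(codePoint) {
	const char = String.fromCodePoint(codePoint);
	const upper = char.toLowerCase().toUpperCase();
	const lower = char.toUpperCase().toLowerCase();
	// Collect distinct single-code-point variants, original included.
	const variants = new Set([codePoint]);
	for (const variant of [upper, lower]) {
		const points = Array.from(variant);
		// Skip multi-character results such as 'ß'.toUpperCase() === 'SS'.
		if (points.length === 1) variants.add(points[0].codePointAt(0));
	}
	return Array.from(variants).sort((a, b) => a - b);
}

// U+0416 (Ж) has had a lowercase partner since early Unicode versions.
console.log(runtimeCaseVariants(0x0416).map((cp) => cp.toString(16)));
// U+1E900 only gains a partner on engines shipping Unicode 9.0+ data.
console.log(runtimeCaseVariants(0x1E900).map((cp) => cp.toString(16)));
```

Because the second result changes with the engine, the transpiled pattern would change with it, which is the behaviour to avoid.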

To properly support non-ASCII case folding we should create a simple case folding map as mentioned in

regexpu-core/scripts/iu-mappings.js

Lines 64 to 81 in d5f4abe

// From <http://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt>:
//
// The status field is:
// C: common case folding, common mappings shared by both simple and full
// mappings.
// F: full case folding, mappings that cause strings to grow in length. Multiple
// characters are separated by spaces.
// S: simple case folding, mappings to single characters where different from F.
// T: special case for uppercase I and dotted uppercase I
// - For non-Turkic languages, this mapping is normally not used.
// - For Turkic languages (tr, az), this mapping can be used instead of the
// normal mapping for these characters. Note that the Turkic mappings do
// not maintain canonical equivalence without additional processing.
// See the discussions of case mapping in the Unicode Standard for more
// information.
//
// Usage:
// A. To do a simple case folding, use the mappings with status C + S.

Currently we only generate the iu-mappings data because we don't actually expand the i-only case. That no longer holds once we have to support modifiers.
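
As a rough illustration of the "status C + S" rule quoted above, the following sketch builds a simple case folding map from a few lines in the CaseFolding.txt format. The function name `parseSimpleCaseFolding` and the embedded excerpt are assumptions made for this example; this is not the code in scripts/iu-mappings.js.

```javascript
// Illustrative excerpt in the CaseFolding.txt line format:
// "<code>; <status>; <mapping>; # <name>"
const EXCERPT = `
0416; C; 0436; # CYRILLIC CAPITAL LETTER ZHE
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
1E900; C; 1E922; # ADLAM CAPITAL LETTER ALIF
`;

// Hypothetical parser: keeps only status C and S, as the usage note says.
function parseSimpleCaseFolding(text) {
	const map = new Map();
	for (const line of text.split('\n')) {
		// Drop the trailing comment and skip empty lines.
		const data = line.split('#')[0].trim();
		if (data === '') continue;
		const [code, status, mapping] = data.split(';').map((field) => field.trim());
		// F (full) and T (Turkic) mappings are not part of simple case folding.
		if (status !== 'C' && status !== 'S') continue;
		map.set(parseInt(code, 16), parseInt(mapping, 16));
	}
	return map;
}

const simpleFolding = parseSimpleCaseFolding(EXCERPT);
console.log(simpleFolding.get(0x1E900).toString(16)); // prints "1e922"
```

A map generated ahead of time like this would be shipped as data, so the output would not depend on the Unicode version of the Node.js release running the transpiler.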

JLHwung · 2023-09-19T15:21:34Z

rewrite-pattern.js

+	if (!config.flags.unicode && !config.flags.unicodeSets) return false;
+	if (!config.transform.unicodeFlag && !config.transform.modifiers) return false;
 	return Boolean(config.modifiersData.i || config.flags.ignoreCase);


Suggested change

-	if (!config.flags.unicode && !config.flags.unicodeSets) return false;
-	if (!config.transform.unicodeFlag && !config.transform.modifiers) return false;
-	return Boolean(config.modifiersData.i || config.flags.ignoreCase);
+	if ((config.flags.unicode || config.flags.unicodeSets) && config.modifiersData.i === true && config.transform.modifiers) return true;
+	if (!config.transform.unicodeFlag) return false;
+	return Boolean(config.flags.ignoreCase);

I think it suffices to check the modifier when either u or v flag is enabled.
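
The suggested predicate can be read as two independent conditions. Below is a minimal sketch, assuming a config object that carries only the fields named in the suggestion; `needsImplicitCaseFold` and `makeConfig` are hypothetical names for this example, not the library's API.

```javascript
// Hypothetical stand-in for the function under review.
function needsImplicitCaseFold(config) {
	const unicodeMode = config.flags.unicode || config.flags.unicodeSets;
	// Case 1: an `i` modifier is being transformed away under the u or v flag.
	if (unicodeMode && config.modifiersData.i === true && config.transform.modifiers) return true;
	// Case 2: the u flag itself is being transformed and the global i flag is set.
	if (!config.transform.unicodeFlag) return false;
	return Boolean(config.flags.ignoreCase);
}

// Hypothetical helper that fills in defaults for the fields used above.
function makeConfig(overrides) {
	return {
		flags: Object.assign({ unicode: false, unicodeSets: false, ignoreCase: false }, overrides.flags),
		transform: Object.assign({ unicodeFlag: false, modifiers: false }, overrides.transform),
		modifiersData: Object.assign({ i: undefined }, overrides.modifiersData),
	};
}

// u flag + (?i:...) + modifiers transform: folding must be accounted for.
console.log(needsImplicitCaseFold(makeConfig({
	flags: { unicode: true },
	transform: { modifiers: true },
	modifiersData: { i: true },
}))); // prints true

// Same modifier without the u or v flag: no implicit folding.
console.log(needsImplicitCaseFold(makeConfig({
	transform: { modifiers: true },
	modifiersData: { i: true },
}))); // prints false
```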

JLHwung · 2023-09-19T15:54:42Z

tests/tests.js

+		'pattern': '(?i:[Жщ])',
+		'options': { modifiers: 'transform' },
+		'expected': '(?:[\\u0416\\u0429\\u0436\\u0449])',
+	},


Can you add a test case for (?i:\u{10570})? It should return (?:[\u{10570}\u{10597}]).
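
For illustration, this is how the expected class for that test could be derived from a simple case folding map instead of the runtime's String methods. The map literal is a hand-picked excerpt and the helper names `caseEquivalents` and `toClass` are hypothetical; real data would be generated from CaseFolding.txt.

```javascript
// Hypothetical excerpt of a simple case folding map (status C + S entries):
// each key folds to its value.
const SIMPLE_FOLDING = new Map([
	[0x0416, 0x0436], // Ж folds to ж
	[0x0429, 0x0449], // Щ folds to щ
	[0x10570, 0x10597], // VITHKUQI CAPITAL LETTER A folds to its small form
	[0x1E900, 0x1E922], // ADLAM CAPITAL LETTER ALIF folds to its small form
]);

// Return every code point that shares a simple case fold with `codePoint`.
function caseEquivalents(codePoint, folding) {
	const folded = folding.has(codePoint) ? folding.get(codePoint) : codePoint;
	const result = new Set([codePoint, folded]);
	// Linear scan for brevity; a real implementation would precompute the inverse.
	for (const [from, to] of folding) {
		if (to === folded) result.add(from);
	}
	return Array.from(result).sort((a, b) => a - b);
}

// Format code points as a character class using \u{...} escapes.
function toClass(codePoints) {
	const escapes = codePoints.map((cp) => `\\u{${cp.toString(16)}}`);
	return `[${escapes.join('')}]`;
}

console.log(toClass(caseEquivalents(0x10570, SIMPLE_FOLDING))); // prints [\u{10570}\u{10597}]
```

Since the map is fixed data, the same class comes out on every Node.js version, including ones whose built-in Unicode tables predate the character.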

The Node.js 6 compat test should catch the aforementioned issue, which we should avoid.

Support non-ASCII characters inside inline IgnoreCase clause

4ca5b44

JLHwung reviewed Sep 19, 2023

JLHwung mentioned this pull request Sep 30, 2023

Support non-ascii case folding within i modifier #90

Open
