Fix v-flag bugs #85

Merged
merged 7 commits into from Sep 23, 2023

Conversation

JLHwung
Collaborator

@JLHwung JLHwung commented Sep 19, 2023

In this PR we reuse the unicode fixtures for the v-flag tests, based on the observation that /.../u and /.../v should yield the same result unless set notation or properties of strings are involved.
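
A rough way to spot-check that observation on a native engine is sketched below. It is not part of this PR: `compareUAndV` is a hypothetical helper, the patterns and inputs are arbitrary samples that avoid set notation and properties of strings, and the `v` flag is only available on recent engines, so the sketch returns `null` when it is unsupported.

```javascript
// Hypothetical helper, not part of the project's test suite.
// Returns null when the engine lacks the v flag, otherwise the list of
// (source, input) pairs on which the u and v results disagree.
function compareUAndV(patterns, inputs) {
  let supportsV = true;
  try {
    new RegExp('', 'v');
  } catch (e) {
    supportsV = false;
  }
  if (!supportsV) return null;

  const mismatches = [];
  for (const source of patterns) {
    // Anchor both regexes so only whole-string matches count.
    const u = new RegExp(`^(?:${source})$`, 'u');
    const v = new RegExp(`^(?:${source})$`, 'v');
    for (const input of inputs) {
      if (u.test(input) !== v.test(input)) mismatches.push({ source, input });
    }
  }
  return mismatches;
}

// Sample patterns without set notation or properties of strings.
const patterns = ['[a-z]', '[^a-z]', '.', '\\p{Script=Greek}'];
const inputs = ['a', 'A', '\u03b1', '\u{12345}', ''];
const uvMismatches = compareUAndV(patterns, inputs);
console.log(uvMismatches);
```

On an engine that supports `v`, the list of mismatches should be empty for these samples. That is only a spot check of the observation, not a proof of it.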

We also introduce the matches and nonMatches properties to the v-flag fixture runner. They list the strings that the transpiled regex is supposed to match and reject, respectively. This is useful when the transpiled regex is too verbose to read comfortably.
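
As an illustration of how such properties could be checked, here is a minimal sketch. It is not the runner from this PR: `checkFixture` is a hypothetical name, and anchoring the transpiled source with `^` and `$` is an assumption about how whole-string matching might be enforced. The fixture values are copied from one of the fixtures touched in this PR.

```javascript
// Hypothetical helper, not the project's actual fixture runner.
// Returns a list of strings whose match result differs from the fixture.
function checkFixture(fixture, flags = '') {
  // Assumption: anchor the transpiled source so a partial match does not count.
  const re = new RegExp(`^${fixture.expected}$`, flags);
  const failures = [];
  for (const s of fixture.matches || []) {
    if (!re.test(s)) failures.push({ string: s, wanted: 'match' });
  }
  for (const s of fixture.nonMatches || []) {
    if (re.test(s)) failures.push({ string: s, wanted: 'no match' });
  }
  return failures;
}

const fixture = {
  expected: '(?:[\\0-JL-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF])',
  matches: ['k', '\u212a', '\u{12345}', '\uDAAA', '\uDDDD'],
  nonMatches: ['K'],
};

const failures = checkFixture(fixture);
console.log(failures); // []
```

Returning the failures instead of throwing keeps the helper independent of any particular assertion library.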

This PR includes commits from #84; I will rebase once that PR is merged.

This is a draft PR because I still haven't figured out how to avoid BMP-ifying regex strings twice. For negative set notation we extract single code points from UNICODE_SET, which yields surrogates in the output; regenerate then BMP-ifies that output again, so the result is longer than necessary, though it seems to be correct.

@JLHwung JLHwung marked this pull request as draft September 19, 2023 20:33
@@ -105,6 +105,8 @@ const unicodeSetFixtures = [
},
{
pattern: '[^[a-z][f-h]]',
matches: ["A", "\u{12345}"],
nonMatches: ["a", "z"],
expected: '(?:(?![a-z])[\\s\\S])',
Collaborator Author

The current transpiled result does not match "\u{12345}".
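
One plausible reading of this failure, sketched below, is that without the u flag `[\s\S]` consumes a single UTF-16 code unit, so a whole-string match against an astral character cannot succeed. The `^`/`$` anchoring is an assumption about how the runner compares strings, not something stated in this thread.

```javascript
// The transpiled source from the fixture above.
const transpiled = '(?:(?![a-z])[\\s\\S])';
const astral = '\u{12345}';

// Assumption: the runner requires the whole string to match.
const withoutU = new RegExp(`^${transpiled}$`);
const withU = new RegExp(`^${transpiled}$`, 'u');

console.log(astral.length);         // 2 (a surrogate pair)
console.log(withoutU.test('A'));    // true
console.log(withoutU.test(astral)); // false: [\s\S] covers one code unit only
console.log(withU.test(astral));    // true: [\s\S] covers one code point
```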

@JLHwung JLHwung marked this pull request as ready for review September 20, 2023 19:42
);
const negativeSet = UNICODE_SET.clone().remove(singleChars);
const bmpOnly = regenerateContainsAstral(negativeSet);
update(characterClassItem, negativeSet.toString({ bmpOnly: bmpOnly }));
Collaborator Author

If the regenerate set spans from code points before the surrogate range into the astral range, toString({ bmpOnly: false }) returns a much more verbose result, while toString({ bmpOnly: true }) is already correct. I think this should be fixed in regenerate later.

const regenerate = require('regenerate');
const set = regenerate().addRange(0xd000, 0x10000);

console.log(set.toString());
// [\uD000-\uD7FF\uE000-\uFFFF]|\uD800\uDC00|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]


console.log(set.toString({ bmpOnly: true }));
// [\uD000-\uFFFF]|\uD800\uDC00

The latter appears to be correct, as it matches lone surrogates as well as U+10000. The former looks as if [\uD000-\uFFFF]|\uD800\uDC00 had been passed through the BMP pass a second time.
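
The claim about lone surrogates and U+10000 can be spot-checked without regenerate by pasting the two outputs quoted above into plain regexes. This is a sketch on a handful of sample strings with whole-string anchoring (an assumption). It says nothing about unanchored matching inside longer strings, where the lookarounds in the longer form come into play.

```javascript
// The two sources are copied from the console output quoted above.
const verbose = String.raw`[\uD000-\uD7FF\uE000-\uFFFF]|\uD800\uDC00|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]`;
const compact = String.raw`[\uD000-\uFFFF]|\uD800\uDC00`;

// Whole-string match, no u flag, as in BMP-only output.
const whole = (source) => new RegExp(`^(?:${source})$`);

const samples = [
  '\uD000',    // first code point of the range
  '\uD800',    // lone high surrogate
  '\uDC00',    // lone low surrogate
  '\u{10000}', // last code point of the range
  '\u{10001}', // just outside the range
  '\uCFFF',    // just below the range
];

const results = samples.map((s) => ({
  codePoint: s.codePointAt(0).toString(16),
  verbose: whole(verbose).test(s),
  compact: whole(compact).test(s),
}));
console.table(results);
```

On these samples both forms accept the first four strings and reject the last two.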

- expected: '(?:[\\0-JL-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF])',
+ matches: ["k", "\u212a", "\u{12345}", "\uDAAA", "\uDDDD"],
+ nonMatches: ["K"],
+ expected: '(?:[\\0-JL-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF])',
Collaborator Author

They are now much shorter and easier to reason about. I also added matches tests so that we can be confident the transpiled result is correct.

Collaborator

@nicolo-ribaudo nicolo-ribaudo left a comment

Awesome!

@nicolo-ribaudo nicolo-ribaudo merged commit 91ee342 into mathiasbynens:main Sep 23, 2023
3 checks passed