Add regexp/unicode-property rule #722

Merged
ota-meshi merged 4 commits into master from issue720 on Apr 8, 2024

@RunDevelopment RunDevelopment commented Apr 6, 2024

Fixes #720

This PR adds a new rule that allows users to enforce the naming of Unicode properties. It has 3 main features:

  1. Removing/adding gc=/General_Category= keys, e.g. \p{gc=L} -> \p{L}. These prefixes are unnecessary, because the values of the General_Category property can be accessed without the key.
  2. Enforcing long or short keys for General_Category/gc, Script/sc, and Script_Extensions/scx.
  3. Enforcing long or short names of values and binary properties. E.g. \p{L} -> \p{Letter} and \p{Hex} -> \p{Hex_Digit}.
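The spellings in the list above are interchangeable at the regex level, which is why the rule can treat them as a pure style choice. A minimal sketch that checks this at runtime (the helper `matchesAll` and the variable names are only for illustration and are not part of this PR):

```typescript
// Four spellings of the same General_Category value (letters).
// Patterns are built with `new RegExp` so the "u" flag is applied at runtime.
const spellings: string[] = [
    "\\p{General_Category=Letter}",
    "\\p{gc=L}",
    "\\p{Letter}",
    "\\p{L}",
];

// Hypothetical helper: true if every pattern matches the whole input.
function matchesAll(patterns: string[], input: string): boolean {
    return patterns.every((p) => new RegExp(`^${p}$`, "u").test(input));
}

// U+00E9 (e with acute accent) is a letter, "7" is not.
const letterAgrees = matchesAll(spellings, "\u00e9");
const digitRejected = spellings.every(
    (p) => !new RegExp(`^${p}$`, "u").test("7"),
);

// Binary properties have short and long names as well.
const hexShort = new RegExp("^\\p{Hex}+$", "u");
const hexLong = new RegExp("^\\p{Hex_Digit}+$", "u");
const hexAgrees = hexShort.test("c0ffee") && hexLong.test("c0ffee");
```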

All of these features can be individually configured and turned off by the user. The regexp/unicode-property rule is not included in our recommended config, because it only enforces a specific style.

Default configuration

The default configuration is the following:

{
    "generalCategory": "never",
    "key": "ignore",
    "property": {
        "binary": "ignore",
        "generalCategory": "ignore",
        "script": "long"
    }
}

This means that, by default, the rule will (1) remove General_Category/gc keys (e.g. \p{gc=L} -> \p{L}) and (2) enforce long names for values of the Script and Script_Extensions properties (e.g. \p{sc=Kana} -> \p{sc=Katakana}).
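For anyone who wants to spell the defaults out explicitly, here is a rough sketch of what a rule entry could look like, assuming the usual ESLint `[severity, options]` form. The variable names are illustrative, and only option values that appear in this description are used. The last lines check that the short and long script names really select the same characters.

```typescript
// Default options as listed in this PR description.
const defaultOptions = {
    generalCategory: "never",
    key: "ignore",
    property: {
        binary: "ignore",
        generalCategory: "ignore",
        script: "long",
    },
};

// Sketch of a `rules` section; the surrounding ESLint config
// (plugin registration etc.) is omitted here.
const rules = {
    "regexp/unicode-property": ["error", defaultOptions] as [
        string,
        typeof defaultOptions,
    ],
};

// \p{sc=Kana} and \p{sc=Katakana} are aliases of the same script value.
// U+30AB is KATAKANA LETTER KA.
const kanaShort = new RegExp("^\\p{sc=Kana}$", "u");
const kanaLong = new RegExp("^\\p{sc=Katakana}$", "u");
const sameScript = kanaShort.test("\u30ab") && kanaLong.test("\u30ab");
```

With these defaults, only the `gc=` prefix removal and the long script names would be enforced; everything set to "ignore" is left as written.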

I chose a minimal configuration because I didn't want the rule to generate a lot of errors for people trying to adopt it. I think the 2 effects work well in any code base, no matter what style they usually prefer. (1) simply removes an unnecessary prefix to "simplify" the regex, and (2) prevents the use of the (IMO) horrible aliases for scripts.

Unicode data

Since I needed the data for the mapping between aliases to implement this rule, I had to make the choice between taking a dependency (e.g. @unicode/unicode-15.0.0) or including the relevant data in the source files of this project.

I chose against adding a dependency, because it was easy enough to get the data I needed and because most of @unicode/unicode-15.0.0 would be dead weight to us.

However, the data I included is used through an API (the AliasMap class), so we can easily switch to using a dependency without needing to change the regexp/unicode-property rule.
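To illustrate the idea of hiding the alias data behind a small API, here is a hypothetical sketch of such a class. This is not the `AliasMap` implementation from this PR: the constructor shape and the `toShort`/`toLong` method names are assumptions made for the example, and the sample data is a tiny hand-picked subset.

```typescript
// Hypothetical sketch only; the real AliasMap in this PR may look different.
class AliasMapSketch {
    private readonly shortByName = new Map<string, string>();
    private readonly longByName = new Map<string, string>();

    // Each pair is [shortName, longName].
    constructor(pairs: ReadonlyArray<readonly [string, string]>) {
        for (const [shortName, longName] of pairs) {
            // Index both spellings so lookups work from either direction.
            this.shortByName.set(shortName, shortName);
            this.shortByName.set(longName, shortName);
            this.longByName.set(shortName, longName);
            this.longByName.set(longName, longName);
        }
    }

    // Returns the short alias, or the input unchanged if it is unknown.
    toShort(name: string): string {
        return this.shortByName.get(name) ?? name;
    }

    // Returns the long alias, or the input unchanged if it is unknown.
    toLong(name: string): string {
        return this.longByName.get(name) ?? name;
    }
}

// Small sample of script aliases, for illustration only.
const scriptAliases = new AliasMapSketch([
    ["Kana", "Katakana"],
    ["Latn", "Latin"],
]);
```

Because callers only see `toShort` and `toLong` in this sketch, the backing data could come from inline tables or from a package without the rule noticing.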


changeset-bot bot commented Apr 6, 2024

🦋 Changeset detected

Latest commit: f2ff745

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name: eslint-plugin-regexp
Type: Minor


@ota-meshi ota-meshi left a comment

Looks great to me! Thank you!

@ota-meshi ota-meshi merged commit 35c8153 into master Apr 8, 2024
7 checks passed
@ota-meshi ota-meshi deleted the issue720 branch April 8, 2024 00:27
Successfully merging this pull request may close these issues.

Prefer long unicode class names over the short ones