Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent-abbreviations - non-ASCII characters ignored in filenames #2308

Conversation

MichaelBlm
Copy link
Contributor

Fixes #2292

@MichaelBlm
Copy link
Contributor Author

Here is an explanation of the updated regular expression:

  1. (?=\P{Ll}): This is a positive lookahead assertion. It matches a position where the next character is not a Unicode lowercase letter. \P{Ll} is a Unicode character class that negates the Ll (lowercase letter) class.

  2. (?<=\P{L}): This is a positive lookbehind assertion. It matches a position where the previous character is not a Unicode letter (uppercase or lowercase). \P{L} negates the L (letter) Unicode character class.

@fisker
Copy link
Collaborator

fisker commented Apr 4, 2024

The solution looks tricky to me, I prefer just ignore if non-ascii characters included.

@@ -116,7 +116,7 @@ const getNameReplacements = (name, options, limit = 3) => {
}

// Split words
const words = name.split(/(?=[^a-z])|(?<=[^A-Za-z])/).filter(Boolean);
const words = name.split(/(?=\P{Ll})|(?<=\P{L})/u).filter(Boolean);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the verbose names (Lowercase_Letter), for readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to include the verbose names

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make a rule for it 😄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll work on it actually

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: the proposal is on another repo.

@sindresorhus
Copy link
Owner

@fisker Tricky how? The solution looks fine to me.

@MichaelBlm MichaelBlm force-pushed the prevent-abbreviations-non-ascii-characters-ignored branch from 6295fde to 411f6e3 Compare April 4, 2024 18:38
@MichaelBlm
Copy link
Contributor Author

My explanation of the regex wasn't that clear. \P{Letter} is equivalent to [^a-zA-Z] but it includes non-ASCII characters

@fisker
Copy link
Collaborator

fisker commented Apr 4, 2024

@fisker Tricky how? The solution looks fine to me.

In different languages, the latin letters can be used with other characters together to be a word, so the non-letter characters not a word boundary.

For example, in Chinese, "B超" "T恤" are words. Similar in Japanese, I think.

@MichaelBlm
Copy link
Contributor Author

@fisker Tricky how? The solution looks fine to me.

In different languages, the latin letters can be used with other characters together to be a word, so the non-letter characters not a word boundary.

For example, in Chinese, "B超" "T恤" are words. Similar in Japanese, I think.

Can you please provide me a valid/invalid test case to see if this fix doesn't already account for that?

@fisker
Copy link
Collaborator

fisker commented Apr 6, 2024

I don't really have a case now.

@sindresorhus sindresorhus merged commit 28762c8 into sindresorhus:main Apr 6, 2024
18 checks passed
@fregante
Copy link
Collaborator

fregante commented Apr 6, 2024

@fisker you're saying that 龖.js will be counted as "one-letter long", but in Chinese it's a fully-formed "word", not an abbreviation. Hence "ignore if non-ascii characters included"

@fisker
Copy link
Collaborator

fisker commented Apr 6, 2024

I don't think Chinese characters are "letter"s, but I can be wrong.

@fregante
Copy link
Collaborator

fregante commented Apr 7, 2024

That's right, it's why quoted "one-letter word". The rule counts characters as "letters" but not all writing systems have letters.

https://en.wikipedia.org/wiki/Writing_system#Basic_terminology

The rule is based on the idea that no concept can be defined in 2 characters because in most languages words are longer than that.

@fregante
Copy link
Collaborator

fregante commented Apr 7, 2024

@MichaelBlm
Copy link
Contributor Author

I'll explore a solution using the segmenter. Looks promising

@MichaelBlm
Copy link
Contributor Author

I added this code to the valid tests and it is passing with the current implementation? Is that the correct behavior?
{ code: 'foo();', filename: '龖.js', },

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prevent-abbreviations - non-ASCII characters ignored in filenames
4 participants