Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable wide Unicode support for names #24

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

viktor-yakubiv
Copy link

@viktor-yakubiv viktor-yakubiv commented Jan 11, 2024

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and couldn’t find anything (or linked relevant results below)
  • If applicable, I’ve added docs and tests

Description of changes

Enables almost full Unicode support for directive names. This is tricky to test, I've added only Latin, Greek and Cyrillyc characters. Also, I have tested combining accent modifiers at the middle and at the end of directive name.

Attribute names are worth to look too but maybe in a separate PR.

The PR should be ready to review. Thank you!

Closes #23

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Jan 11, 2024
@codecov-commenter
Copy link

codecov-commenter commented Jan 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7f23ba8) 100.00% compared to head (73eb92a) 100.00%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #24   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            9         9           
  Lines         1416      1439   +23     
=========================================
+ Hits          1416      1439   +23     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@viktor-yakubiv viktor-yakubiv marked this pull request as draft January 11, 2024 16:37
@viktor-yakubiv viktor-yakubiv marked this pull request as ready for review January 11, 2024 18:44
Copy link
Member

@wooorm wooorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some style notes :)

dev/lib/factory-name.js Outdated Show resolved Hide resolved
return self.previous === codes.dash || self.previous === codes.underscore
? nok(code)
: ok(code)
return allowedEdgeCharacter(self.previous) ? ok(code) : nok(code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn’t dashes also be edge characters?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean allowed edge characters, it was forbidden by the spec previously. I kept it but I don't mind changing.

Currently, the name cannot either start or end with any punctuation or underscore.

Is this something you suggest to change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To refresh my memory: So, in name, this last stuff is about what is possible to exit after.
That behavior at the end is very different from whether the first character is allowed to start a name.
Before, there was a very different check compared to the check in start: - and _ were allowed in names but not at the end.
Now they’re the same. I’m not sure if that’s useful? Perhaps the last line should just be return ok(code)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried return ok(code) at the end. Allowing the name to end with an underscore interferes with emphasis notation. I reverted the commit.

I think, there is no point to allow punctuation in the end but don't allow at the start. If we are going to allow punctuation, it should be (almost) equal.

Possible options:

  1. Leave as is
  2. Allow any punctuation at the start but forbid markdown special chars at the end. Those should include =, ~ (special in some flavours), _, *, parentheses and perhaps some more — basically anything in the ASCII (in comparison to option 3, blacklist under 128).
  3. Allow any punctuation at the start and end if it's beyond ASCII (in comparison to option 2, whitelist above 127)

test/index.js Outdated Show resolved Hide resolved
@wooorm
Copy link
Member

wooorm commented Jan 12, 2024

You could maybe test the math symbol for pi π, and an emoji, such as 🌍?

Copy link
Author

@viktor-yakubiv viktor-yakubiv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have incorporated the requested changes.

Comment on lines +478 to +481
assert.equal(micromark('::𝜋∈ℝ', options()), '') // Italic
assert.equal(micromark('::𝛑≈3.14', options()), '') // Bold
assert.equal(micromark('::𝝅∉ℚ', options()), '') // Bold italic
assert.equal(micromark('::𝞹≠3.14', options()), '') // Sans bold italic
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wooorm the letter Pi you have suggested, is a part of the Greek alphabet covered by the test:

should support unicode alphabets in name

In this (an other similar tests), I added characters that have the word mathematical in their names.

@wooorm
Copy link
Member

wooorm commented Jan 29, 2024

Thanks for your continued work. I’ve been wondering the last week what to do about punctuation. And about the implications to the semver version of this package.

In the ASCII range we allow ascii alphanumerics and -, ., _.
I think in the rest of the unicode range we should also only allow alphanumerics.
We should be able to do that by checking classifyCharacter(x) === undefined.
I don’t see unicode “punctuation” tested much, could you see if that changes things?

Some examples of symbols/punctuation outside of the ascii range are and £!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤞 phase/open Post is being triaged manually
Development

Successfully merging this pull request may close these issues.

Non-ASCII directive names
3 participants