fix: Simplify unicode punctuation #2841

Merged: 2 commits merged into markedjs:master on Jun 10, 2023

Contributor

@calculuschild calculuschild commented Jun 7, 2023

Marked version: 5.0.5

Markdown flavor: GitHub Flavored Markdown

Description

Cleans up the unicode punctuation from #2811 by using \p{P} instead of a long list of unicode characters. A handful of ASCII punctuation characters, $+<=>`^|~, are not included in that set (Unicode classifies them as symbols rather than punctuation), so they are still specified explicitly here. Includes the accompanying tweaks to a couple of other regexes so the new class is applied correctly. This also lets us cover a slightly larger punctuation set, since my understanding is that the previous \u escapes only reach \uFFFF, and there are a few more punctuation characters beyond that.
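To make the description above concrete, here is a minimal sketch of the idea. It is not the regex shipped in Marked; the names `unicodePunct`, `punctuation`, and `asciiSymbols` are hypothetical and chosen only for illustration. It assumes a JavaScript engine that supports Unicode property escapes with the `u` flag.

```javascript
// Illustrative only: these are NOT the actual regexes from Marked's source.

// \p{P} is the Unicode general category "Punctuation"; it requires the `u` flag.
const unicodePunct = /\p{P}/u;

// The same class plus the ASCII characters that Unicode files under "Symbol".
const punctuation = /[\p{P}$+<=>`^|~]/u;

// The characters called out in the PR description.
const asciiSymbols = [...'$+<=>`^|~'];

// Which of them does \p{P} alone fail to match?
const missedByP = asciiSymbols.filter((ch) => !unicodePunct.test(ch));

// Does the combined class cover all of them?
const coveredByCombined = asciiSymbols.every((ch) => punctuation.test(ch));

// U+10100 AEGEAN WORD SEPARATOR LINE is a punctuation character beyond U+FFFF.
const astral = '\u{10100}';
const astralMatches = punctuation.test(astral);

console.log({ missedByP, coveredByCombined, astralMatches });
```

The `u` flag matters twice here: it enables `\p{...}` at all, and it makes the regex operate on code points, so characters above U+FFFF are matched as single units.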

And a tiny unrelated logic simplification in the emStrong Tokenizer.

My only question is whether there is a better way to exclude single characters from \p{P}. For instance, in emStrong we don't include the current delimiter (* or _) in the punctuation checks. I get around this now with an additional negative lookahead, something like:

(?!_)\p{P}

This excludes _ from the \p{P} group. I'm OK with this, but if there is some secret "subtraction from a unicode set" syntax, I would like to know.
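The lookahead workaround described above can be sketched as follows. This is an assumption-laden illustration rather than the code in the emStrong tokenizer: `PUNCT_CLASS` and `punctuationExcept` are hypothetical names, and the helper simply prepends a negative lookahead for the delimiter to the punctuation class.

```javascript
// Illustrative only: hypothetical helper, not part of Marked's API.

// Punctuation class as a pattern string: \p{P} plus the extra ASCII symbols.
const PUNCT_CLASS = '[\\p{P}$+<=>`^|~]';

// Build a regex that matches any punctuation character except `delimiter`.
function punctuationExcept(delimiter) {
  // Escape regex syntax characters so that e.g. "*" is treated literally.
  const escaped = delimiter.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // (?!x) fails the match at any position where the next character is x.
  return new RegExp('(?!' + escaped + ')' + PUNCT_CLASS, 'u');
}

const notUnderscore = punctuationExcept('_');
const notAsterisk = punctuationExcept('*');

console.log(notUnderscore.test('_'), notUnderscore.test('*'));
console.log(notAsterisk.test('*'), notAsterisk.test('_'));
```

On the set-subtraction question: engines that implement the newer `v` (unicodeSets) flag accept class subtraction such as `[\p{P}--[_]]`, but that flag is not used here because it is unavailable in older runtimes.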

I didn't add tests, but could potentially look up some of the characters that were missing previously and add them to the existing unicode test.

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

@calculuschild added the "RR - refactor & re-engineer" label ("Results in an improvement to developers using Marked, or end-users, or both.") Jun 7, 2023

Member

@UziTech UziTech left a comment

LGTM. I think the current tests are enough to tell if this is working as expected.

@UziTech changed the title from "Simplify unicode punctuation" to "fix: Simplify unicode punctuation" Jun 10, 2023
@UziTech UziTech merged commit f19fe76 into markedjs:master Jun 10, 2023
11 checks passed
github-actions bot pushed a commit that referenced this pull request Jun 10, 2023
# [5.1.0](v5.0.5...v5.1.0) (2023-06-10)

### Bug Fixes

* Simplify unicode punctuation ([#2841](#2841)) ([f19fe76](f19fe76))

### Features

* add Marked instance ([#2831](#2831)) ([353e13b](353e13b))