Non-ASCII directive names #23

Open
4 tasks done
viktor-yakubiv opened this issue Dec 27, 2023 · 8 comments · May be fixed by #24
Labels
🤞 phase/open Post is being triaged manually

Comments

@viktor-yakubiv

Initial checklist

Problem

I write text files using an extended markdown syntax with a flavour for specific needs. Those text files are not in Latin script. I want to keep them in a uniform language without formatting prompts in English.

Markdown in general appears to have a language-independent syntax. ASCII-limited directives bring language-dependence.

Specific example

I am a Ukrainian speaker, creating a project for the local community with no internationalisation need in the future. I want to keep files in my native language as much as possible and have syntax as simple as possible.

My text files are songs. Sometimes, they contain a chorus that repeats after each verse (paragraph). Take a timely example:

Dashing through the snow
In a one-horse open sleigh
O'er the fields we go
Laughing all the way
Bells on bob tail [sic] ring
Making spirits bright
What fun it is to ride and sing
A sleighing song tonight! Oh!

:::chorus
   Jingle bells, jingle bells,
   Jingle all the way.
   Oh! what fun it is to ride
   In a one-horse open sleigh. Hey!
   Jingle bells, jingle bells,
   Jingle all the way;
   Oh! what fun it is to ride
   In a one-horse open sleigh.
:::

A day or two ago
I thought I'd take a ride
And soon, Miss Fanny Bright
Was seated by my side,
The horse was lean and lank
Misfortune seemed his lot
He got into a drifted bank
And then we got upsot.

A day or two ago,
The story I must tell
I went out on the snow,
And on my back I fell;
A gent was riding by
In a one-horse open sleigh,
He laughed as there I sprawling lie,
But quickly drove away. Ah!

Now the ground is white
Go it while you're young,
Take the girls tonight
and sing this sleighing song;
Just get a bobtailed bay
Two forty as his speed
Hitch him to an open sleigh
And crack! you'll take the lead.

My custom script detects the chorus and repeats it after each paragraph. However, chorus in Ukrainian is приспів and I would love to keep that native word in a Ukrainian text.

Solution

Configurable naming limitations.

Alternatives

  1. Find and replace before directive parsing.
  2. A forked parser with a patch
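
A minimal sketch of the first alternative, assuming the rewrite runs on the raw text before it reaches the directive parser. The function name `replaceDirectiveNames` and the alias map are hypothetical and not part of any library; only container directives (three or more colons at the start of a line) are handled.

```javascript
// Hypothetical mapping from native directive names to ASCII names
// that the current parser accepts; not part of any library.
const directiveAliases = new Map([['приспів', 'chorus']])

// Rewrite `:::name` at the start of a line when `name` has an alias.
// The closing `:::` line has no name after it, so it is left untouched.
function replaceDirectiveNames(source, aliases = directiveAliases) {
  return source.replace(
    /^(:{3,})([^\s:\[{]+)/gmu,
    (match, colons, name) =>
      aliases.has(name) ? colons + aliases.get(name) : match
  )
}

const input = ':::приспів\nJingle bells, jingle bells,\n:::'
console.log(replaceDirectiveNames(input))
```

A plain text replacement like this also rewrites matching lines inside code blocks, which is one reason it is only a workaround.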
@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Dec 27, 2023
@viktor-yakubiv
Author

Here I shared my specific problem. I don't object to the current implementation with the imposed limitations backed up with solid reasoning in the readme about spacing and trailing colons.

I would love to understand the rationale behind limiting the directive naming.

@ChristianMurphy
Member

@wooorm may be able to offer more context.
From reviewing the description/spec https://talk.commonmark.org/t/generic-directives-plugins-syntax/444
I believe the intent is to be roughly compatible with html/custom element naming conventions https://html.spec.whatwg.org/multipage/custom-elements.html#valid-custom-element-name https://developer.mozilla.org/en-US/docs/Web/API/CustomElementRegistry/define#valid_custom_element_names which require the sequence start with an ASCII character (the difference being that directives do not require a dash).

@wooorm
Member

wooorm commented Jan 1, 2024

The reason the current state is the way it is, is so that I didn’t have to decide.

Custom elements look like a good thing to be compatible with. Although I don’t think a) the -, b) the disallowed uppercase, c) the disallow list such as font-face and such need to be enforced. That is to say: it’s not bad if we allow some names that aren’t strictly compatible with HTML custom elements.

I wonder whether we need to enforce the disallowed ASCII punctuation/symbols though. I can see $ being useful, as it’s in JS too. Putting say ( or ' or / or ; in there seems weird. Although, as HTML allows much of those characters in attribute names, perhaps we can allow them too? Otherwise we should have different handling for “tag” names and “attribute” names.

Maybe simplest is to allow all unicode characters that are not unicode whitespace? https://github.com/micromark/micromark/blob/929275e2ccdfc8fd54adb1e1da611020600cc951/packages/micromark-util-character/dev/index.js#L232

@viktor-yakubiv
Author

@wooorm and @ChristianMurphy thank you for sharing the details. I had also assumed the naming followed the custom element (or rather, HTML element) convention, but I wanted to confirm this. If this is not a strict requirement, I would appreciate a change.

Thinking of a potential solution, character ranges listed in the HTML standard for custom element names seem to be reasonable to me. The PCENChar (potential custom element name character) is quite wide; it seems to allow all "alphabets", including characters needed in my case.

PCENChar ::=
  "-" | "." | [0-9] | "_" | [a-z] | #xB7 | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x203F-#x2040] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

Yet, it is beyond the proposed simplest solution and still enforces some limits. What do you think?

Script I used to preview the ranges

I am not familiar with Unicode character ranges, so I asked ChatGPT what the range numbers mean (extended Latin, Japanese, Greek, Cyrillic, etc.) and reviewed the list manually using a script.

// The ASCII part of PCENChar is not generated here:
// "-"
// "."
// [0-9]
// "_"
// [a-z]
const chars = []
// String.fromCodePoint also handles code points above 0xFFFF,
// which String.fromCharCode would truncate to 16 bits.
chars.push(String.fromCodePoint(0xB7))
for (let i = 0xC0; i <= 0xD6; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0xD8; i <= 0xF6; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0xF8; i <= 0x37D; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x37F; i <= 0x1FFF; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x200C; i <= 0x200D; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x203F; i <= 0x2040; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x2070; i <= 0x218F; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x2C00; i <= 0x2FEF; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x3001; i <= 0xD7FF; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0xF900; i <= 0xFDCF; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0xFDF0; i <= 0xFFFD; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x10000; i <= 0xEFFFF; ++i) chars.push(String.fromCodePoint(i))

console.log(chars.join('\n'))
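
The same ranges can also be used to test a name directly instead of printing every character. This is only a sketch: the ranges are copied from the PCENChar production quoted above, while the helper names `isPcenChar` and `allPcenChars` are made up for illustration.

```javascript
// [first, last] code point pairs taken from the PCENChar production.
const pcenRanges = [
  [0x2d, 0x2e], // "-" and "."
  [0x30, 0x39], // 0-9
  [0x5f, 0x5f], // "_"
  [0x61, 0x7a], // a-z
  [0xb7, 0xb7],
  [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0x37d], [0x37f, 0x1fff],
  [0x200c, 0x200d], [0x203f, 0x2040], [0x2070, 0x218f],
  [0x2c00, 0x2fef], [0x3001, 0xd7ff], [0xf900, 0xfdcf],
  [0xfdf0, 0xfffd], [0x10000, 0xeffff]
]

// True when the first code point of `char` falls in one of the ranges.
function isPcenChar(char) {
  const code = char.codePointAt(0)
  return pcenRanges.some(([first, last]) => code >= first && code <= last)
}

// Spreading a string yields whole code points, not UTF-16 code units.
function allPcenChars(name) {
  return [...name].every(isPcenChar)
}

console.log(allPcenChars('приспів'))
```

Cyrillic letters sit in the [#x37F-#x1FFF] range, so приспів passes, while a name with an uppercase ASCII letter does not.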

@wooorm
Member

wooorm commented Jan 10, 2024

Some more considerations:

  • Allowing most custom element names is indeed nice, but it’s not a goal to only support custom element names. Directives are not only useful for HTML. One existing example is that Docusaurus treats them as an alternative to JSX. Meaning the names should also be able to match (most) JS identifiers.
  • In HTML, tag names and attribute names can match basically anything, because these names can only occur in special places. The < and whitespace and = and / and > are very strong indicators of where the parser is.
    In markdown, this is more complex. Is :a*b* an a directive followed by emphasis or an a*b* directive? Is :a$b$ an a$b$ directive, or is b math, when enabled?

Custom elements allow basically all higher-than-ascii punctuation, and in the ASCII range -, ., _.
JavaScript identifiers do not allow most punctuation, but allow $ and _ in the ASCII range.
In markdown, all ASCII punctuation either already is something in CM (_) or could be something (such as $ for math).

So I’d prefer starting with few ASCII punctuation, we can expand later:

  • Disallow all whitespace/controls
  • Disallow ascii punctuation, except allow ., -, _
  • Allow the rest (basically alphanumerical and higher-than-ascii punctuation)

@viktor-yakubiv
Author

viktor-yakubiv commented Jan 10, 2024

basically alphanumerical and higher-than-ascii punctuation

@wooorm do you have \w in mind or anything else?


I have found that /[\p{L}\p{N}][\p{L}\p{N}._-]*/u might work just fine, where \p{N} is a Unicode number and \p{L} is a Unicode letter (docs, look for # General_Category).

This may be expanded to:

export const unicodeAlphanumeric = regexCheck(/[\p{L}\p{N}]/u)

If we come to an agreement, I could prepare a pull request. What do you think?
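
A quick way to try the idea, as a sketch rather than the final rule: the pattern below is anchored with ^ and $ so it tests a whole name (the anchors and the constant name `directiveName` are assumptions for illustration), and the hyphen is placed last in the character class so it is read literally rather than as a range.

```javascript
// First character: a Unicode letter or number.
// Remaining characters: letters, numbers, ".", "_" or "-".
const directiveName = /^[\p{L}\p{N}][\p{L}\p{N}._-]*$/u

for (const name of ['приспів', 'chorus', 'a$b', '-a']) {
  console.log(name, directiveName.test(name))
}
```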

@wooorm
Member

wooorm commented Jan 10, 2024

We already have the parts in micromark. I think this is fine:

const fine = code <= codes.del
  ? code === codes.dash ||
      code === codes.dot ||
      code === codes.underscore ||
      asciiAlphanumeric(code)
  : classifyCharacter(code) !== constants.characterGroupWhitespace

Using asciiAlphanumeric from micromark-util-character, classifyCharacter from micromark-util-classify-character, and codes and constants from micromark-util-symbol!
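
For experimenting without the micromark packages, the rule can be approximated with plain regular expressions. This is a sketch under assumptions: the two patterns below stand in for asciiAlphanumeric and the whitespace check of classifyCharacter and may not match micromark exactly, and the function names are hypothetical.

```javascript
// Stand-ins for the micromark helpers (assumptions, not the real imports).
const asciiAlphanumericPattern = /^[0-9A-Za-z]$/
const whitespacePattern = /^\s$/u

// ASCII: only "-", ".", "_" and alphanumerics are accepted.
// Non-ASCII: everything except whitespace is accepted.
function isNameCharacter(char) {
  const code = char.codePointAt(0)
  if (code <= 0x7f) {
    return (
      char === '-' ||
      char === '.' ||
      char === '_' ||
      asciiAlphanumericPattern.test(char)
    )
  }
  return !whitespacePattern.test(char)
}

// A name is accepted when it is non-empty and every code point passes.
function isDirectiveName(name) {
  return name.length > 0 && [...name].every(isNameCharacter)
}

console.log(isDirectiveName('приспів'), isDirectiveName('a*b'))
```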

@wooorm
Member

wooorm commented Jan 10, 2024

Note I think similar rules need to be applied to attribute names. They are a bit more complex because say .a.b is already a shortcut for two classes.

Attributes are also prohibited from starting with an ASCII number (they’re currently only accepting ASCII too). I wonder if that’s needed.

@viktor-yakubiv viktor-yakubiv linked a pull request Jan 11, 2024 that will close this issue
5 tasks