
Not all emojis work as bare keys #954

Open
arp242 opened this issue Jan 16, 2023 · 27 comments · May be fixed by #990 or #1002

Comments

@arp242
Contributor

arp242 commented Jan 16, 2023

I was writing test cases for this, and using a pirate flag (🏴‍☠️) doesn't work; this is:

     CPoint  Dec    UTF8        HTML       Name (Cat)
'🏴' U+1F3F4 127988 f0 9f 8f b4 🏴  WAVING BLACK FLAG (Other_Symbol)
'�'  U+200D  8205   e2 80 8d    ‍      ZERO WIDTH JOINER (Format)
'☠'  U+2620  9760   e2 98 a0    ☠   SKULL AND CROSSBONES (Other_Symbol)

The flag and ZWJ are fine, but the skull and crossbones isn't allowed in the current range.

Seems confusing since most emojis work. It took me quite a bit of time to figure out when modifying my parser to support this, because I just assumed I had missed something, but it turns out it's just not in the allowed range:

unquoted-key-char =/ %x2070-218F / %x2460-24FF          ; include super-/subscripts, letterlike/numberlike forms, enclosed alphanumerics
unquoted-key-char =/ %x2C00-2FEF / %x3001-D7FF          ; skip arrows, math, box drawing etc, skip 2FF0-3000 ideographic up/down markers and spaces

Looking at the U+2500..U+2BFF range, I don't really see why we need to skip a lot of these things.


I know we discussed this before, but I still think we should either allow only letters+numbers or allow almost everything (with a few exceptions); the current behaviour is just confusing. The spec's examples use an emoji and ZWJ is explicitly allowed, so you'd expect all emojis to work, but it turns out only some do. It just so happened by chance that "pirate flag" was the first emoji I tried, but there are probably others as well, and with ZWJ combinations it'll be whack-a-mole.

Either way, IMHO we should support all emojis or none. Many other ZWJ combinations do work fine; 🏳️‍🌈 (U+1F3F3 ZWJ U+1F308) or 🏴󠁧󠁢󠁷󠁬󠁳󠁿 is okay, but 🏳️‍⚧️ isn't (as U+26A7 isn't in the allowed range). In a quick test it seems all flags work, except two.

Originally posted by @arp242 in #891 (comment)

@arp242
Contributor Author

arp242 commented Jan 16, 2023

I did a quick check, and 179 emojis currently fail (the other 1530 work); here's a list: https://gist.github.com/arp242/a3b99e52c9dea2b6e2d6217aab490ad3 (that's based on Unicode 14, not 15, so there may be a few more – I need to update my tool to 15).

Also, the variation selectors (U+FE0F in the example above) are a right pain; these are pretty much invisible in most editors. They should be excluded, together with all the RTL stuff (which is already excluded).

@abelbraaksma
Contributor

Either way, IMHO we should support all emojis or none.

You use ZWJ for creating the emoji. While this is fine, the overlaid code point itself is not in the proper range. We’ve looked at more complex ranges, but decided against it for the added complexity it brings. There will always be some ranges of code points people may feel are missing. Using ZWJ you can ‘invent’ emojis or other characters.

I agree it is somewhat unfortunate that certain combinations are currently not possible. But keep in mind that we’re talking about code point ranges, not about characters. And what you’re describing is allowing certain characters, which is an avenue we’re trying to avoid.

@abelbraaksma
Contributor

That said, it’s possibly an oversight, as I don’t see anything in 2600-26FF that needs to be illegal. We’d have to look a little closer at the wider range you mention and the discussion or commit log to find out whether we did this deliberately (and then reassess whether that conclusion is still valid) or whether it was an honest mistake in the added ranges.

I tried to be meticulous, but hey, we’re only human ;).

Keep in mind that there’s also the argument that we don’t want to over-complicate the ranges. We try to be inclusive, and mainly ban ‘unsuitable’ ranges, while including the rest.

@arp242
Contributor Author

arp242 commented Jan 16, 2023

Keep in mind that there’s also the argument that we don’t want to over-complicate the ranges. We try to be inclusive, and mainly ban ‘unsuitable’ ranges, while including the rest.

Yes, the current check I need to do is:

func isBareKeyChar(r rune) bool {
	return (r >= 'A' && r <= 'Z') ||
		(r >= 'a' && r <= 'z') ||
		(r >= '0' && r <= '9') ||
		r == '_' || r == '-' ||
		r == 0xb2 || r == 0xb3 || r == 0xb9 || (r >= 0xbc && r <= 0xbe) ||
		(r >= 0xc0 && r <= 0xd6) || (r >= 0xd8 && r <= 0xf6) || (r >= 0xf8 && r <= 0x037d) ||
		(r >= 0x037f && r <= 0x1fff) ||
		(r >= 0x200c && r <= 0x200d) || (r >= 0x203f && r <= 0x2040) ||
		(r >= 0x2070 && r <= 0x218f) || (r >= 0x2460 && r <= 0x24ff) ||
		(r >= 0x2c00 && r <= 0x2fef) || (r >= 0x3001 && r <= 0xd7ff) ||
		(r >= 0xf900 && r <= 0xfdcf) || (r >= 0xfdf0 && r <= 0xfffd) ||
		(r >= 0x10000 && r <= 0xeffff)
}

Which doesn't exactly fill me with joy. But I'd rather have one somewhat ugly "wtf?!" function than silly stuff like "😗 works but ☺️ doesn't". These are quite distinct codepoints, but they're grouped next to each other in "emoji ordering". The way emojis work is a bit of a mess.
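
For a larger table, the same check would typically be done with a sorted list of start/stop pairs and a binary search rather than one big boolean expression; a rough sketch of that approach (the two ranges shown are just placeholders, not the real table):

var bareKeyRanges = [][2]rune{
	{0x2070, 0x218f}, // placeholder entries only; the real table has many more
	{0x2460, 0x24ff},
}

// inRanges reports whether r falls inside any of the sorted,
// non-overlapping [start, stop] ranges.
func inRanges(r rune, ranges [][2]rune) bool {
	lo, hi := 0, len(ranges)
	for lo < hi {
		mid := (lo + hi) / 2
		switch {
		case r < ranges[mid][0]:
			hi = mid
		case r > ranges[mid][1]:
			lo = mid + 1
		default:
			return true
		}
	}
	return false
}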

@abelbraaksma
Contributor

I found out what the original motivation was:

#687 (comment)

Basically, we accept letter-like code points. Dingbats, mathematical symbols and box drawing code points aren’t ‘letter-like’. Neither are emojis, of course. But the emoji ranges that are allowed were added in later versions of Unicode and belong to “other languages that weren’t previously assigned”. As such, they fall under “be liberal in what to accept from future versions of Unicode”.

Perhaps the right course of action would’ve been to exclude other non-letter-like ranges from later versions. However, that brought about another downside: ID tokens in HTML and XML would then not be valid unquoted names. Several RFCs overlap with the current definition. While this is not necessarily a goal for TOML, it has its benefits.

With the arguments in the mentioned thread, I still think we’re on the right track here, using the ‘letter-like’ definition of the most widely implemented and used Unicode version (I believe that’s 5 or 6, at least .NET Framework and Windows prior to v11 (or v10?) use 5.x.).

Of course, we could allow more tokens that aren’t allowed elsewhere. Or disallow more tokens that aren’t disallowed elsewhere. This would take us further away from widely established identifier definitions, but we may choose to go down that path.

@arp242
Contributor Author

arp242 commented Jan 16, 2023

This would take us further away from widely established identifier definitions

TOML already allows almost everything as quoted keys, so I think this doesn't matter at all. Directly using TOML keys in HTML, XML, or pretty much anywhere else without processing is already something you can't do.

Looking at some other environments, there doesn't seem that much consensus in the first place:

  • Go – Unicode category letter or digit.
  • C# – Category Letter, all subcategories; category Number, subcategory letter.
  • Rust – TR31 with some changes (I didn't read through all of TR31).
  • HCL – TR31.
  • Python – TR31 with some changes (also see PEP 3131).
  • Swift has something similar to what's in TOML now, although not exactly identical.

Not sure what other languages/formats support Unicode identifiers off the top of my head.


Going back to basics, the goals I'd set would be:

  • Allow people to use their script/alphabet of choice (Chinese, Tamil, Icelandic, Cuneiform, whatnot).
  • Minimize potential for confusion.
  • Be consistent; this is an extension of the previous point.

In that sense, "support emojis" is out of scope IMO; I don't think it would be horrible to lose support for it especially since you can still use them inside quoted keys. BUT having ~90% of the emojis work fine and ~10% not work is a bug IMO, especially since an emoji is explicitly included as an example.

It's probably better to allow too little and then expand on that later if there's a demand for it. Once we allow something we can never take it back because that would break compatibility. And there's also #941; we need something for that to address "minimize potential for confusion", and tweaking (i.e. limiting) the set of allowed codepoints is one possible way to address that.

@pradyunsg
Member

pradyunsg commented Jan 16, 2023

TBH, it looks like we should align with Unicode TR31 syntax, rather than trying to come up with something else.

It's what Go, Rust and Python seem to be doing (IIUC), and I think that might just be a more "obvious" way to achieve what we want to achieve here.

@ChristianSi
Contributor

ChristianSi commented Jan 17, 2023

The only possible problems I see in this range are the Eight Trigrams (☰ ☱ ☲ ☳ ☴ ☵ ☶ ☷) and various symbols related to yin and yang (⚊ ⚋ ⚌ ⚍ ⚎ ⚏). Some of these, especially ⚌, look very much like the equals sign (=), so it might be a good idea to avoid them in unquoted keys to prevent possible confusion.

One idea: shorten the forbidden range to run from U+2630 (☰) to U+268F (⚏). Unfortunately, this shorter range still contains some very popular symbols (e.g. ☺ ♀ ♂) whose non-allowance could remain confusing.

Another idea: actually the Eight Trigrams are probably OK, since they all have three lines rather than the two of the equals sign. Two of the yin and yang symbols (⚊ ⚋) should be fine too, since they look similar to the underscore, which is already allowed. So we could just forbid U+268C to U+268F (⚌ ⚍ ⚎ ⚏), allowing everything else in that range.

@abelbraaksma
Contributor

abelbraaksma commented Jan 17, 2023

TBH, it looks like we should align with Unicode TR31 syntax

We previously decided against it, because it’s complex and, IIRC, relies on categories. It’s likely (but I’d have to check) that it isn’t compatible with what we currently have (apart from the fact that we already allow starting with a digit).

Edit: the TR31 set is quite disjoint from what we have:

ID_Start characters are derived from the Unicode General_Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.

ID_Continue characters include ID_Start characters, plus characters having the Unicode General_Category of nonspacing marks, spacing combining marks, decimal number, connector punctuation, plus Other_ID_Continue, minus Pattern_Syntax and Pattern_White_Space code points.
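
For reference, those two definitions roughly correspond to the following Go sketch using the standard library's category tables (this ignores Other_ID_Start/Other_ID_Continue and the Pattern_Syntax/Pattern_White_Space exclusions, so it's only an approximation of the real TR31 sets):

import "unicode"

// isIDStart approximates TR31 ID_Start: categories Lu, Ll, Lt, Lm, Lo, Nl.
func isIDStart(r rune) bool {
	return unicode.IsLetter(r) || unicode.Is(unicode.Nl, r)
}

// isIDContinue approximates TR31 ID_Continue: ID_Start plus Mn, Mc, Nd, Pc.
func isIDContinue(r rune) bool {
	return isIDStart(r) ||
		unicode.Is(unicode.Mn, r) || unicode.Is(unicode.Mc, r) ||
		unicode.Is(unicode.Nd, r) || unicode.Is(unicode.Pc, r)
}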

The two biggest issues: it uses categories, and it is letter-like only. Miscellaneous Symbols, which we explicitly include, are forbidden. Also, categories depend on the Unicode version, which we try to avoid.

The full range, for any supported Unicode version, is rather complex. In the previous thread there’s a comment that shows just how complex, and we all kinda sighed with relief that in the end it wasn’t necessary to go that route.

@abelbraaksma
Contributor

abelbraaksma commented Jan 17, 2023

Some of these, especially ⚌, look very much like the equals sign (=),

If we really want to include this range, we can do the same as we did for the Greek question mark (which looks like a semicolon) and forbid only the yin and yang sign that looks like the equals sign.

TOML already allows almost everything as quoted keys

That’s a good point.

Allow people to use their script/alphabet of choice (Chinese, Tamil, Icelandic, Cuneiform, whatnot).

Agreed. Which is what we support. Miscellaneous Symbols do not fit that description, but I see your other point that it’s a little confusing that some ranges are currently excluded.

I’m not against including it (the range in the OP). However, I’m a little afraid that every few months we’re going to open this up again because some person’s favourite symbol isn’t allowed. I may be wrong about this, of course; perhaps this is the ‘last missing range’.

We’ve spent many months arriving at the current range; at some point we just have to settle and call it a day ;).

@ChristianSi
Contributor

@abelbraaksma: Yeah, just forbidding ⚌ (U+268C) and allowing everything else in that block would be fine with me as well.

@eksortso
Contributor

eksortso commented Jan 20, 2023

@abelbraaksma @ChristianSi I would much prefer that we include the Miscellaneous Symbols block but exclude the two-line yin and yang symbols U+268C to U+268F, as previously suggested, due to their resemblance to the equals sign.

unquoted-key-char =/ %x2600-268B / %x2690-26FF          ; include Miscellaneous Symbols, but exclude symbols resembling an equals sign

@arp242
Contributor Author

arp242 commented Jan 20, 2023

There are some other syntax-like homographs too:

'#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
'"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
'﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
'﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
'﹐' U+FE50     SMALL COMMA (Other_Punctuation)
'︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
'˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
'՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
'܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
'₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
'⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
'࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

That's from a quick visual inspection; not a full list. There's some more in the "Halfwidth and Fullwidth Forms" and "Small Form Variants" blocks in particular.

@abelbraaksma
Contributor

That’s an interesting list, but I don’t think we should try to be exhaustive here. There’ll always be certain glyphs that look confusing. Put in a ZWJ and you can create any glyph from smaller components.

@arp242
Contributor Author

arp242 commented Jan 20, 2023

Maybe we shouldn't allow ZWJ?

I've been going back and forth on what to do about all of this. While the original issue is "Not all emojis work as bare keys", it ties into other issues as well and there are knock-on effects.

We already allow almost everything as quoted keys. In hindsight, I think this was a mistake, but we can't change that now, and people don't use quoted keys that much since they're annoying to type (many TOML users probably don't even know they exist), so it's less of an issue in the real world. With bare keys, people will actually start using all the stuff that's allowed.

I'm worried that allowing too much will lead to confusion. Homoglyphs are actually not something I'm very worried about, since no reasonable person would use "# trollolol = 1" (U+FF03, not a "real" hash) other than maybe as a practical joke on your coworkers. No one really enters these things by accident. Other things that are explicitly excluded now, like the multiplication sign (×), aren't that much of an issue either: it's very similar to the letter "x", but no one enters "×" by accident when they intended to write "x".

I think it's fine to allow TOML users to do "stupid things", and it's okay to rely on TOML users being reasonably sane.

What I am worried about are "invisible" characters such as ZWJ, variation selectors, combining characters, and things like that. All of this is very non-obvious, and easy to get confused by, even for people well versed in how all of this works (i.e. you and me).

So while "# trollolol = 1" is certainly confusing, it's not really an issue that crops up in the real world. Same with U+268C-U+268F. I think this is almost a philosophical issue: "if a tree in a forest is confusing but no one sees the tree being confusing, then is it really confusing?"

So, back to ZWJ: if we disallow ZWJ lots of emojis won't work, and to be consistent we'd have to disallow at least the commonly used emojis like 😂 and whatnot, which would make the codepoint range a bit more complex.

However, in general, I'd say:

  • It is better to allow too little than too much, as we can correct this in the future.
  • Ease of authoring documents should be prioritized over ease of implementation. What this means here is that I'd rather have a more complex set of codepoints in the ABNF than end up with confused users. Not that ease of implementation isn't important, just less so.

So I'd say we probably shouldn't allow ZWJ, and variation selectors, and combining characters, and perhaps a few other things. Those are things that will lead to confusion, unlike ⚌, #, and whatnot. I don't actually care all that much about those because I don't expect anyone will be confused by it in real-world scenarios.
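
To make this concrete: a parser or linter could flag these "invisible" code points with a small helper along the lines of the sketch below (the name and exact scope are just for illustration; it covers ZWJ plus the two variation-selector blocks):

// hasInvisible reports whether key contains a zero-width joiner or a
// variation selector. Illustrative only; not part of any spec.
func hasInvisible(key string) bool {
	for _, r := range key {
		switch {
		case r == 0x200d: // U+200D ZERO WIDTH JOINER
			return true
		case r >= 0xfe00 && r <= 0xfe0f: // Variation Selectors
			return true
		case r >= 0xe0100 && r <= 0xe01ef: // Variation Selectors Supplement
			return true
		}
	}
	return false
}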

@abelbraaksma
Contributor

abelbraaksma commented Jan 20, 2023

ZWJ is used in many scripts to create valid characters, glyphs and words. It’s not exclusive to emojis. I don’t think having it is an issue. More the opposite. The whole idea here is to be inclusive wrt languages and scripts. The side effect of this approach is that some emojis also work, because they are in codepoint ranges not explicitly excluded, mainly because these ranges weren’t assigned to in older versions of Unicode.

By and large this should be fine. Identifiers will typically be expressed in someone’s native language or script, or in a common language like English, Arabic or Spanish. The need for dingbats or emojis is likely comparatively small.

I’ve no problem keeping the status quo, or adding other ranges, but whatever we do, there’ll always be new codepoints assigned, and they may or may not contain non-letter-like characters. These will always be in the already-allowed ranges, and therefore we cannot exclude them pre-emptively.

@ChristianSi
Contributor

ChristianSi commented Jan 21, 2023

I'm fine with including this range except for U+268C to U+268F (⚌ ⚍ ⚎ ⚏), as @eksortso favors (#954 (comment)). That'll be an easy and convenient solution.

Also I urge not to reopen the rest of the discussion about the allowed ranges. We have found a solution that allows unquoted keys in essentially any script, without burdening implementors with too much complexity. That's good, so we should just keep it that way!

@arp242
Contributor Author

arp242 commented Jan 21, 2023

ZWJ is used in many scripts to create valid characters, glyphs and words. It’s not exclusive to emojis. I don’t think having it is an issue. More the opposite. The whole idea here is to be inclusive wrt languages and scripts. The side effect of this approach is that some emojis also work, because they are in codepoint ranges not explicitly excluded, mainly because these ranges weren’t assigned to in older versions of Unicode.

You're right, I should have addressed that. TR31 has quite a bit of special handling for it, and the way I read it even allows excluding it. Go and C# outright disallow using it.

I can see how allowing ZWJ makes sense. It's not entirely clear to me if it's needed to correctly write these languages though, or if it's optional. My thinking is "better to include too little and correct that if need be".

Variation selectors are still an issue though. I don't think there's any good reason to include them, they are commonly inserted, and very invisible.

And combining characters introduce a lot of ambiguity in string equivalence, as brought up in #941. The more I think about it, the more I feel we should do our best to reduce the potential for ambiguity, as this would at least reduce the potential for confusion, as well as the need for NFC normalisation and a Unicode library (like ICU). Perhaps we can't eliminate it 100%, but just covering the common cases would already go a long way.
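
For context, NFC normalisation itself is a one-liner once a Unicode library is available; the cost being discussed is requiring such a dependency at all. A minimal sketch using the golang.org/x/text module, purely as an illustration:

package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	// "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT vs. the
	// precomposed U+00E9: byte-wise different, equal after NFC.
	decomposed := "Sinn_Fe\u0301in"
	precomposed := "Sinn_F\u00e9in"

	fmt.Println(decomposed == precomposed)                                   // false
	fmt.Println(norm.NFC.String(decomposed) == norm.NFC.String(precomposed)) // true
}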

Also I urge not to reopen the rest of the discussion about the allowed ranges. We have found a solution that allows unquoted keys in essentially any script, without burdening implementors with too much complexity. That's good, so we should just keep it that way!

None of these specific issues were brought up before, as far as I've seen. You can disagree it's an issue, and that is of course fine, but I'd never dismiss anyone like that.

@ChristianSi
Contributor

ChristianSi commented Jan 23, 2023

@arp242: Unicode normalization issues are irrelevant here, since they apply to quoted and unquoted keys in exactly the same way. In quoted keys, arbitrary Unicode is allowed and, of course, that's not going away – in fact, it can't go away since that would break backward compatibility.

And let's not roll back on our promise that "you can use unquoted keys representing words from arbitrary languages", which we have realized in the current state. In this regard, I found Some trivial knowledge about Unicode a good read. My takeaway from there: variation selectors are needed, at least, to write Mongolian correctly.

In that article the usage of ZWJ is mostly limited to emojis, but from Wikipedia I gather that it is needed to render text in various scripts (e.g. Arabic or Indic) correctly. You're right that the text will likely still be readable without this information (I guess?), but it'll look "broken" to people. Also relevant is that text editors will likely auto-insert these ZWJs where needed. So when we tell people "you can use bare keys in Arabic script, but only without ZWJs", this might well cause all kinds of parsing errors, since people will have a hard time writing keys in these scripts without this character appearing.

So yes, while you're right that it makes sense to discuss whether this character and the variation selector block should remain included, I'd still tend to say that yes, they should.

@abelbraaksma
Contributor

abelbraaksma commented Jan 23, 2023

I agree, they should. Better to be liberal in what you accept, esp when it comes to scripts in Unicode.

Go and C# outright disallow using it.

Wrt C#, this is only partially true. Identifiers in Common IL can be any codepoint, except a small handful, like NULL and FFEF, I believe. In F#, this rule is applied very liberally, and you can create identifiers in the full range Common IL allows. In C#, calling such identifiers requires a little extra work, but is still possible.

Let's not start limiting more. Either expand the ranges, or leave it as is. From the discussion above, I think the conclusion would lean towards inclusion of the extra range, as mentioned here: #954 (comment)

@arp242
Contributor Author

arp242 commented Jan 23, 2023

Unicode normalization issues are irrelevant here, since they apply to quoted and unquoted keys in exactly the same way

That is correct, but as I mentioned before, quoted keys aren't used all that much, so in practice it's much less of an issue there. That we need to be a bit more careful with bare keys is not controversial; otherwise we would just allow everything except [].="'.

Also relevant is that text editors will likely auto-insert these ZWJs where needed. So when we tell people "you can use bare keys in Arabic script, but only without ZWJs", this might well cause all kind of parsing errors, since people will have a hard time writing keys in these scripts without this character appearing.

Yeah, maybe; it's really hard for me to judge to what degree it's "needed" and "commonly used" and to what degree it's "a feature offered, but not commonly used".

I loaded the Arabic Wikipedia article on Mars (just a random featured/long article), and it seems to contain only a single ZWJ (in the title of the Simple English version). I checked a few other random articles and some pages with a lot of text on https://www.my.gov.sa after switching the language to Arabic, and couldn't find a ZWJ there either. Go certainly doesn't allow ZWJ (I tested that), and I can't find any complaints on the issue tracker, mailing list, or anywhere else.

That said, Go != TOML and the context for both is different, and TR-31 contains special rules for handling ZWJ, and I suppose this is an argument for both sides here: "ZWJ is needed in some contexts, so it must be allowed" as well as "ZWJ can be confusing, so we need to restrict where it can appear".

In conclusion: further research is needed, if it's decided to spend effort on this in the first place. The same applies to variation selectors.

Better to be liberal in what you accept, esp when it comes to scripts in Unicode.

Yeah, I don't really agree with that. "Postel's law" has been widely criticized over the years and I'm hardly the first or only one to disagree with it; I'd say it's fair to state that it's pretty controversial overall. It was framed in a very different context, and in a very different world; historically it made a bit more sense because standards were often written up after the implementations, standards were unclear or underspecified, it was harder to actually read the standards (so many didn't), "cowboy coding" was the norm, and so on. Much of that applies a lot less today, and IMHO it doesn't apply to TOML or Unicode.

But my main issue with this is: it doesn't really engage with my concern, which can be summarized as "I feel this has the potential to cause a great deal of confusion, so I think it's better to be conservative initially, and perhaps correct it later if need be". If you want to say "I don't think people will end up being confused" or "I think it's an okay trade-off that people will get confused" then fair enough, as that engages with the stated concerns. But this doesn't really.

From the discussion above, I think the conclusion would lean towards inclusion of the extra range

To be honest, I'd really like some other views on this as well; thus far only three people commented on this.

I realize you might think I'm stubborn and difficult here, but I promise you I'm really not trying to be. I spent a lot of time looking at this over the last few days (which also included considering "is it really worth everyone's time and energy banging on about this?"), and I think this has a huge potential to bite us and people using TOML in the ass. If it was only a matter of "I think doing it like this is nicer" or "I don't like it" I wouldn't have cared too much; I don't like to bikeshed over details, and generally "whatever works, as long as it's not completely atrocious" is fine with me.

Either way, I probably said everything I wanted to say, so I'll leave it at that for a while, giving other people a chance to catch up, reply, vote, etc.

@arp242
Contributor Author

arp242 commented Jan 23, 2023

Posting as a separate comment for votes, I think the core questions are essentially:

  1. Do we want to spend effort to reduce the potential for confusion?
  2. If so, how?

Point "2" has a lot of subpoints, but if the answer to "1" is a "no" then it's pointless to even discuss it.

People can vote on this comment (not using thumbs to avoid ambiguity):

  • 😄: "we should investigate reducing the potential for confusion" (investigating the exact details later).
  • 🚀: "it's essentially fine as it is, barring perhaps a few relatively minor details"

@abelbraaksma
Contributor

abelbraaksma commented Jan 26, 2023

I feel this has the potential to cause a great deal of confusion, so I think it's better to be conservative initially, and perhaps correct it later if need be

Yeah, but that sword cuts both ways: it’s similarly confusing if certain names cannot be expressed. If someone wants to use ZWJ, they will typically know what they’re doing. The majority of people will stay away from it, simply because it’s never come up when naming identifiers.

Quote:

The zero-width joiner (ZWJ) is a non-printing character used in the computerized typesetting of some complex scripts such as the Arabic script or any Indic script. Sometimes the Roman script is to be counted as complex, e.g. when using a Fraktur typeface. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.

@pradyunsg pradyunsg added this to the 1.1.0-rc0 milestone Feb 20, 2023
arp242 added a commit to arp242/toml that referenced this issue Sep 22, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and ANY solution is a
trade-off. That said, I do believe some trade-offs are better than
others, and after looking at a bunch of different options I believe this
is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is the strongest argument in favour of this and the biggest
  improvement: we can't really do anything wrong here in a way that we
  can't correct later. Being conservative is probably the right way
  forward.

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work", but "this
  character works fine, but this very similar one doesn't". This shows up in
  a number of things:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   #x = "commented ... or is it?"  # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  From the user's perspective this seems like a bug in the TOML parser.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the code, adding multibyte support in the first place will
  probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0. etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I went back and forth on
  this for a long time, and we could also use a more recent version. I
  feel this gives us a nice balance between reasonable interoperability
  and future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it's really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?).

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

[1]: Aside: I encountered this just the other day as I created a TOML
     file with all UK election results since 1945, which looks like:

         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]

     That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just
     wrote it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later. Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   #x = "commented ... or is it?"  # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  It should either allow everything or nothing. This in-between is just
  horrible. From the user's perspective this seems like a bug in the
  TOML parser, but it's not: it's a bug in the specification.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the code, adding multibyte support in the first place will
  probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0. etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I went back and forth on
  this for a long time, and we could also use a more recent version. I
  feel this gives us a nice balance between reasonable interoperability
  and future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it's really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?).

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later. Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   #x = "commented ... or is it?"  # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  It should either allow everything or nothing. This in-between is just
  horrible. From the user's perspective this seems like a bug in the
  TOML parser, but it's not: it's a bug in the specification.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code, adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0. etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I went back and forth on
  this for a long time, and we could also use a more recent version. I
  feel this gives us a nice balance between reasonable interoperability
  and future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it's really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and I've made it no secret that I feel the current
trade-off is a bad one. After looking at a bunch of different options I
believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later, unlike what we have now, which is "well I think it probably
  won't cause any problems, based on what these 5 European/American guys
  think, but if it does: we won't be able to correct it".

  Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   #x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all (especially outside of the
  Latin character range by the way, which shows the Euro/US bias in how
  it's written).

  People don't read specifications in great detail, nor should they.
  People try something and see if it works. Now it seems to work on
  first approximation, and then (possibly months or years later) it
  seems to "suddenly break". From the user's perspective this seems like
  a bug in the TOML parser, but it's not: it's a bug in the
  specification. It should either allow everything or nothing. This
  in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe in a key.

  Confusables is also an issue with different scripts (Latin and
  Cyrillic is well-known), but this is less of an issue since it's not
  syntax, and also something that's fundamentally unavoidable in any
  multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We
  discussed whether TOML keys are "strings" or "identifiers" last week
  in toml-lang#966, and while views differ (mostly because they're both),
  it seems to me that making them map *closer* is better. This is a
  minor issue, but it's nice.
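
As a rough illustration of the "simple to spec, simple to check" point
above (this is only a sketch, not part of the proposal: it assumes the
allowed set is the Unicode letter and digit categories plus ASCII "-"
and "_", whereas the proposal itself is written as explicit code point
ranges, so the two may differ at the edges):

    # Sketch only: category-based check, using Python's unicodedata
    # purely for brevity.
    import unicodedata

    ALLOWED_ASCII = {"-", "_"}
    ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

    def is_bare_key(key):
        return len(key) > 0 and all(
            c in ALLOWED_ASCII or unicodedata.category(c) in ALLOWED_CATEGORIES
            for c in key
        )

    print(is_bare_key("Sinn_Féin"))                           # True:  U+00E9 is a letter (Ll)
    print(is_bare_key(unicodedata.normalize("NFD", "Féin")))  # False: U+0301 is a combining mark (Mn)
    print(is_bare_key("🏴"))                                   # False: U+1F3F4 is a symbol (So)

The NFD example also shows why normalisation stops being an issue for
bare keys: the decomposed form is simply rejected.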

That does not mean it's perfect; as I mentioned, all solutions come with
trade-offs. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table (see
  the sketch after this list). You already need this with TOML 1.0; it's
  just that the range tables become larger.

  The downside is that it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code, adding multibyte support at all will be the
  harder part, with the range table being a minor one).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0; and so on. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I hesitated over this for a
  long time, and we could also use a more recent version. I feel this
  gives us a nice balance between reasonable interoperability and
  future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look, rather
  than adjusting TOML to what the tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it's really something the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?).
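
To make the range-table point from the first trade-off concrete, here's
a minimal sketch of the lookup; ALLOWED_RANGES below is just a
placeholder with the TOML 1.0 ASCII ranges plus two Latin-1 letter
ranges, where the real table would hold the ~716 auto-generated pairs:

    # Sketch only: no Unicode database needed, just a sorted table of
    # inclusive (start, stop) code point ranges and a binary search.
    import bisect

    ALLOWED_RANGES = [
        (0x2D, 0x2D),   # -
        (0x30, 0x39),   # 0-9
        (0x41, 0x5A),   # A-Z
        (0x5F, 0x5F),   # _
        (0x61, 0x7A),   # a-z
        (0xC0, 0xD6),   # À-Ö
        (0xD8, 0xF6),   # Ø-ö
        # ... remaining auto-generated ranges ...
    ]
    STARTS = [start for start, _ in ALLOWED_RANGES]

    def allowed(cp):
        # Find the last range starting at or before cp, then check its end.
        i = bisect.bisect_right(STARTS, cp) - 1
        return i >= 0 and ALLOWED_RANGES[i][0] <= cp <= ALLOWED_RANGES[i][1]

    def is_bare_key(key):
        return len(key) > 0 and all(allowed(ord(c)) for c in key)

Whether it's a binary search or a plain linear scan doesn't matter much
in practice; the point is only that the check is a table lookup, not
anything clever.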

Another solution I tried is restricting the code point ranges; I tried
this twice (with some months in between) and spent a long time looking
at Unicode blocks and ranges, and I found it impractical: we'd end up
with a long list that isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" (with quotes) seemed ugly, so
I just wrote it as Sinn_Fein, which is what most people seem to do;
under this proposal Sinn_Féin would simply work as a bare key.
@arp242 arp242 linked a pull request Sep 23, 2023 that will close this issue
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
@ChristianSi ChristianSi linked a pull request Oct 27, 2023 that will close this issue
@epage
Copy link

epage commented Feb 7, 2024

If this is one of the remaining blockers for 1.1.0-rc0, what if we instead defer bare keys to 1.2?

@eksortso
Copy link
Contributor

eksortso commented Feb 7, 2024

At this point, I'm inclined to agree, and to make extending the allowable bare keys a primary objective for the future TOML 1.2.0. No offense to all the hard work put forward to make this viable, but while @pradyunsg is still MIA, we should slim down for now to get him back here for a while, work out how to proceed with the standards project, and get 1.1.0-rc1 out the door.

So let's save this issue, and all the other open issues regarding the extension of bare keys, for after 1.1.0 is released, then hit it full-bore with the best solution we can devise, with a scheduled date for release and a dedicated core team surrounding the standard and dealing with day-to-day matters.

Sorry @ChristianSi, I know this is a bitter pill to swallow, but we've waited too long, and we know what else needs to be done right now.

@ChristianSi
Copy link
Contributor

ChristianSi commented Feb 9, 2024

I don't really see any serious blockers, neither this nor anything else. But, as @eksortso has already mentioned, there hasn't been a working maintainer for the last few months (at least), so the project is effectively stuck.

@eksortso If you have an idea on how to solve this, I'd be interested to hear it!

@abelbraaksma
Copy link
Contributor

@ChristianSi wasn't there an issue not so long ago about assigning a new maintainer? @pradyunsg, would you be open to adding another maintainer to the team?
