Change bare key characters to Letter and Digit #990

Open · wants to merge 2 commits into main

Conversation

arp242
Contributor

@arp242 arp242 commented Sep 23, 2023

I believe this would greatly improve things and solve most of the issues. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that anything will be a trade-off. That said, I do believe some trade-offs are better than others, and I've made it no secret that I feel the current trade-off is a bad one. After looking at a bunch of different options I believe this is by far the best path for TOML.

Advantages:

  • This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much.

    We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts.

    This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later, unlike what we have now, which is "well I think it probably won't cause any problems, based on what these 5 European/American guys think, but if it does: we won't be able to correct it".

    Being conservative for these types of things is good!

  • This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point.

    For quoted keys normalisation is mostly a non-issue because few people use them, which is why this has gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1]
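    For illustration (my example, not part of the PR), the normalisation problem that disallowing combining characters sidesteps can be shown with Python's unicodedata module:

    ```python
    import unicodedata

    # "é" has two encodings that render identically but differ as codepoints.
    composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

    print(composed == decomposed)  # False: byte-wise these are different keys
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    ```

    Since bare keys under this proposal cannot contain U+0301 at all, only the composed form can appear as a bare key, and the question of which form "wins" never arises.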

  • It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!"

    Note that Not all emojis work as bare keys #954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar one doesn't". This shows up in a number of things aside from emojis:

    a.toml:
            Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
            Error:   line 1: expected '.' or '=', but got ';' instead
    
    b.toml:
            Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
            Error:   (none)
    
    c.toml:
            Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
            Error:   line 1: expected '.' or '=', but got '–' instead
    
    d.toml:
            Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
            Error:   (none)
    
    e.toml:
            Input:   #x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
            Error:   (none)
    

    "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all (especially outside of the Latin character range by the way, which shows the Euro/US bias in how it's written).

    People don't read specifications in great detail, nor should they. People try something and see if it works. It seems to work on first approximation, and then (possibly months or years later) it seems to "suddenly break". From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. It should either allow everything or nothing. This in-between is confusing and horrible.

    There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't".

    In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrarily while allowing other very similar characters.

  • This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more:

    '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
    '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
    '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
    '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
    '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
    '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
    '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
    '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
    '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
    'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
    '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
    '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
    '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)
    

    Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving in an Armenian apostrophe.

    Confusables are also an issue across different scripts (Latin and Cyrillic being a well-known pair), but this is less of an issue since it's not syntax, and it's also fundamentally unavoidable in any multi-script environment.
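    For example (my illustration), the classic Latin/Cyrillic pair renders identically in most fonts but compares unequal, so two visually identical keys can coexist as data; no syntax rule can prevent that:

    ```python
    import unicodedata

    latin = "a"          # U+0061
    cyrillic = "\u0430"  # U+0430, visually identical in most fonts

    print(latin == cyrillic)           # False: distinct codepoints, distinct keys
    print(unicodedata.name(latin))     # LATIN SMALL LETTER A
    print(unicodedata.name(cyrillic))  # CYRILLIC SMALL LETTER A
    ```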

  • Maps closer to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in Clarify that key uniqueness depends only on binary representation, recommend normalization #966 and while views differ (mostly because they're both) it seems to me that making it map closer is better. This is a minor issue, but it's nice.

That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are:

  • The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present.

    However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger.

    The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code, adding multibyte support at all will be the harder part, with this range table being a minor part).
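    The range-table check described above can be sketched in a few lines of Python; the ranges here are a hypothetical excerpt for illustration, not the real auto-generated 716-entry table:

    ```python
    import bisect

    # Hypothetical excerpt of the "allowed characters" table: sorted,
    # non-overlapping, inclusive (start, stop) codepoint ranges.
    RANGES = [(0x2D, 0x2D), (0x30, 0x39), (0x41, 0x5A), (0x5F, 0x5F),
              (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6)]  # ... 716 in total
    STARTS = [start for start, _ in RANGES]

    def allowed(ch: str) -> bool:
        # Binary search for the last range starting at or before the codepoint.
        cp = ord(ch)
        i = bisect.bisect_right(STARTS, cp) - 1
        return i >= 0 and cp <= RANGES[i][1]

    def valid_bare_key(key: str) -> bool:
        return bool(key) and all(allowed(c) for c in key)

    print(valid_bare_key("Sinn_Féin"))  # True with this excerpt
    print(valid_bare_key("a b"))        # False: space is not in any range
    ```

    The logic stays identical regardless of table size; only the generated table grows.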

  • There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2; Go is Unicode 8.0; etc. I don't think this is really much of an issue in practice.

    I chose Unicode 9 as everyone supports it; I went back and forth on this for a long time, and we could also use a more recent version. I feel this strikes a nice balance between reasonable interoperability and future-proofing.

  • ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational".

    I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running into this limitation, and it's really something that the IETF should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds.

Fixes #954
Fixes #966
Fixes #979
Ref #687
Ref #891
Ref #941


[1]:
Aside: I encountered this just the other day when I created a TOML file with all UK election results since 1945, which looks like:

 [1950]
 Labour       = [13_266_176, 315, 617]
 Conservative = [12_492_404, 298, 619]
 Liberal      = [ 2_621_487,   9, 475]
 Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.

@pradyunsg
Member

pradyunsg commented Sep 23, 2023

  • This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much.

Instead of trying to define custom subsets, have you considered using the characters based on Unicode properties instead? I'm thinking of https://www.unicode.org/reports/tr44/#Alphabetic and https://www.unicode.org/reports/tr44/#Numeric_Type (=Decimal? or Digit?) for these.

The current set of characters ends up excluding multiple languages that don't use Latin/Latin-derived script (e.g. none of the Other_Alphabetic characters are included in this set, which are necessary components of various languages; I have an example below).

  • no "this character is allowed, but this very similar other thing isn't, what gives?!"

This is still the case. As an example I have handy to me: औरंगाबाद (Aurangabad) is a city in India. The codepoints it is composed of are (with name and category):

'औ'  U+0914  DEVANAGARI LETTER AU (Lo)
'र'  U+0930  DEVANAGARI LETTER RA (Lo)
'ं'  U+0902  DEVANAGARI SIGN ANUSVARA (Mn)
'ग'  U+0917  DEVANAGARI LETTER GA (Lo)
'ा'  U+093E  DEVANAGARI VOWEL SIGN AA (Mc)
'ब'  U+092C  DEVANAGARI LETTER BA (Lo)
'ा'  U+093E  DEVANAGARI VOWEL SIGN AA (Mc)
'द'  U+0926  DEVANAGARI LETTER DA (Lo)

The vowels and signs of the script end up being not in the current set (they're in Mc and Mn categories) while the letters are. It is subtle that रग is permitted but रंग is not.

How I printed the decomposed info about that string:
# yes, this uses newer Unicode databases, but this part of Unicode hasn't changed since 9.0
import unicodedata

s = "औरंगाबाद"
for c in s:
    print(repr(c))
    print(hex(ord(c))[2:].rjust(4, "0").upper())
    print(unicodedata.name(c))
    print(unicodedata.category(c))

@pradyunsg
Member

pradyunsg commented Sep 23, 2023

Taking a step back from the details, I think this is a good idea. TOML 1.0.0 used:

toml/toml.abnf

Line 50 in 8eae5e1

unquoted-key = 1*( ALPHA / DIGIT / %x2D / %x5F ) ; A-Z / a-z / 0-9 / - / _

Changing that to the somewhat equivalent seems like a reasonable approach to take...

unquoted-key = 1*( unicode-alphabetic / unicode-digit / %x2D / %x5F )

(I still need to review the normalisation discussion though, to see why that concern isn't resolved with a "implementations should normalise in a manner appropriate for their implementation language/context" or something similar)
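As a rough illustration of that production (my sketch, not spec text: Python's str.isalpha() covers the Letter categories Lu/Ll/Lt/Lm/Lo but not Other_Alphabetic, and str.isdigit() only approximates Numeric_Type=Digit):

```python
def is_bare_key_char(ch: str) -> bool:
    # Approximates: unquoted-key = 1*( unicode-alphabetic / unicode-digit / "-" / "_" )
    # Note: isalpha()/isdigit() only approximate the TR44 properties.
    return ch.isalpha() or ch.isdigit() or ch in "-_"

def is_bare_key(key: str) -> bool:
    return bool(key) and all(is_bare_key_char(c) for c in key)

print(is_bare_key("Sinn_Féin"))  # True
print(is_bare_key("a=b"))        # False
```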

@arp242
Contributor Author

arp242 commented Sep 23, 2023

Instead of trying to define custom subsets, have you considered using the characters based on Unicode properties instead? I'm thinking of https://www.unicode.org/reports/tr44/#Alphabetic and https://www.unicode.org/reports/tr44/#Numeric_Type (=Decimal? or Digit?) for these.

"Alphabetic" is just a list of categories plus the "Other_Alphabetic" property. This is similar to what we have now, except that Letter_Number (Nl) and Other_Alphabetic are included.

I don't think we need Letter_Number since all of this seems to be for historic scripts, although it also won't hurt to include it I guess.

We can just add Other_Alphabetic; looking through what that includes, it does have some combining characters: https://gist.github.com/arp242/183881717be3cde197d357bfe90d4541

But all of these seem to be without pre-composed forms, so that's not really an issue, and "there is only one way to represent a key, so normalisation is never needed" should still be preserved (I think...)

My main complaint is that it includes a bunch of a-z variants:

'Ⓐ'  U+24B6  9398   e2 92 b6    Ⓐ   CIRCLED LATIN CAPITAL LETTER A (Other_Symbol)
[...]
'ⓩ'  U+24E9  9449   e2 93 a9    ⓩ   CIRCLED LATIN SMALL LETTER Z (Other_Symbol)

'🄰'  U+1F130 127280 f0 9f 84 b0 🄰  SQUARED LATIN CAPITAL LETTER A (Other_Symbol)
[...]
'🅉'  U+1F149 127305 f0 9f 85 89 🅉  SQUARED LATIN CAPITAL LETTER Z (Other_Symbol)

'🅐'  U+1F150 127312 f0 9f 85 90 🅐  NEGATIVE CIRCLED LATIN CAPITAL LETTER A (Other_Symbol)
[...]
'🅩'  U+1F169 127337 f0 9f 85 a9 🅩  NEGATIVE CIRCLED LATIN CAPITAL LETTER Z (Other_Symbol)

'🅰'  U+1F170 127344 f0 9f 85 b0 🅰  NEGATIVE SQUARED LATIN CAPITAL LETTER A (Other_Symbol)
[...]
'🆉'  U+1F189 127369 f0 9f 86 89 🆉  NEGATIVE SQUARED LATIN CAPITAL LETTER Z (Other_Symbol)

But I guess that's rare enough of a thing we don't need to worry about it.

One issue is that Other_Alphabetic is harder to check than a category; e.g. Python's unicodedata doesn't have anything for this as far as I can see, so you'll need something like unicodedata.category(char) in ['Ll', 'Lu', ...] or other_alphabetic(char) where other_alphabetic() checks if it's included in the range.

I'm not sure if "these categories + Other_Alphabetic property" or "Alphabetic property" is easier; I think the first one probably is?
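A sketch of the "categories plus Other_Alphabetic" variant of that check (the Other_Alphabetic ranges below are a tiny illustrative excerpt; the real list would be generated from the Unicode data files):

```python
import unicodedata

LETTER_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo"}

# Illustrative excerpt only: two Devanagari Other_Alphabetic ranges
# (anusvara/visarga and the vowel signs).
OTHER_ALPHABETIC = [(0x0902, 0x0903), (0x093E, 0x094C)]

def other_alphabetic(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in OTHER_ALPHABETIC)

def alphabetic(ch: str) -> bool:
    return unicodedata.category(ch) in LETTER_CATS or other_alphabetic(ch)

print(alphabetic("र"))       # True: category Lo
print(alphabetic("\u0902"))  # True: category Mn, but Other_Alphabetic
print(alphabetic("="))       # False
```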

Per comments; also commit the script used to generate the ABNF ranges.
Probably want to replace that with something less, ehm, crap, so it's
easier for people to modify and run ... This was just quick and easy to
write for me now.
@erbsland-dev

On the Pitfalls of Including Unicode Character Classes in TOML Specs

Incorporating Unicode character classes into the TOML specification is more problematic than simply defining code-point ranges, and here's why: Code-point ranges are fixed values, while Unicode character classes are intrinsically tied to a given version of the Unicode Standard. This means that character classes are ever-evolving entities, updated with each new Unicode release.

Tethering the TOML spec to such a mutable component as Unicode character classes creates an undesirable dependency. Every time the Unicode Standard updates, it would essentially trigger a change in the TOML specification, whether we like it or not.

In practical terms, anyone building a TOML parser would either have to rely on bulky libraries like ICU to keep up-to-date with these character classes, or continually revise their parser to align with each new Unicode version. Neither of these options is particularly appealing or efficient.

In summary: Let's absolutely incorporate the necessary code-points to make TOML as inclusive as possible for various scripts. But let's steer clear of adding Unicode character classes to the spec, to avoid creating an unnecessary and burdensome dependency.

@arp242
Contributor Author

arp242 commented Sep 24, 2023

This means that character classes are ever-evolving entities, updated with each new Unicode release.

Existing codepoints are basically never changed, and only new ones are added. Lots of stuff runs on old Unicode versions.

Practically speaking, this is a non-issue, and the only problem you might run into is that an implementation won't support the codepoint you want. However, the new codepoints that have been added tend to be obscure and not commonly used, and the "or later" covers this.

You really don't need an ICU library, nor do you have to "continually revise [your] parser". In fact, lots of systems run on years-old ICU libraries, which is effectively the same, or have some specific Unicode version ossified in the spec (without "or later").

@erbsland-dev

I acknowledge that Unicode code-points are rarely, if ever, removed from character classes. However, it's worth noting that they do get added, particularly to the Letter and Number categories.

The issue I see here is that the flexibility in Unicode character classes can complicate validation tests for TOML parsers. Imagine you have a test suite that validates a parser this year; it might not produce the same results next year if the Unicode Standard versions of the parser and the test suite diverge.

So, if we're going down the road of incorporating character classes, it's imperative to also specify a minimum Unicode Standard version that's supported. In this way, tests should only hold parsers accountable for character classes as defined in that specific Unicode version.

Additionally, given that reserved code-points will be introduced in future Unicode updates, tests should be prepared for ambiguity. Specifically, they shouldn't mark a parser as 'failed' just because it accommodates a code-point from a future Unicode Standard.

In summary, if we must include character classes, we should also commit to a minimum Unicode Standard version. This adds a layer of complexity and leaves some room for ambiguity concerning which characters are considered valid, but it's a more manageable approach.

@arp242
Contributor Author

arp242 commented Sep 24, 2023

it's imperative to also specify a minimum Unicode Standard version that's supported.

That's what it does already; see the diff and commit message.

they do get added, particularly to the Letter and Number categories.

If it's not in Unicode then lots of stuff already won't work, and if it gets added today it will take a few years for the world (including TOML) to catch up. That's fine.

It's really not a practical issue people will run into, or at least not more so than anything else, and "or later" covers this.

The issue I see here is that the flexibility in Unicode character classes can complicate validation tests for TOML parsers. Imagine you have a test suite that validates a parser this year; it might not produce the same results next year if the Unicode Standard for both the parser and the test suite diverges.

"Flexibility" and "diverges" is rather overstating it; stuff isn't going to get randomly assigned, re-assigned, or unassigned, and many aspects have been very stable for many years. It's more or less "append only", and what is or isn't a "letter" or "digit" is not some hard cutting-edge problem, but fairly well established.

Perhaps an implementation might add an unwise test such as checking for a random unassigned codepoint, but that's just not a smart thing to do, and toml-test will include tests so writing these tests yourself isn't even needed.

I just don't see how this can be a practical issue. And even if it is: the fix is so trivial it's just not worth worrying about.

@erbsland-dev

That's what it does already; see the diff and commit message.

My apologies for overlooking the inclusion of a minimal Unicode version in the diff and commit message—that indeed addresses the core of my previous concerns.

If there's a specific Unicode version that serves as the benchmark, then implementors have a stable foundation to build upon. This eliminates the risk of a parser becoming outdated due to Unicode updates, as long as it aligns with the stated minimal version.

In light of this new information, I have no reservations about incorporating Unicode character classes into the TOML spec.

@ChristianSi
Contributor

ChristianSi commented Sep 24, 2023

While I'm not opposed to Unicode properties as such, I have two problems with this approach:

(1) As currently written, it would remove the possibility of using arbitrary words in any language as bare keys, since combining characters (Unicode categories Mc and Mn, particularly) are not allowed. I suppose that's intentional, since @arp242 hopes in this way to implicitly force NFC normalization on bare keys? Right?

When we first added non-ASCII letters to bare keys (#687), the initial proposal was indeed very similar to this one, except that it also included the rule:

Allow codepoints from categories Mc and Mn anywhere in a bare key except as the first character.

And that, while it goes against the idea of "automatically enforced normalization", is indeed necessary to allow arbitrary words in arbitrary languages as bare keys, which is what #687 was all about. For example, the g̃ used in the Guarani alphabet is encoded as g with tilde using a combining diacritical mark (U+0303 ◌̃ COMBINING TILDE), rather than a precomposed character. The Navajo language also uses several letters, such as į́ (i with ogonek below and acute above), which don't exist in precomposed form.

Generally, the Unicode people make it very clear that they don't add new characters in precomposed form if a combination including combining marks is already available, see the question Q: Unicode doesn't contain the character I need, which is a Latin letter with a certain diacritical mark. Can you add it? You can read the answer yourself, but the gist of it is that if a letter can already be expressed as "base letter" followed by one or more "combining diacritical marks", no new precomposed character will be added for it, since the combination can be used just fine.

So no, we cannot throw the combining marks out, just as much as we would like to.
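This is easy to verify with Python's unicodedata: NFC cannot compose the Guarani g̃, because no precomposed codepoint exists, while é does compose:

```python
import unicodedata

g_tilde = "g\u0303"  # 'g' + COMBINING TILDE: the Guarani letter g̃
e_acute = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT

# No precomposed g-with-tilde exists, so NFC leaves it as two codepoints:
print(len(unicodedata.normalize("NFC", g_tilde)))  # 2
# é has a precomposed form (U+00E9), so NFC composes it to one:
print(len(unicodedata.normalize("NFC", e_acute)))  # 1
```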

(2) My second, much smaller reservation is that #687, as it was finally merged, is the much simpler solution. We had started with a Unicode approach close to this one, but @abelbraaksma then argued for the simpler range-based approach also used in the XML definition, and apparently convinced sufficiently many people, myself included, that that's the better way to go. Consider the 9 lines the character ranges have in the current ABNF compared to the 215 lines they would have if this proposal was accepted.

@ChristianSi
Contributor

ChristianSi commented Sep 24, 2023

@pradyunsg's point about औरंगाबाद (Aurangabad) is, of course, essentially the same as mine. It hasn't yet been addressed, as far as I can see. Perhaps (I don't know) it is possible to write Devanagari in completely precomposed form, but if the software people typically use prefers combining marks, that still wouldn't help them: they would still run into incomprehensible errors when trying to use arbitrary words written in Devanagari as bare keys.

@arp242
Contributor Author

arp242 commented Sep 24, 2023

औरंगाबाद is fixed by adding Other_Alphabetic; there's a bunch of other combining marks in there, and I believe this should cover what's needed. I'm not entirely sure if some of these also have a pre-composed form, but many don't (and if there isn't any pre-composed form, then "there is only one way to represent it" is retained). For example, for Devanagari there are no pre-composed characters with the Anusvara or the aa vowel sign used in औरंगाबाद.

I find it difficult to get good information on this; I really need to write a script to get a good analysis and I'll have to get back to you on that.


Guarani and Navajo will be harder :-( But ... maybe we should just release and see what happens?

  1. People complain, and we'll cross that bridge when we get there (if ever).
  2. People don't complain, and all is good.

The thing is, with this proposal we can always change course if we make a mistake; other than the Other_Alphabetic property I added after feedback (which I'm not 100% sure we should fully include by the way), we undoubtedly need everything else, so we can always make compatible changes. In fact, we could literally release this proposal as TOML 1.1, and release the current main as TOML 1.2 (but we can't do the reverse), or even do something different from both.

To be honest I wouldn't be surprised if this feature sees very little uptake, with the main usage being the odd diacritic in Latin like Sinn_Féin and fußball, and maybe some of the "major" languages like Chinese. In that case all these discussions would have been mostly for naught.

Or maybe people will start using this a lot; perhaps in ways we didn't anticipate. That's also entirely possible. This is really my main objection against the current state: we just don't know what people will do, and will work well, what will and won't be a practical problem that's encountered, or what people will or won't be confused about. And if it turns out we made a mistake, we can't easily correct it without breaking compatibility.

We don't really know, and we can't really know without real-world experience. We don't really need to solve every possible detail here in one go; someone currently using quoted keys for their Navajo will have to continue using quoted keys a bit longer.

So yeah, maybe just "release and see what happens"? It's certainly a bit of a downside/trade-off, but seems like a reasonable one, especially since the real-world experience will allow us to make a more informed decision later on, if need be.


As for the ABNF: this doesn't fill me with joy either. But I also think it's unimportant; I feel people have been too focused on making the ABNF "nice", but the ABNF file is not an art project and should serve the practical goals we have for TOML, one way or the other. 17 people have touched that file since 2014; it's slightly inconvenient for a small group of people. TOML will likely be around for decades, and "the ABNF looks a bit ugly" is both minor and fixable: maybe ABNF gains Unicode support next year, or we switch to NBNF which has Unicode, or whatever.

For now, unicode-letter / unicode-digit is clear enough for readers, and when people copy/paste to their tool they don't care or notice if they copy/paste a few extra lines.

(I'm also fine with the comments that I had in my original proposal before I amended it after Pradyun's feedback; that was also discussed as an option before)

@marzer
Contributor

marzer commented Sep 24, 2023

This seems like a good middle-ground to me. It is very close to the original logic proposed in #687, so I already have lookups implemented in TOML++ in a very-nearly-conforming manner; that's a plus. My only note:

based on what these 5 European/American guys think

ahem and at least one Australian 🇦🇺 😅

@ChristianSi
Contributor

@arp242: Is औरंगाबाद really covered by the Other_Alphabetic property? I must say I'm a bit stumped by what Other_Alphabetic even is and what it's meant to be used for. My googling power has somehow failed me here. But I thought the gist file you posted contains all characters that have this property? If so, that would be insufficient – unless I'm blind, there aren't any DEVANAGARI marks there (DEVANAGARI SIGN ANUSVARA and DEVANAGARI VOWEL SIGN AA are two mentioned by @pradyunsg, doubtless there are others).

@pradyunsg
Member

pradyunsg commented Sep 24, 2023

https://www.unicode.org/Public/14.0.0/ucd/PropList.txt (or the 9.0.0 link) is likely the best place to check whether things are in the list.

IIRC, Devanagari relies on both Mn/Mc category characters, and the relevant codepoints get the Other_Alphabetic property applied to them as well.

@arp242
Contributor Author

arp242 commented Sep 25, 2023

Looks like something went wrong copy/pasting into that gist file; actually, pasting in gist seems pretty broken in general because I can't get the damn thing fixed (leave it to the frontend people to break pasting text...), so I put it here: https://pastebin.com/p1ty4NXn

I use my uni tool for this kind of stuff by the way; for example to show that Other_Alphabetic includes enough for औरंगाबाद:

% uni id औरंगाबाद
	 CPoint  Dec    UTF8        HTML       Name (Cat)
'औ'  U+0914  2324   e0 a4 94    औ    DEVANAGARI LETTER AU (Other_Letter)
'र'  U+0930  2352   e0 a4 b0    र    DEVANAGARI LETTER RA (Other_Letter)
'◌'  U+0902  2306   e0 a4 82    ं    DEVANAGARI SIGN ANUSVARA (Nonspacing_Mark)
'ग'  U+0917  2327   e0 a4 97    ग    DEVANAGARI LETTER GA (Other_Letter)
'◌ा'  U+093E  2366   e0 a4 be    ा    DEVANAGARI VOWEL SIGN AA (Spacing_Mark)
'ब'  U+092C  2348   e0 a4 ac    ब    DEVANAGARI LETTER BA (Other_Letter)
'◌ा'  U+093E  2366   e0 a4 be    ा    DEVANAGARI VOWEL SIGN AA (Spacing_Mark)
'द'  U+0926  2342   e0 a4 a6    द    DEVANAGARI LETTER DA (Other_Letter)

% uni print Other_Alphabet | egrep '(0902|093E)'
'◌'  U+0902  2306   e0 a4 82    ं    DEVANAGARI SIGN ANUSVARA (Nonspacing_Mark)
'◌ा'  U+093E  2366   e0 a4 be    ा    DEVANAGARI VOWEL SIGN AA (Spacing_Mark)

(or use uni id -f '%(props) → %(name)', but I usually find grep easier).

That pastebin is just the output of uni print Other_Alphabet.

In general it's a reasonably handy frontend for the Unicode database, or at least it is for me, if you're the sort of person who likes commandline tools anyway.
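For those without `uni` installed, roughly the same lookup can be done with Python's stdlib (note: `unicodedata` exposes general categories and names, but not the Other_Alphabetic property, which lives in PropList.txt):

```python
import unicodedata

# print code point, general category, and name for each character
for ch in "औरंगाबाद":
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")

# the two combining marks show up as Mn (Nonspacing_Mark) and Mc (Spacing_Mark):
#   U+0902  Mn  DEVANAGARI SIGN ANUSVARA
#   U+093E  Mc  DEVANAGARI VOWEL SIGN AA
```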

@wjordan

wjordan commented Sep 25, 2023

(with apologies for jumping in with what might be an obvious comment to those who have been discussing this for years with lots of built-up context)

Why go through such contortions to define a bespoke identifier syntax, instead of using the default Unicode identifier syntax (essentially XID_Start XID_Continue*), with a few exceptions to support existing TOML (0-9, -, and _)?

Is the only reason that the Unicode identifier set includes marks that may or may not be fully NFC-normalized (the NFC_Quick_Check=Maybe set), and some TOML-parser implementations don't want to implement the Unicode Normalization Algorithm (or depend on any library that implements it), so you're reluctant to require NFC-form identifiers in the spec?

Personally, I think that if TOML wants to support Unicode, it's not unreasonable to expect parsers to support Unicode as well. However, an approximation of NFC-validation by filtering out NFC_Quick_Check=No characters from the identifier set might be a reasonable compromise. The spec could say that NFC identifiers are required, but allow for some minimal TOML parsers to rely on a less-than-perfect quick-check NFC validation (just like it's currently allowed for TOML parsers to differ in their support for greater-than-millisecond time precision), while more accurate TOML parsers validate NFC with the full normalization algorithm.
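For what it's worth, an exact NFC check is a one-liner in Python ≥ 3.8 (the function name here is made up for illustration):

```python
import unicodedata

def bare_key_is_nfc(key: str) -> bool:
    # exact NFC validation; CPython takes the quick-check fast path
    # internally and only normalizes when the quick check says "maybe"
    return unicodedata.is_normalized("NFC", key)

assert bare_key_is_nfc("caf\u00e9")       # precomposed é: already NFC
assert not bare_key_is_nfc("cafe\u0301")  # e + combining acute: not NFC
```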

Also: if you want a minimal set of security-conscious characters, you could additionally filter the identifier set with the Allowed property, which removes a bunch of obsolete/unused scripts and confusable characters (including NFKC_Quick_Check=No, or characters that would never appear in NFKC form).

@marzer
Contributor

marzer commented Sep 25, 2023

@wjordan

Is the only reason that the Unicode identifier set includes marks that may or may not be fully NFC-normalized (the NFC_Quick_Check=Maybe set), and some TOML-parser implementations don't want to implement the Unicode Normalization Algorithm (or depend on any library that implements it), so you're reluctant to require NFC-form identifiers in the spec?

Short answer: yes, that's the main reason. See this discussion for context: #966. You may also wish to visit the initial issue and subsequent pull request that got us here.

@arp242
Contributor Author

arp242 commented Sep 25, 2023

And #941

@wjordan

wjordan commented Sep 25, 2023

Short answer: yes, that's the main reason.

Thanks for the links. NFC validation doesn't seem to me to be as terrible as it's been made out to be in those previous discussions:

  1. There are slimmer-than-ICU Unicode implementations such as utf8proc (800 lines of C) and unilib that should be more suitable for statically-linked/embedded use-cases.
  2. The normalization algorithm is not trivial, but it's also not impossibly complex or resource-intensive: it should only add ~100 lines of code and ~17 KB of memory.
  3. NFC quick-check is a simple table lookup (equivalent to filtering out non-NFC characters from an allowed list of code-points) that's a decent approximation to full NFC validation.
  • You could get an even-better approximation by rejecting any sequence of code-points that matches a canonical decomposition including any of the NFC_Quick_Check=Maybe code-points, which would also be a simple regex or table-lookup. (I found 951 of these in the latest Unicode data.) This additional check would catch non-NFC identifiers like café and ñaña for example.
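That last approximation can be probed with a canonical-composition test (a sketch using Python's stdlib; a real parser would precompute a table from the Unicode data files instead):

```python
import unicodedata

def pair_composes(base: str, mark: str) -> bool:
    # if base+mark canonically composes into fewer code points, the
    # uncomposed sequence is not NFC and could be rejected up front
    return len(unicodedata.normalize("NFC", base + mark)) < len(base + mark)

assert pair_composes("e", "\u0301")           # é exists precomposed
assert not pair_composes("\u0917", "\u0902")  # गं has no precomposed form
```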

@marzer
Contributor

marzer commented Sep 26, 2023

Sorry @wjordan but I disagree with you that "only 17kb of memory" is fine; for some embedded environments that's absolutely a deal-breaker (I maintain a C++ TOML library and many of my users are embedded). I'm not going to implement any form of normalisation.

There were also other reasons that were unrelated to implementation difficulty, too.

In a more meta sense, I should point out we've debated normalisation a great deal over the last year or so, and have only just reached a consensus of sorts. I very much hope the ship has sailed.

@arp242
Contributor Author

arp242 commented Sep 26, 2023

"We want normalisation" is also something we could reconsider if we spec things so that only one form is allowed, making normalisation never (or extremely rarely) needed. I have no plans for this, nor expectations that we'll need it; it's just nice to have the option, and keeping doors open is good in case we reconsider things in 10 or 15 years or whatever.

And I think you're making fine points @wjordan, it's just ... there's been a lot of discussions about this (which are still on-going), and they've been difficult at times, and I think people have become a bit tired of it all 😅

@wjordan

wjordan commented Sep 26, 2023

I understand you're all exhausted by normalization talk, but the stiff resistance to adopting any non-trivial isNFC() test (or allowing minimal implementations to use simple approximations) is still the sticking point at issue in this PR, isn't it?

To recap: There are 111 combining characters (NFC_Quick_Check=Maybe, mostly Mc/Mn spacing/nonspacing marks) that compose with other characters but are also sometimes in NFC words because equivalent precomposed characters don't exist for all combinations. The approach in this PR is effectively to exclude these characters to ensure any NFC check is not necessary, since any expression formed with only NFC_Quick_Check=Yes characters will be valid NFC.

As pointed out already, these combining marks are required in some Guarani and Navajo expressions, but my related, broader concern is that this will certainly affect other languages, and we don't have any idea or estimate of the total impact. (What other languages/words out there require U+0300 COMBINING GRAVE ACCENT? I have no idea – is it a whole lot?)

This concern would be avoided by including the missing combining characters (effectively adopting UAX31 identifiers), and doing the bit of extra work to validate that those combining characters are used in NFC-valid contexts. This would make the spec much simpler, and more aligned with programming languages that have already adopted similar identifiers. But if you've already closed off discussion on that point and need to avoid any non-trivial Unicode normal-form logic whatsoever in order to ensure 100% compatibility with minimal embedded implementations of TOML, my preference would be to back out Unicode support entirely (#979) rather than offer incomplete Unicode-identifier support that rejects valid identifiers in an unknown number of languages.

@ChristianSi
Contributor

ChristianSi commented Sep 26, 2023

While I appreciate the thought @arp242 has put into this proposal, I too see it as a step in the wrong direction compared to what we already have in the current main branch. If we move away from @abelbraaksma's simple and robust proposal towards a Unicode-category-based solution, we should do it properly, that is, as described in the first comment of #687: with combining characters (categories Mc and Mn) allowed anywhere in a bare key (except as the first character), instead of the clever but incomplete hack of using the Other_Alphabetic property.

If we go with this proposal, we break Unicode, since Unicode was never meant to be usable without combining characters. They are an integral part of Unicode's concept of "letters". With this proposal, we could no longer say "now you can use arbitrary letters in bare keys, no matter what language", but only "you can use some letters, good for some languages" – and if people then ask whether it's good enough for their language, we'll probably have to respond: "we don't know, ask your local Unicode experts, but the smaller your language is, the bigger the chances that it's not".

I'm sorry, but that's just not good enough. We had promised to allow bare keys in arbitrary languages, now let's deliver. Especially since one very well thought-out solution is ready, developed by @abelbraaksma who I suspect knows more about Unicode than all the rest of us together.

Normalization is nothing we have to worry about, since we have already all but decided that we won't do it. @abelbraaksma's solution is very similar to what's allowed in XML names, and while XML doesn't normalize either (by default), in 25 years of XML history I haven't heard any complaints about that. JSON doesn't normalize either and still allows arbitrary strings as keys, and I haven't heard any complaints about that either.

So, let's not break things out of worry about a problem that doesn't even exist in the first place.

@arp242
Contributor Author

arp242 commented Sep 26, 2023

The current main also "breaks" Unicode, just in a different way, with different failure modes and edge cases. I don't consider it any more "robust" than this; actually, I consider it much less robust, as it pretty much closes the door to amending or fixing things in future revisions.

TOML is not XML; I don't think we can really draw any strong lessons from it, not without real-world experience anyway. For XML I have no idea whether 1) people actually use this in the first place, 2) how, and 3) whether it worked out well for them. And even then: XML is much more of an interchange format between systems than a human-edited format like TOML.

Also remember the XML spec is from 2005 and that things have changed since then. This is why only a subset of symbols works: it goes to some effort to exclude the blocks that were defined in 2005, but symbols and emoji from the SMP are all allowed (the SMP wasn't really in use in 2005 – also see MySQL's infamous "utf8" support).

The XML spec is showing its age; the authors couldn't have predicted the future so that's only to be expected, but we shouldn't copy aspects that are showing their age almost 20 years later (and we also don't know if it actually worked out well for XML authors using non-ASCII in the first place). If anything, I feel the lesson is that their approach comes with serious downsides, because none of us know exactly what Unicode 35 will look like.

In short, just remove the restriction (and allow much more) or update it to include all the new symbols. Leaving it in place half-arsed is the worst option.

And as far as I'm concerned, "allow everything except syntax (=[]{}.), control characters, and maybe a few other symbols reserved for future use" is far preferable to what we have now, as it will at least fix the consistency issues. This is what e.g. YAML does.

As I mentioned, I don't think there are any "perfect" choices necessarily, just "better" and "worse" trade-offs, but the current version is by far my least favourite trade-off and I feel almost any other option is better.

let's not break things out of worry about a problem that doesn't even exist in the first place.

No one is using this feature, so it's not surprising there isn't an existing problem. I think the current approach has a lot of potential for problems, but of course I can't be sure about this: no one can, not without real-world experience. This is why I keep banging on about "we can always adjust later" as a major advantage of this. The author in #989 was right that we should just release and see what happens to get real-world feedback; it's just that with the current proposal we can't actually do that because once we release it we're more or less stuck with it (the only direction we can meaningfully go to is "allow almost everything").

@ChristianSi
Contributor

ChristianSi commented Sep 29, 2023

@arp242:

The current main also "breaks" Unicode, just in a different way, with different failure modes and edge cases.

You just say that, but without any evidence to back it up. We never promised to allow only letters in bare keys, hence we don't break anything if we also allow some other stuff.

Also remember the XML spec is from 2005 and that things have changed since then

Actually, XML 1.0 is from 1998; I don't think the XML names definition has significantly changed since then. But we didn't just blindly copy it: there was a lot of discussion about whether this is the right approach and about getting the details right in #687 and #891, as you should remember – after all, you participated in those discussions too.

we should just release and see what happens to get real-world feedback; it's just that with the current proposal we can't actually do that because once we release it we're more or less stuck with it (the only direction we can meaningfully go to is "allow almost everything")

I think that we might want to recheck and possibly revise things in TOML 1.2 (or maybe even later) is a valid concern. But I don't think we need to re-open the careful and slow consensus process (cf. the hundreds of comments that went into #687 and #891) because of that. Instead let's just add a sentence such as:

Note: The support for non-ASCII characters in bare keys is currently experimental. While future versions of TOML will continue to support a wide range of non-ASCII letters in bare keys, the exact details of the supported character ranges may change and it is possible that future versions of the spec will exclude code points that are currently allowed.

I propose to ship TOML 1.1 with this language (or something close to it). In this way, we keep the future open and could, if the need really arises, still switch to a Unicode-category-based implementation in TOML 1.2 or do some other adjustments, without breaking any SemVer promises.

@arp242
Contributor Author

arp242 commented Sep 29, 2023

there was a lot of discussion about whether this is the right approach and about getting the details right in #687 and #891, as you should remember too – after all, you participated in these discussions too.

[..]

But I don't think we need to re-open the careful and slow consensus process (cf. the hundreds of comments that went into #687 and #891) because of that.

"This is actually really inconsistent" wasn't brought up. I brought it up later in #954. Normalisation also wasn't discussed at all.

And sure, there were lots of comments, but I also just gave up commenting in #891 and unsubscribed. I figured "I think this is bad, but I can live with it, I guess", and people didn't seem especially receptive to actually questioning the entire approach (in particular after your comment, which I interpreted as "this is the approach we want to take, so stop saying it's bad").

And granted, I was late to the discussion, but people discussed things in #687 between Dec 2019 and July 2020. Surely "you should have commented in that 6 month window or forever be silent" can't be the way things go? Especially not if additional issues that were never even brought up are raised later?

@ChristianSi
Contributor

ChristianSi commented Oct 1, 2023

@arp242: I can understand you're frustrated, since you opened #954 long ago and not much has happened there for a long time. However, I suggested a solution based on a proposal by @eksortso, and @abelbraaksma supported it too: #954 (comment). I can't remember hearing any objections, so I'd still say that's a good way to do it. Maybe I can prepare a PR for it one of these days.

As for normalization, that affects bare and quoted keys in exactly the same way, so we need to address it anyway. We'd even have to address it if we had no bare keys at all, like JSON. And we have a solution ready to be adopted, the same one as used by JSON too.

Other than that, what do you think of my proposal above to declare the relaxed rules for bare keys as experimental? That should address your main concern that we get stuck with something we might later want to revise, right?

@ChristianSi
Contributor

@arp242: Also, thanks for the link to your uni tool. That's really useful!

@arp242
Contributor Author

arp242 commented Oct 1, 2023

As for normalization, that affects bare and quoted keys in exactly the same way, so we need to address it anyway. We'd even have to address it if we had no bare keys at all

I agree, but few people use quoted keys so it's not really that much of a practical issue.

what do you think of my proposal above to declare the relaxed rules for bare keys as experimental? That should address your main concern that we get stuck with something we might later want to revise, right?

I'm not a huge fan, to be honest. My issue is that for me, "compatibility" means "it doesn't break people's files". That it's marked "experimental" is something most people won't see, so in practice it will break files.

And people don't update their implementations, dependencies, files, etc. right away, so it might be quite a while before we get meaningful feedback.

So I would personally be very hesitant to revert any "experimental" feature, especially one as user-facing as this. I don't see myself being in favour of that, even if I think it's not a good feature, unless there really is an overwhelming amount of problems with it.

@abelbraaksma
Contributor

I haven't followed the whole discussion here, but I think the original post / PR is about using categories (Letter and Digit) from Unicode. I've strongly opposed this idea before and I will continue doing so, as it means versioning complexity w.r.t. Unicode: i.e., which version of Unicode are you going to require implementations to support, and how will that work for implementation libs that do not have good Unicode support to begin with?

Furthermore, there have been instances where codepoints have been moved from one category to another, which causes yet other issues.

"Be liberal in what you accept" was a prime goal when writing the original change that led to the inclusion of a wide range of international codepoints. There will always be characters that can be confusing, but that is a judgment we should leave to the writer of the TOML files. Who are we to decide what is and what is not a legible key character?

Another argument for the way we did it was to be forward compatible (i.e., any unassigned Unicode code blocks are allowed by default; somewhere above someone claimed the opposite, but that's simply not true, unless there's a bug). Using any version with Letter/Digit categories is not going to be that. Plus it will introduce conflicts between implementations.

We tried to use the lessons learned from other standards that went through the same process and often regretted earlier decisions (i.e., allowing Letter/Digit and tying the standard to a minimal Unicode, introducing compatibility issues).

Anyway, this is an open standard (well, just about) and if there is consensus to move in a different direction, I am certainly not going to stand in the way 😆. As the OP mentioned, there's no "perfect" solution here and each approach has its up and downsides.

@eksortso
Contributor

I've just barely followed the discussion about what to allow for bare keys, except when it was first being discussed, and my only contribution was noting that adding what we had at the time would triple the length of toml.abnf. And I did suggest we leave out characters that looked like the equal sign. But aside from those things, I know the conversation is in capable hands, however it trended. I intend to keep out of recent arguments about bare keys, knowing my limits.

That said, I do remember that we did lift a page from XML and allowed for a broad swath of code points that would satisfy an international user base. And we did reject using Unicode classes, because our standard would fluctuate a lot based on whichever version of Unicode we adhered to.

But let's set all that aside for a bit. We have tools to automate ABNF generation, and it would fall upon us to use those tools consistently with every release if we decided to track Unicode classes. The use of those tools, then, must be standardized, if only for our own use. The output of those tools would allow us to release an addendum to the spec to describe which code points are allowed. The addendum wouldn't be used for well-formedness (we'll keep the ABNF minimal that way) but for validity. The new spec with addenda would require testing, and we would need to keep toml-test up to date, and it's good that the testing suite is getting the love it needs nowadays.

And if users hit upon something we missed or would cause problems, all these different processes and documents would need revisions. Well, hopefully not the processes. But we'd all need to keep on top of these things.

Just some of my stray thoughts before I head into work.

@abelbraaksma
Contributor

But let's set all that aside for a bit. We have tools to automate ABNF generation, and it would fall upon us to use those tools consistently with every release if we decided to track Unicode classes.

But while that would solve our own issue with creating the spec, it has a bunch of side effects:

  • we are not future-proof anymore
  • users cannot use unassigned codepoints anymore
  • TOML files will become less interchangeable
  • if users want to use a newly assigned codepoint, they will have to wait until their implementation of TOML has caught up
  • future versions will not necessarily be backward compatible anymore (category changes)
  • we would need to align versioning with Unicode versions (if they have a new version, we need a new version)
  • ... a lot more

Creating Unicode ranges independent of any foreign versioning and whims would fix all of the above. There's a reason other standards went in that direction as well. No lock-in and no lock-step needed with existing versioning.
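For comparison, a fixed-range membership test needs no Unicode tables at all, which is the implementation-simplicity argument in a nutshell. The ranges below are invented for the example and are NOT the actual TOML grammar:

```python
# Illustrative only: a contiguous-range check in the style of the
# range-based ABNF. These ranges are made up for the example and are
# NOT the ranges in the actual TOML grammar.
BARE_KEY_RANGES = [
    (0x2D, 0x2D),      # -
    (0x30, 0x39),      # 0-9
    (0x41, 0x5A),      # A-Z
    (0x5F, 0x5F),      # _
    (0x61, 0x7A),      # a-z
    (0xA0, 0x10FFFF),  # all non-ASCII, including unassigned code points
]

def is_bare_key_char(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in BARE_KEY_RANGES)

assert is_bare_key_char("a")
assert is_bare_key_char("ß")      # non-ASCII falls inside a fixed range
assert not is_bare_key_char("=")  # syntax characters stay excluded
```

Because unassigned code points land inside fixed ranges, the check never has to change when Unicode later assigns them.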

@arp242
Contributor Author

arp242 commented Nov 22, 2023

There will always be characters that can be confusing, but that is a judgment we should leave to the writer of the TOML files. Who are we to decide what is and what is not a legible key character?

So why bother to specify ranges at all then?

This is a nice soundbite, but meaningless and doesn't address anything.

We tried to use the lessons learned from other standards that went through the same process and often regretted earlier decisions (i.e., allowing Letter/Digit and tying the standard to a minimal Unicode, introducing compatibility issues).

What "other standards" are these? What specifically have they come to regret? The only one you've ever mentioned is XML, which made its decisions 25 years ago, when Unicode was in a rather different place; almost everything else I've been able to find uses some form of the Unicode database.

@abelbraaksma
Contributor

So why bother to specify ranges at all then?

I agree. There is no inherent need for that, except for certain special characters. Which is in part why we include much more than just Letter/Digit. We've briefly looked into precisely this idea.

This is a nice soundbite, but meaningless and doesn't address anything.

Sorry if you see it that way. To me it is a very important part. We should not try to be prescriptive. People can think for themselves. If I would encounter a Chinese TOML file I won't understand it. Neither would I if it was written in smileys. But that's not for us to decide. The whole idea with this is to be as liberal as we can be (see previous point).

Because the only one you've ever mentioned is XML

At the time it was a long discussion in the W3C WG spanning multiple standardization committees. Indeed, this includes XML, but also HTML, SVG, XPath and, iirc, several standards that aren't derived from or dependent on XML (for instance, you see a similar decision in the HTTP standardization process: the more recent the standard, the more permissive the ranges, generally speaking).

that I've been able to find uses some form of Unicode database.

Did you look at standards that have a lot of implementations, or did you look at languages, which often have only one or two implementations, like C# or Java or Python?

As I mentioned above, in the end, if there's consensus to go the Unicode Categories way, or similar, fine with me. I am just warning against it, as I feel strongly that it is a step in the wrong direction.

You didn't address the issues with versioning, being future-proof, differences between implementations, implementations that don't have access to a Unicode database or that want to be lean and mean (the C++ Unicode data is many dozens of MBs in size, iirc), allowing Unicode to grow without having to update TOML, and how to deal with ranges that are currently unassigned but become assigned later.

To me, these are insurmountable issues, unless you accept the downsides and just tell folks to "live with it". Differences exist; just check each implementation's details. That's not the end of the world or anything, I just want us to make a (very) conscious decision on something that has already been discussed at length and (very) consciously decided before. In my view, if you reverse a previous decision by 180 degrees, there should be an even stronger argument in favour of it.

Again, not trying to say "don't do it". Just trying to say "tread carefully, the path is treacherous..." ;).

@ChristianSi
Contributor

In my humble (and totally unbiased 😉) opinion, my alternative PR #1002 (which extends the ranges just gently and improves the textual clarification) would be the best way to go. But leaving the ranges as they are in the current main branch should be acceptable as well (though I really would improve the wording in the written spec, as the main-branch text is a bit misleading there).

@arp242
Contributor Author

arp242 commented Nov 24, 2023

You didn't address issues with versioning, being future proof, and differences between implementations, or implementations that don't have access to a Unicode Database, and allowing Unicode to grow without having to update TOML, and how to deal with ranges that are currently unassigned that become assigned later.

I already addressed all of this in the original PR message or the following discussion.

This is like drawing blood from a stone.

Vague references to standards running into problems using this approach

"So what specific problems then?"

More vague references

Well okay then 🤷

something that already has been discussed at length and (very) consciously been decided before. In my view, if you change a previous decision by 180 degrees, then there should be an even stronger argument in favor of it.

And I did mention this back then, and then too I was told "we already discussed this, so fuck off" in so many words. So I unsubscribed and shrugged.

Someone else objected a few weeks ago. He was told to fuck off as well and hasn't returned since.

Never mind that some issues were not mentioned even once before the entire thing was merged (mainly normalisation).

The sheer length and verbosity generated on this right from the start makes it hard for anyone to pitch in.

So "discussed at length" means bugger all.

@abelbraaksma
Contributor

abelbraaksma commented Nov 24, 2023

"So what specific problems then?"

I listed them above, and then one by one in summary, and they are present in this PR. The part from the standards committees was very explicit. They locked themselves into Unicode versions, and it became impossible to create a new version of the standard without becoming backwards incompatible.

We decided before to have a simple range, not locking into Letter/Digit definitions as they are way too complex (you added several hundred lines to the ABNF). If we go this way, we should do it properly:

  • Let every implementation state what Unicode version they reference
  • That version of Unicode defines Letter/Digit pairs
  • Add the Private Use ranges
  • Add all ranges currently unassigned

Do not lock your own standard to another standard. If you do need to reference a version, make sure it is forward-compatible (i.e., think about what would happen if TOML were never updated again: are we screwed, or are we OK?).
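For context on how large (or small) the implementation burden actually is: when a Unicode database is at hand, the category check under discussion fits in a few lines. Here is a sketch in Python; the function names are mine, and the exact category set (general categories L* and Nd, plus the `-` and `_` that TOML bare keys already allow) is an assumption, since which categories to include is precisely what's being debated:

```python
import unicodedata

def is_bare_key_char(ch: str) -> bool:
    """Check one character against a hypothetical Letter/Digit bare-key rule.

    Assumes: Unicode general categories L* (all letters) and Nd (decimal
    digits), plus '-' and '_', which TOML bare keys already permit.
    """
    if ch in "-_":
        return True
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"

def is_bare_key(key: str) -> bool:
    # A bare key must be non-empty and consist only of allowed characters.
    return bool(key) and all(is_bare_key_char(c) for c in key)

print(is_bare_key("naïve"))    # letters from any script pass
print(is_bare_key("日本語"))    # category Lo, passes
print(is_bare_key("🐱"))       # emoji is category So, rejected
```

Of course, this leans on the host language's bundled Unicode tables, so it sidesteps rather than answers the versioning question raised above: two implementations shipping different Unicode versions can disagree on characters assigned in between.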

If there really is a good reason to go with this more complex approach (harder to implement, but we can swallow that pill, I guess), then please find a dynamic approach and let's do it properly. But realize that this is the second time we're attempting that, and I feel very much that this discussion is going in the exact same direction as back then. That is not a critique; open standards have a way of doing that from time to time.

I was told "we already discussed this, so fuck off" in so many words. So I unsubscribed and shrugged.

I doubt I ever said anything to that effect. I usually link to the places where something has already been discussed when I make such a claim, but it is a lot of work; if I did make you unsubscribe, I apologize. When something takes over a year to implement, there is bound to be a lot of discussion, and I doubt I remember it all now. Pity you unsubscribed. If I did that every time I didn't get my way, there wouldn't be a repository left that I could participate in. Glad you came back, thanks for that!

He was told to fuck off as well and hasn't returned since.

Well, that's not nice. I have missed that entirely. But we should try to stay polite to one another. We all strive toward one goal: ensuring that TOML remains the brilliant, minimalistic standard it is.

So "discussed at length" means bugger all.

Please refrain from offensive language. I'm really trying here, we're all volunteers and doing this for the benefit of the community. I was not referring to this PR, but to the previous one, where we had several moments with voting and recollecting our thoughts by summarizing them for everyone involved.

We really went deep into the subject and researched several options. If I missed anything in this long discussion, I apologize. But I have not found an argument that convinces me to go to a sub-optimal solution. It may be here, but I did not catch it.

@arp242
Contributor Author

arp242 commented Dec 3, 2023

"So what specific problems then?"
I listed them above, and then one by one in summary, and they are present in this PR.

No you didn't. You listed vagaries and platitudes. "People have come to regret it" is a worthless argument without knowing why specifically, what problems people reported, etc. Mindless citations of Postel's "law" are even more worthless (aka "Postel's thing he said 40 years ago that may or may not be applicable to some scenarios").

Pity you unsubscribed. If I did that every time I didn't get it my way, there wouldn't be a repository left that I could participate in.

If you find yourself disagreeing all the time then maybe that's something you need to look at, because in my 25 years of participation strong disagreements have been exceedingly rare.

But you know what? I'm done here for now. Merge whatever atrociously broken faux-Unicode nonsense you want. The ONLY way this will not cause problems is if no one uses it.

So "discussed at length" means bugger all.

Please refrain from offensive language. I'm really trying here, we're all volunteers and doing this for the benefit of the community.

Pretending to be offended by a statement like that is some bad faith nonsense, especially since it conveniently allows you to ignore why I said that. This has been a recurring theme; just ignore the inconvenient and continue as if it wasn't said.

No wonder almost no one actually working on TOML is participating here any more and it's been hijacked with random people from the internet with Very Strong Opinions™.

@eksortso
Contributor

eksortso commented Dec 3, 2023

No wonder almost no one actually working on TOML is participating here any more and it's been hijacked with random people from the internet with Very Strong Opinions™.

I don't want to get involved with this proposal. I simply have no time. But I've been involved with the TOML project in general for at least seven years, and @arp242 you still somehow suggest that I'm just one of those random opinionated people you're railing against. That is not fair, and it's not good. You've been doing great work on toml-test since you started maintaining it, and your participation on the toml specification is appreciated. But your outbursts are beneath you.

As I've suggested elsewhere, the alternative to this needless topic churn is an "enhancement proposal"-style document (like a Python PEP or Rust RFC) to keep down the bad feelings and raise up good choices. Especially if nobody can keep track of all the arguments one way or another. (I sure can't, but I would like to get into our history.) Doing this means compiling arguments and contexts that have already been stated, even those you disagree with, but the compassion would serve not just the standard project but also your own line of argument. If anyone wants to start setting precedents for these things, do please share your work with us.

Once I finish moving back to my hometown at the end of the year, I think I will begin this type of work, using the TOML wiki. I'll compile everyone's very strong or barely mentioned opinions on many topics, while treating my own opinions as no more special than anyone else's. I won't address bare-key characters just yet; I'll start small, with e.g. what characters to allow in comments (#996, #924, and earlier). After this, then I'll participate more, and try to put this particular proposal in context, so that we have minimal and obvious documents explaining ourselves objectively that we can share with each other and with random outsiders.

Meanwhile, I beg you and @abelbraaksma and everyone I've worked with in years past to:

  1. be as civil as you are opinionated,
  2. go re-read or restate previously made positions, and
  3. assume good-faith arguments until we no longer need to keep repeating ourselves.

@arp242
Contributor Author

arp242 commented Dec 3, 2023

This is veering hugely off-topic, but I definitely think what TOML needs is less arguing from ivory towers and more firing up of editors and IDEs and actually writing stuff. In the end, arguing isn't actually worth anything.

As it stands the spec is just broken, as it includes an example with an emoji, which just doesn't work for the general case. That it's nonetheless included as an example demonstrates that y'all didn't even understand your own spec.

And you really didn't need me to tell you that. You could have found that out yourself if you had actually bothered writing test cases. Again: more firing up of editors.

All of this has been a recurring theme on a number of issues now. When I actually did the work of implementing duration and file size suffixes as a prototype and found it didn't actually work for reasons that were never mentioned by anyone, your response was "we discussed this at length already". Seriously...? All of that was completely worthless because the real important issues were not discovered until someone actually did the real work instead of bikeshedding about details.

So yeah, turns out you do need to actually work on things to know where the problems are and what does and doesn't work. And while in principle I'm willing to listen to anyone, when discussions are dominated by people who don't actually do anything other than argue here (not even writing a test case!) then I do think there's a bit of an issue...

@ChristianSi
Contributor

ChristianSi commented Dec 6, 2023

@arp242:

As it stands the spec is just broken as it includes an example with an emoji, which just doesn't work for the general case. That it's nonetheless included as an example demonstrates that ya'll didn't even understand your own spec.

Yeah, nobody is perfect at first attempt, and though that's not exactly a bug, I agree it's confusing and should be removed as an example – as I also do in #1002.

But I also think the most promising way forward is to straighten out such little irregularities if one finds them, rather than throwing more or less all prior work away and starting again from scratch, as this PR advocates. Well, there may be rare cases where starting from scratch is indeed best, but I'm pretty unconvinced that this is one of them. Especially since the solution you propose would be way more complicated than the one we already have, and it wouldn't solve any real-life problems that the existing solution doesn't solve just as well.

Regarding the case of the duration and file size suffixes: I wasn't much involved with that, so I don't know any details, but it seems a very different case, since that one has so far remained in the exploration and discussion phase and is not scheduled to make it into TOML 1.1. For more comprehensive bare keys, on the other hand, we have a working solution that has been merged and is ready to be shipped. (Though I agree that there is still some room for improvement, as I suggest in #1002 – but that's the way of incremental progress rather than a radical break.)
