
Clarify that key uniqueness depends only on binary representation, recommend normalization #966

Open
SnoopJ opened this issue Mar 9, 2023 · 56 comments · May be fixed by #990

Comments

@SnoopJ

SnoopJ commented Mar 9, 2023

I've just learned about #891 and I'm excited to see that the TOML specification is improving Unicode support.

Do I understand right that this changeset makes no recommendations for implementers when it comes to equivalence of keys? I see a note on normalization on the related issue, but if I understand the PR correctly, keys that are "equivalent" under one of the normalization forms of UAX#15 will be distinct under the specification unless their binary representations are identical.

That note suggests that a warning/suggestion for implementers might be added to the spec, but it looks like that never happened. This issue is a request that such a note be added to at least make implementers (and users of parsers that don't bother with normalization) aware of the potential confusion of keys, as in the example of ñaña (NFC form, 6 bytes in UTF-8) vs. ñaña (NFKD form, 8 bytes in UTF-8).
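
The pitfall is easy to reproduce with Python's standard-library `unicodedata` module (a sketch for illustration, not taken from any TOML implementation):

```python
import unicodedata

# Two canonically equivalent spellings of the same visual string.
nfc = unicodedata.normalize("NFC", "ñaña")    # ñ as one code point, U+00F1
nfkd = unicodedata.normalize("NFKD", "ñaña")  # ñ as n + combining tilde U+0303

print(len(nfc.encode("utf-8")))   # 6 bytes in UTF-8
print(len(nfkd.encode("utf-8")))  # 8 bytes in UTF-8
print(nfc == nfkd)                # False: distinct keys under binary comparison
print(unicodedata.normalize("NFC", nfkd) == nfc)  # True once both are NFC
```

Under the current spec text, a parser doing binary comparison would happily accept both spellings as two different keys in the same table.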

@pradyunsg
Member

See also #954

@eksortso
Contributor

We did discuss key comparisons and issues regarding normalization in great detail in the discussion on #941. It's a very deep issue! In fact, some languages and tools tend to perform normalization implicitly.

@SnoopJ I personally like the idea that keys be normalized during comparisons, but we shouldn't make normalized or binary comparisons mandatory in all cases. In old-fashioned parlance, that's not a MUST nor a MAY, but a SHOULD.

During that past discussion, I suggested something like the following be added to toml.md, though this is different from the original in that I heavily favored NFC in my original draft:

Because some keys with different code points look the same, parsers should compare keys using a specific normalized form of the keys, rather than just using binary comparisons.

What are your thoughts on this text? Could we make the point more clear?

And also, what about including this line for TOML emitters?

Likewise, encoders should write keys using a specific normalized form.

Thoughts?

@ChristianSi
Contributor

ChristianSi commented Mar 11, 2023

@eksortso: I like the sentence about parsers, but instead of "a specific normalized form" I'd write "Normalization Form C (NFC) or Normalization Form D (NFD)". The NFK... forms normalize too aggressively (equating compatibility characters such as ligatures with their plain counterparts, say), hence they must not be used!

I'm less sure the sentence about encoders is needed, but I guess it makes sense.

@marzer
Contributor

marzer commented Mar 11, 2023

I remain opposed to recommending any specific normalisation in the spec for all of the reasons I laid out in #941 (none of which were satisfactorily rebutted, IMO). The biggest one is still that normalisation requires a third-party dependency in low-level languages like C and C++; that is automatically a non-starter for most users when choosing a library. Requiring users to "just link with ICU" is not a solution to this problem, because it still complicates their toolchain and ICU is not necessarily always available.

And again I'd like to note some caution around the RFC-like "SHOULD" parlance - we either use the RFC terminology, in which case we must state so clearly in the spec document, or we use plain English. The word "should" carries quite different connotations between the two contexts. Whichever we choose, it must be consistent throughout the whole document.

My end-game fear is that if some form of normalisation is recommended in the spec, and that normalisation is tested in the toml-test suite, then it implicitly becomes a requirement. The simplest and most portable path forward is to do as OP is suggesting: simply clarify that key comparison is ordinal; buyer beware.

@eksortso
Contributor

eksortso commented Mar 11, 2023

The NFK... forms normalize too aggressively… hence they must not be used!

@ChristianSi I take it you've got some war stories to tell about that? How are NFKD and NFKC, as defined by Unicode, generally received by people who look at them for normalizing identifiers? What consequences did they face when they chose to use these forms?

Unless convinced otherwise, I say that the specific normalization form chosen is not our choice to make, except that whatever choice is made ought to be standardized. Maybe there's another strict standard I've overlooked that would work just as well. Or, UTF-8 bytes could be the standard, which would still be allowed, even if it's not recommended.

@marzer I'll review your arguments later, and do note that my "old-fashioned parlance" comment was written with your past comments in mind! ;) But to my English-trained ears, the word "should" sounds like the preface to a recommendation, not to a constraint. It invites the question "Well why should we??" from punks who don't want to be constrained without reason.

(Edited. The newer question sounds more natural.)

@marzer
Contributor

marzer commented Mar 12, 2023

But to my English-trained ears, the word "should" sounds like the preface to a recommendation, not to a constraint. It invites the question "Why not?" from punks who don't want to be constrained without reason.

Interesting; to me "should" sounds much too strong for something that is a mere recommendation. Wonder if it's a dialectical thing. In any case, I am fearful of any normalisation ending up in the test suite because I won't be implementing it in TOML++ any time soon and am sceptical that it actually solves a real problem.

@ChristianSi
Contributor

ChristianSi commented Mar 13, 2023

@eksortso Whoever gave you the impression that all normalization forms should work equally well? That's not how it works. The Unicode people themselves write in their Unicode Normalization FAQ:

Programs should always compare canonical-equivalent Unicode strings as equal.... One of the easiest ways to do this is to use a normalized form for the strings: if strings are transformed into their normalized forms, then canonical-equivalent ones will also have precisely the same binary representation. The Unicode Standard provides two well-defined normalization forms that can be used for this: NFC and NFD.

See how only the two shorter-named forms are listed here?

They go on:

For loose matching, programs may want to use the normalization forms NFKC and NFKD, which remove compatibility distinctions. These two latter normalization forms, however, do lose information and are thus most appropriate for a restricted domain such as identifiers.

Clearly, TOML keys are not a "restricted domain" since quoted keys can contain arbitrary characters. So these lossy forms (which treat merely similar strings as if they were identical) are not appropriate for our use case.
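
The lossiness of the K (compatibility) forms is easy to see with the fi ligature, which NFKC rewrites while the canonical forms leave alone (a small Python sketch):

```python
import unicodedata

ligature = "\ufb01le"  # "ﬁle", starting with U+FB01 LATIN SMALL LIGATURE FI

nfc = unicodedata.normalize("NFC", ligature)    # canonical: ligature preserved
nfkc = unicodedata.normalize("NFKC", ligature)  # compatibility: rewritten

print(nfc)   # 'ﬁle' (3 code points, unchanged)
print(nfkc)  # 'file' (4 code points): the distinction is lost for good
```

A TOML key containing the ligature would silently merge with the plain-ASCII key under NFKC, which is exactly the behaviour being warned against here.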

@eksortso
Contributor

@eksortso Whoever gave you the impression that all normalization forms should work equally well? That's not how it works.

@ChristianSi I never made that claim. Don't put words in my mouth. What I said was, it's not our position to say what normalization ought to be used. My restriction was that it must be standardized, so that its behavior is predictable. Maybe NFKC and NFKD are bad because they're overly aggressive, and they obviously would be, for strings that aren't identifier types like keys and table names!

@eksortso
Contributor

I am fearful of any normalisation ending up in the test suite because I won't be implementing it in TOML++ any time soon and am sceptical that it actually solves a real problem.

@marzer, well, there are several places in toml.md right now that use the words "should" and "recommended." Scan through those instances, and check BurntSushi/toml-test to see if these situations are actually tested. And if they are, take it up with the folks who run the test suite.

You may need to do the same thing if normalization gets imposed. And I too don't want it forced, but I do want it recommended, in the most civil manner possible.

@marzer
Contributor

marzer commented Mar 13, 2023

there are several places in toml.md right now that use the words "should" and "recommended."

I don't see why it can't be a "may", then? Why "should"? I'm OK with the word "recommended" because that is explicitly a recommendation (i.e. not a stipulation), and sounds similar to using "may". The word "should" can easily be interpreted as being stronger than that, as in "this is how it should be".

Aside from that, I haven't yet seen a satisfying argument for how this is actually any better than doing a simple ordinal comparison, frankly. Seems like an awful lot of intellectual chest-puffing without actually solving a real, extant problem. Again: Why not just say "key comparisons are ordinal, exercise caution"? It's a config language meant to be consumed in technical domains, after all, not a written language. There's already a contract between those writing the configs and those consuming them; surely unicode normalization concerns will be domain-specific. (And easily solved at the text-editor level where necessary?)

@marzer
Contributor

marzer commented Mar 13, 2023

And if they are, take it up with the folks who run the test suite.

This is the wrong attitude. Language specs usually don't exist in a vacuum; TOML is no exception. Implementation concerns should be taken into account at the spec level, not at the test-suite level. Seeing as I'm the only implementer who regularly contributes to discussions here, I have to bang that drum.

well, I don't have to, but an implementer needs to contribute to discussions here. Otherwise it becomes design by academia/committee.

@arp242
Contributor

arp242 commented Mar 13, 2023

there are several places in toml.md right now that use the words "should" and "recommended." Scan through those instances, and check BurntSushi/toml-test to see if these situations are actually tested. And if they are, take it up with the folks who run the test suite.

Off the top of my head I'm not sure whether "should" and "recommended" behaviours are tested; but if they are, they're relatively minor issues which are usually fairly easy to address one way or the other. The whole Unicode issue is much more major. If "recommend Unicode normalisation" ends up in the specification, I'll add something to make these tests optional (probably opt-in rather than opt-out), similar to how the TOML 1.1 tests are opt-in now.

Other than that, I mostly agree with marzer's comments, adding that I feel we can side-step the worst of the entire issue as I outlined in #954. I also wouldn't be in favour of "should" or "recommended" language for something like this: it sounds like a compromise that's "fair" in the sense that it leaves everyone equally unhappy. It's too large an issue to leave unspecified.

This issue has been debated a few times now in different threads over the last few months, and I don't really feel like doing it again. At this point it's fair to say the discussion is at a stalemate, and I don't really know how to advance it to reach a consensus.

@eksortso
Contributor

eksortso commented Mar 13, 2023

Maybe we actually can strike a balance. We don't want to say so much that it leaves us straitjacketed in the future. But we can point out some things that aren't immediately obvious, because they do need to be said!

We can just use a single paragraph, at the bottom of the Keys section, just above the Strings section:

Because some keys with different code points look the same, use caution when writing such keys in your TOML documents. Applications and parsers may use NFC or NFD to normalize keys before making comparisons so that canonically equivalent key names are considered the same. Nearly all programming languages have tools to normalize keys, in case implementers wish to do so.

This one paragraph has no SHOULDs in it @marzer, makes note of the issue without enforcing a means of fixing it @arp242, advises using NFC or NFD to make comparisons @ChristianSi, and provides links to resources for those who'd want to delve deeper into the issues. That's all we would need to say. For now, at least.

@SnoopJ
Author

SnoopJ commented Mar 13, 2023

I should probably have worded my initial request a little more clearly, since it's actually asking two related questions:

  1. Should the TOML spec point out the pit-fall?
  2. Should the TOML spec make any specific recommendations for implementers?

To me, (1) is an easy add. I personally am not heavily invested in the language used, so long as it warns users and implementers of the potential ambiguity.

It seems that (2) is a more contentious matter, and I don't have much to add to the preceding discussion, except to point out that TOML keys don't appear to meet the standard of "identifier" laid out by TR-31 (specifically because UAX31-R1 is not being satisfied by explicit choice of a "profile").

My main concern with (2) is end-users becoming sensitive to the details of whatever implementation(s) their data may pass through, and even a suggested normalization does not resolve this tension. Perhaps the best thing to do is to go in the other direction and require that implementations SHALL NOT normalize, i.e. double down on the binary representation, and users living with the edge cases just need to deal with that in their own applications?

I think y'all have a better sense of (2) than I do, but looking back, I think I should have filed this as two issues 😅

@eksortso
Contributor

@SnoopJ Would it suffice, for (2), to only permit normalization to be applied for key and table name comparisons? Some implementers may prefer to keep the normalized key names in memory, to ease key lookups post-processing. But we would never require string values to be normalized, so the data stored in values will be preserved. I made no mention of normalizing string values (as opposed to key and table names) anywhere in the proposed paragraph. Some languages may normalize strings automatically, but those implicit actions fall outside of this specification, and may actually hamper efforts to enforce binary representation.
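
One way to read that proposal in code: a table that applies NFC only at the key-comparison boundary while leaving stored values byte-for-byte untouched. A minimal sketch; the `NormalizingTable` name and shape are mine, not taken from any TOML library:

```python
import unicodedata

class NormalizingTable(dict):
    """Hypothetical table that treats canonically equivalent key spellings
    as one key; values themselves are never normalized."""

    @staticmethod
    def _norm(key):
        return unicodedata.normalize("NFC", key) if isinstance(key, str) else key

    def __setitem__(self, key, value):
        super().__setitem__(self._norm(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._norm(key))

key_nfc = unicodedata.normalize("NFC", "ñaña")
key_nfd = unicodedata.normalize("NFD", "ñaña")

t = NormalizingTable()
t[key_nfc] = "first"
t[key_nfd] = "second"   # canonically equivalent spelling of the same key
print(len(t))           # 1: the two spellings collide on one normalized key
print(t[key_nfc])       # 'second'
```

A real parser would also need to decide whether the collision above is a duplicate-key error, which is exactly the comparison question under discussion.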

@SnoopJ
Author

SnoopJ commented Mar 14, 2023

@SnoopJ Would it suffice, for (2), to only permit normalization to be applied for key and table name comparisons?

That does sound like it restricts the potential confusion to comparisons, rather than confusion about the data itself, which I think establishes the kind of "here be dragons" guide-rails that I had in mind when filing this issue. I'm afraid I don't know enough about implementations to have a particularly authoritative opinion, but the scope of the recommendation is definitely smaller in that case.

@marzer
Contributor

marzer commented Mar 14, 2023

@eksortso The paragraph you propose is much better, but you still haven't addressed why we're leaving the door open for any normalisation at all. Why do you want this?

The more I think about it, the more I realise there are really only two paths forward: it should either be a requirement with a specific algorithm, or we should explicitly rule it out with a note that we do ordinal (binary) comparison. Anything else will fragment the ecosystem (as @SnoopJ points out above).

And given my (still unaddressed) objections, ordinal is the only option. It is impractical for me to implement anything else.

Leaving room for normalisation is more and more seeming like a pet idea with no good reasoning behind it, and should be abandoned altogether, IMO.

Would it suffice, for (2), to only permit normalization to be applied for key and table name comparisons?

There is no good reason to do this in a technical domain, especially given the existing contract between config author and consumer.

@marzer
Contributor

marzer commented Mar 14, 2023

Perhaps the best thing to do is to go in the other direction and require that implementations SHALL NOT normalize, i.e. double down on the binary representation and the users living with the edge cases just need to deal with that in their own applications?

This is the correct approach.

@marzer
Contributor

marzer commented Mar 14, 2023

Something else nobody seems to have considered: if we leave the door open for normalised comparison of keys, we're closing a door for applications that may actually depend on them being compared ordinally. Developers can add normalization if they need it, but they can't take it away if we do it and they don't want it. Actively taking away the agency of developers by being 'too clever' tends to piss them off and send them elsewhere (see: YAML). This will become even worse if it's optional, because now they have to audition various implementations to find one that doesn't do the annoying thing they are trying to avoid, at the potential cost of picking one that is flawed/non-compliant in other ways. Fragmentation.

I can easily envision a scenario where someone would use TOML to configure some text parsing application (e.g. spam/swear filter, linter, regex) and they will likely wish to have normalisation be entirely handled by their application for more fine-grained control. We should not inadvertently block this workflow.

@ChristianSi, thumbs-down me all you like, but the fact remains that ruling-out normalisation in key comparison is the only portable and flexible path forward. TOML is not the right level at which to solve this problem.

@eksortso
Contributor

@marzer, you say "SHALL NOT" normalize is the best approach, which would be a MUST, which would be a REQUIREMENT, which would run counter to the obviousness principle, since a few popular languages normalize their strings by default. You got two thumbs-down for a sound reason.

TOML is not a binary format. We shouldn't force it to be.

And if programmers want to go against default behaviors by forcing ordinal string comparisons on key names and giving their users a bad time, then the onus is on them to communicate their wayward intentions. That's not our problem, and that shouldn't be our testers' problem either.

And we're fighting over something that is the edgiest of edge cases! This is as painful as the internet gets. ñaña is ñaña, and that is a rare but explainable problem to have. Let's save our sanity and keep this in mind.

@marzer
Contributor

marzer commented Mar 14, 2023

Sorry, but I disagree for all the reasons I've already laid out. Nothing you've said here satisfactorily counters my concerns. By doing this we're shutting out valid implementation concerns for purely ideological reasons that make no sense in a highly technical domain. I put it to you that you're the one trying to introduce an unexpected 'bad time', because your pet idea is inappropriate in this context. The rest of your argument is just facile wordplay.

And again, you still haven't explained why you're advocating so strongly for normalisation. As far as I can tell, you have no reason for it beyond "it is warm and fuzzy". Please. Why do you want this? It's a bad idea.

You got two thumbs-down for a sound reason.

By two people who aren't maintainers of TOML implementations. In this context I value those opinions very little. I also got a sound endorsement from someone who is, and that's far more meaningful to me.

And if programmers want to go against default behaviors

We decide the default. We pick a developer-hostile one at our peril.

This is as painful as the internet gets. ñaña is ñaña, and that is a rare but explainable problem to have.

Nothing that can't be solved by saying "exercise caution when using non-ascii keys", which is far simpler than an arbitrary and 'optional' solution to someone's pet issue.

@marzer
Contributor

marzer commented Mar 14, 2023

Ultimately, the onus is on you to counter this:

Developers can add normalization if they need it, but they can't take it away if we do it and they don't want it. [...] I can easily envision a scenario where someone would use TOML to configure some text parsing application (e.g. spam/swear filter, linter, regex) and they will likely wish to have normalisation be entirely handled by their application for more fine-grained control. We should not inadvertently block this workflow.

I submit that you can't, and thus the entire foundation of your argument is flawed.

@arp242
Contributor

arp242 commented Mar 14, 2023

The biggest problem I have is that in some environments the cost is too high. You can implement TOML 1.0 fairly easily without too much code. With required normalisation I'd have to import >30 MB of dependencies for a ~4,000-line TOML library which currently has no dependencies at all. It's not a show-stopper, but I'm not too happy with that balance either, as a TOML implementer or as a user (as in, a user of my TOML library; i.e. an application developer).

It was decided to have the current Unicode range in bare keys so that a Unicode library/database isn't required for a TOML implementation, and now people are arguing that normalisation (and thus a Unicode library/database) must be required. That seems rather inconsistent. Either go all-in on Unicode or don't; the in-between has the disadvantages of both.

We can't implement normalisation for all keys, as it's not backwards-compatible: we can only do it for bare keys. Unicode normalisation applied to quoted keys would make TOML 1.1 behave differently from TOML 1.0. "Unicode normalisation must be applied for bare keys only, but not quoted keys" seems confusing and generally horrible. I don't recall if this was previously brought up (IIRC it wasn't), but it certainly needs careful consideration.

My favourite solution remains rewriting the specification so that normalisation simply isn't needed; the combining tilde in ñaña isn't something that needs to be allowed in bare keys. This makes the entire issue go away.
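
That restriction could be enforced without a full normalization library, since detecting combining code points only needs the Unicode character database; a sketch using Python's standard `unicodedata` (the function name is illustrative, not from any spec draft):

```python
import unicodedata

def has_combining_mark(key: str) -> bool:
    """True if any code point in the key has a non-zero canonical
    combining class, e.g. the combining tilde U+0303."""
    return any(unicodedata.combining(ch) for ch in key)

print(has_combining_mark(unicodedata.normalize("NFC", "ñaña")))  # False: allowed
print(has_combining_mark(unicodedata.normalize("NFD", "ñaña")))  # True: reject as a bare key
```

With combining marks excluded from bare keys, the two spellings of ñaña can no longer both appear as bare keys, so the comparison question never arises there.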

@marzer
Contributor

marzer commented Mar 14, 2023

Unicode normalisation applied to quoted keys would make TOML 1.1 behave different from TOML 1.0. "Unicode normalisation must be applied for bare keys only, but not quoted keys" seems confusing and generally horrible.

This was a concern I raised in #941. It wasn't addressed there, either.

@eksortso
Contributor

This is certainly no pet project of mine. If you want to enforce binary equivalence, you go open a PR and fight for it yourself.

I'll stick with this.

Because some keys with different code points look the same, use caution when writing such keys in your TOML documents.

I'm through with this particular topic. Short of libelous claims made against me, I'm keeping quiet. Write your own PR.

@marzer
Contributor

marzer commented Mar 14, 2023

I'm fine with that sentence. It doesn't mention normalisation at all, and to any reasonable interpretation implies that keys are compared 'simply' (i.e. ordinally).

libelous claims

Lol. Come on. I asked you to justify why you wanted normalised comparison a number of times and you never answered; don't be surprised if I form opinions in the absence of an answer. It's not an unreasonable question, especially in the face of overwhelming counter-arguments.

@eksortso
Contributor

eksortso commented Mar 15, 2023

Lol. Come on. I asked you to justify why you wanted normalised comparison a number of times and you never answered

Normalization is going to happen in some cases even if we don't ask for it (not in C++ or Python, but in some languages). Knowing whether it does requires knowledge of the implementation language's default behavior during string comparisons.

@marzer You may need to restate some of your "overwhelming counter-arguments."

@ChristianSi
Contributor

@eksortso: I like and support your proposal.

I think normalization is indeed not a pressing issue, at least for the major European languages (I don't really know about others). In German, non-ASCII characters commonly occur in words such as Äpfel, Füße, Öl, etc., but I think that all the usual programs write them using the one-codepoint representation; two-codepoint alternatives are allowed, but I would suspect them to be very rare. So word processors, editors, etc. already do NFC normalization implicitly, meaning TOML parsers don't have to. And I suppose it's mostly the same in other languages using the Latin script.

In general, I think that data formats will pursue the same route and not expect normalization. One I checked is the JSON standard (ECMA-404), which says: "A name is a string." And: "A string is a sequence of Unicode code points". So, if the sequence of code points differs, it's a different string/key, even if it looks the same. Sounds reasonable.

Python also doesn't normalize strings implicitly: for example, "Françoise" == "Françoise" evaluates to False (the first uses NFC, the second NFD). Most other programming languages won't either, I suspect.
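
That claim is straightforward to verify; a sketch that builds the two forms explicitly, so the result doesn't depend on how an editor happened to save the literal:

```python
import unicodedata

a = unicodedata.normalize("NFC", "Françoise")  # ç as one code point, U+00E7
b = unicodedata.normalize("NFD", "Françoise")  # c followed by combining cedilla U+0327

print(a == b)  # False: Python compares code point sequences, not visual forms
print(a == unicodedata.normalize("NFC", b))  # True once both are in NFC
```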

So, let's not do it either.

@arp242
Contributor

arp242 commented Sep 13, 2023

As mentioned, the big question is "are TOML keys identifiers or strings?" – I'm firmly in the "identifier" camp on this; that some (but hardly all) languages parse tables to a hashmap with string keys is an implementation detail.

Almost all languages either 1) apply normalisation to identifiers, or 2) don't have these issues because combining characters aren't allowed in identifiers. Just like APFS, they don't do that for the fun of it or because people were bored.

People mentioned German, but I consider German to be "too easy" to serve as an example here; it's basically just Latin + umlaut on a very limited set of letters + a few other diacritics on some loanwords. It's not that different from English really. This is the case for most European Latin-based languages, and I consider those all "too easy"; what about Vietnamese? Greek? Korean? Bengali? Arabic? Other scripts?

Judging from one Stack Overflow answer, at least Vietnamese output is non-NFC on Windows, although that's from 2011, so who knows if it's still accurate.

In short:

  • I'm not sure if "no one replied so let's just hope for the best" is a good path. Actually, I'm pretty sure it's not – not many people read this.

  • To really have an informed opinion on all of this, real-world experience and knowledge with different systems and scripts seems required, rather than just reading some specification document on unicode.org. None of us seems to have this.

@marzer
Contributor

marzer commented Sep 13, 2023

I believe @eksortso's proposed wording change is the right move.

As mentioned, the big question is "are TOML keys identifiers or strings?" – I'm firmly in the "identifier" camp on this

Treating them as identifiers and expecting them to be normalized accordingly implies a number of significant problems:

  • If we enforce any sort of normalization, we are dooming TOML in lower-level contexts where implementers can't rely on their language, OS or local install environment having the necessary machinery. Existing third-party solutions are enormous, and correctly implementing it manually is no mean feat.
  • Imposing normalization precludes situations where people may explicitly not want that. Developers can apply normalization to keys after-the-fact if they need it, but they can't take it away if it's done for them and they don't want it, since it's a lossy operation.
  • If we normalize bare keys, do we extend this to quoted keys? If so, then we're effectively changing the contents of a quoted string, which is counter-intuitive. If not, we're still stuck with the same problem of visually identical strings sometimes comparing unequal, only now one of them has quotes around it, which isn't any better.

Whereas treating them as strings and always comparing them ordinally means:

  • Sometimes keys that look the same compare differently? Big whoop. We have that problem now with quoted keys. At least if we enforce ordinal comparison, the behaviour will be uniform and predictable everywhere, rather than 'implementer's choice'.

All the C and C++ implementations I'm aware of do ordinal comparison of keys. Higher-level language implementations seem to be a bit of a mixed bag, depending on what's available in their language (since they typically have the luxury to choose), but even they are going to be at the mercy of whatever normalization algorithm their language and/or standard library make available, and how current the underlying implementation actually is.

There also has been a temptation to make key normalization optional, but that just means we have a fragmented ecosystem. Nothing in the spec should be optional, IMO. When a change is controversial enough that people are even tempted to say the "O" word, that change should be a non-starter.

Thus, amending the spec to explicitly clarify that "yes, keys are treated as strings, and yes, they are compared ordinally" is the only portable and robust path forward.

@ChristianSi
Contributor

ChristianSi commented Sep 17, 2023

@arp242: Why are you in the "identifier" camp? I'd rather say keys are like keys in a hashmap, since that's precisely TOML's conceptual model ("Tables (also known as hash tables or dictionaries) are collections of key/value pairs", says the spec). However, few (none?) languages seem to normalize keys in such hashmaps, as we've noticed.

As for normalization requirements of different languages: yes, maybe normalization is indeed necessary to make keys in (say) Vietnamese or Arabic work well, who knows? None of us, it seems. Still, I think there are ways around that. One would be a preprocessing step: "If your keys are in a language where a lack of normalization is a problem, please make sure to only pass properly normalized input to your TOML parser."

Another would be to weaken @eksortso's proposal to make binary comparison the default behavior, but allow parsers to optionally normalize if requested by the user. I would approve of that, as I think that if we have sane defaults, then configurable deviations from those defaults are entirely acceptable.

@arp242
Contributor

arp242 commented Sep 17, 2023

Keys don't need to be decoded to hashmaps; at least in Rust and Go it's common to deserialize to a struct, and this probably applies to other languages as well. Maybe I'm biased because I spend most of my time with Go, but the typical pattern is:

type Doc struct {
    Key string
}
var d Doc
toml.Decode(`key = "value"`, &d)

You can scan to a map, but it's uncommon. This is the pattern I'd generally prefer in most languages, to be honest (even in e.g. Python or Ruby, by scanning class attributes; I don't know what the current possibilities for this are, but it's the type of thing I'd consider writing a patch or my own library for). I suppose this is part of the great "static vs. dynamic" debate.

I'm not a huge fan of the explicit mention of "hashmaps" in the spec (even though it's been there right from the start; this is one of the areas where TOML shows its roots in Ruby and dynamic typing, which isn't necessarily a good thing).


As I mentioned in some earlier comments, I'm not a fan of enforcing normalisation. I'm just pointing out that "🤷" is not the best way either. I think I have a solution I will submit as a proposal sometime in the next few days.

@marzer

marzer commented Sep 17, 2023

This is the pattern I'd generally prefer in most languages to be honest [...] e.g. scanning class attributes

This is completely impossible in some languages because it requires some level of reflection that simply isn't there. C and C++, for example, have no built-in facility to do this (and are unlikely to get one in the next decade or so). You can implement it in C++ using user-supplied specialization machinery, but it's typically quite complex and error-prone.

Maybe I'm biased because I spend most of my time with Go

Indeed. Would you mind commenting on my concerns RE lower-level languages and/or environments? Because as it stands you seem to be ignoring them. We do so at our peril.

I'm just pointing out that "🤷" is not the best way either.

I should point out that I have not been "meh" about normalization at any point. I've argued quite stridently against it, and have provided pretty thorough reasoning for doing so. I'm not going to speak for others, but I think you're doing this discussion a disservice with that assessment.

Also, to your earlier comment:

To really have an informed opinion on all of this, real-world experience and knowledge with different systems and scripts seems required, rather than just reading some specification document on unicode.org. None of us seems to have this.

TOML++ is written purely in C++, and is used on Windows, macOS, iOS, many flavours of Unix, Android, Emscripten, and a bunch more esoteric things I hadn't even heard of before releasing it. Additionally, in the last two years or so it has picked up quite large user bases in China, Japan, and the Koreas. I've had bug reports where people are sending me TOML documents with Chinese text in the strings and comments. I've had bug reports about localization issues in German, Italian, Greek. I think at this point I've got a pretty good knowledge base on the matter. You know what I've never seen? A bug report where normalization would have fixed it. I realize that's anecdotal, but your assertion that none of us has real-world experience here is not correct.

@arp242

arp242 commented Sep 17, 2023

I am not in favour of adding normalisation @marzer, partly because I want TOML to be implementable "from scratch" without dependencies by any competent programmer. I have said this multiple times over the last years, including in my last comment. Your response is as if I strongly disagree with you, but I don't: we agree that normalisation shouldn't be added to TOML, and have agreed on this for a long time.

This is the pattern I'd generally prefer in most languages to be honest [...] e.g. scanning class attributes

This is completely impossible in some languages

"Most", not "all". "If the language supports it" is an obvious qualifier, but many (not all) languages do.

My point was just that "a table is a hashmap" is not always true as such; other methods/mappings exist, and are commonly used. The question was "why are you in the 'identifier' camp?" and this is my response.

I'm just pointing out that "🤷" is not the best way either.

I should point out that I have not been "meh" about normalization at any point. I've argued quite stridently against it, and have provided pretty thorough reasoning for doing so. I'm not going to speak for others, but I think you're doing this discussion a disservice with that assessment.

Well, "Six months with no response suggests it's not a pressing issue and not a regular occurrence. Unless we haven't connected with a non-English-speaking audience, and we know that isn't true" certainly seemed like "no one replied so 🤷" to me. What will be the effects for Korean, Vietnamese, etc? I don't know. I suspect none of us do. There are some indications that it may be problematic. "No one replied" is not identical to "we have investigated the matter and/or consulted with experts, and are confident that [...]".

@marzer

marzer commented Sep 17, 2023

What will be the effects for Korean, Vietnamese, etc? I don't know. I suspect none of us do.

Hah. Interestingly I edited my earlier message to somewhat cover this. I have some experience here - see above.

Your response is as if I strongly disagree with you

I'm responding as if you still want to keep normalization on the table, when IMO it's totally untenable, that's all.

I could tolerate it being an 'optional' part of the spec, but I don't want people making bug reports to me because "this higher-level language implementation does it, why doesn't yours?", or "oh yeah TOML++ passes the toml-test suite as long as you turn off this optional part". Because, to be clear: I'm absolutely not implementing it, regardless of what ends up in the spec. It's a total non-starter in a cross-platform C++ implementation.

@arp242

arp242 commented Sep 17, 2023

I'm responding as if you still want to keep normalization on the table, when IMO it's totally untenable, that's all.

To clarify: it's not "on the table" as far as I'm concerned. But I'm also not happy with how things are. This is the intrinsic difficulty with all of this that I talked about before, because none of the options are "clearly good".

or "oh yeah TOML++ passes the toml-test suite as long as you turn off this optional part".

One (out of several) of the reasons I'm not really happy with "implementation-defined behaviour" (any implementation-defined behaviour) is exactly that it will complicate everything in toml-test. Not just that I will need to implement something for this, but also that it complicates things for all implementers who want to use toml-test. We've already seen this with support for the TOML 1.1 draft, which is "necessary complexity", but I don't think anyone is helped by more of this.

Hah. Interestingly I edited my earlier message to somewhat cover this. I have some experience here - see above.

Right so – I didn't see that 😅

I've seen comments in many languages, but I can't recall seeing non-ASCII key names (in quoted keys, because that's your only option now); all I'm going on is things like that Stack Overflow answer I mentioned, but also things like Apple removing normalisation when they moved from HFS+ to APFS, only to reluctantly bring it back because it caused issues for people (on the other hand, I believe NTFS/Windows doesn't do this and never has, so does this cause issues there? Maybe Windows does some stuff at other layers that macOS doesn't, which makes this less of an issue?)

@marzer

marzer commented Sep 17, 2023

This is the intrinsic difficulty with all of this that I talked about before, because none of the options are "clearly good".

Yeah, I get that. This is the main reason I'm in favour of explicitly codifying ordinal comparison - it's (mostly) standardizing existing practice, and is the only truly portable way of handling the issue. I realize it still has caveats, but then there are literally zero solutions to this problem that are caveat-free, so IMO we should pick the most portable of the various evils we have available to us.

I believe NTFS/Windows doesn't do this and never has, so does this cause issues there? Maybe Windows does some stuff at other layers that macOS doesn't which makes this less of an issue?)

You're right, Windows doesn't do this anywhere in NTFS (or any other filesystem, for that matter):

There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. ref

...though C# and higher-level language APIs might. In practice I've never heard of it causing significant issues (though I'm sure it has for someone, somewhere). My point is mainly that so far I have not had any bug reports where normalization would have solved the issue; people combining low-level programming languages with string handling pretty quickly become aware that strings are just bags of bytes, so I doubt it's a real issue for people beyond something they need to be aware of more generally anyway.

@erbsland-dev

erbsland-dev commented Sep 17, 2023

I'm somewhat perplexed by the ongoing discussion. The original post and title suggest that the main focus is on specifying that parsers should use binary comparison for keys, while also providing warnings about potential side effects.

From what I've gathered in the comment thread, there's a general consensus that implementing normalization requirements is impractical. This is largely due to the undue burden it would place on parsers by introducing substantial dependencies, for a benefit that is marginal at best. Concurrently, there's agreement that the method of key comparison should be explicitly defined, which paves the way for a specific mandate on code-point comparisons for keys.

I concur with both of these observations. Adding a bulky dependency like ICU to a parser introduces unnecessary complexity, potential security risks, and bloat. Furthermore, a well-defined standard that eliminates ambiguity is essential; it ensures that any TOML document will be parsed uniformly across different parsers that adhere to the standard.

I also appreciate the proposed text by @eksortso as a useful clarification. However, I'd like to suggest a slight alteration in the wording of the warning:

Keys are considered identical if their code-point sequences match.

Exercise caution when using keys containing composite characters; they might appear
identical but differ in code-point values. If your use case involves such keys,
it's advisable to normalize both the TOML document and the corresponding keys
in your code before parsing.

# prénom = "Françoise", using NFC
"pr\u00e9nom" = "Françoise"

# prénom = "Françoise", using NFD
"pr\u0065\u0301nom" = "Françoise"

In my earlier comment, I also included text about parser behaviour in the warning. I realised this was a mistake, so the text above is directed solely at the user of a TOML parser.
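
The two spellings above really are distinct code-point sequences that NFC maps onto each other, which is easy to check with Python's stdlib:

```python
import unicodedata

nfc = "pr\u00e9nom"    # é as one precomposed code point (U+00E9)
nfd = "pre\u0301nom"   # e followed by combining acute (U+0065 U+0301)

assert nfc != nfd                                # distinct keys to TOML
assert unicodedata.normalize("NFC", nfd) == nfc  # identical after NFC
```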

What baffles me is the ongoing debate over the pros and cons of normalization, especially when there appears to be agreement on the core issues.

arp242 added a commit to arp242/toml that referenced this issue Sep 22, 2023
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
I believe this would greatly improve things and solves all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later. Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this went largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  It should either allow everything or nothing. This in-between is just
  horrible. From the user's perspective this seems like a bug in the
  TOML parser, but it's not: it's a bug in the specification.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrary while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.
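
  For what it's worth, several of the characters in this list are compatibility (or singleton) variants that Unicode normalization itself maps onto the TOML-special characters, which Python's stdlib can confirm:

```python
import unicodedata

# U+FF03 FULLWIDTH NUMBER SIGN folds to ASCII '#' under NFKC ...
assert unicodedata.normalize("NFKC", "\uFF03") == "#"

# ... and U+0387 GREEK ANO TELEIA is a singleton decomposition that
# NFC maps to U+00B7 MIDDLE DOT.
assert unicodedata.normalize("NFC", "\u0387") == "\u00B7"
```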

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2;
  Go is Unicode 8.0; etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 because everyone supports it; I hesitated over this
  for a long time, and we could also use a more recent version. I feel
  this strikes a nice balance between reasonable interoperability and
  future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what the tooling supports. AFAIK no one
  uses the ABNF directly in code; it's merely "informational".

  I'm not happy with this, but personally I think it should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it's really something the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?).
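
That range-table check from the first trade-off above can be sketched in
a few lines. Below is an illustration in Python; the table is a tiny
hypothetical excerpt (a real auto-generated one would have ~716 ranges),
not the actual allowed set:

```python
from bisect import bisect_right

# Hypothetical excerpt of the "allowed characters" table: sorted,
# non-overlapping, inclusive (start, stop) codepoint ranges. The real
# auto-generated table would contain ~716 entries.
ALLOWED_RANGES = [
    (0x2D, 0x2D),  # HYPHEN-MINUS
    (0x30, 0x39),  # DIGIT ZERO..NINE
    (0x41, 0x5A),  # LATIN CAPITAL LETTER A..Z
    (0x5F, 0x5F),  # LOW LINE
    (0x61, 0x7A),  # LATIN SMALL LETTER A..Z
    (0xC0, 0xD6),  # LATIN CAPITAL A GRAVE..O DIAERESIS (illustrative)
]
STARTS = [start for start, _ in ALLOWED_RANGES]

def is_allowed(ch: str) -> bool:
    """Binary-search the range table for the character's codepoint."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and ALLOWED_RANGES[i][0] <= cp <= ALLOWED_RANGES[i][1]

def is_valid_bare_key(key: str) -> bool:
    """A key is valid if non-empty and every character is in the table."""
    return len(key) > 0 and all(map(is_allowed, key))
```

The logic stays the same however large the table gets; only the data
grows.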

Another solution I tried is restricting the code ranges; I attempted
this twice (some months apart) and spent a long time looking at Unicode
blocks and ranges, but found it impractical: we'd end up with a long
list which isn't all that different from what this proposal adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I ran into this just the other day when I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
@arp242 arp242 linked a pull request Sep 23, 2023 that will close this issue
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
@wjordan

wjordan commented Sep 26, 2023

One approach I haven't seen discussed here: instead of recommending or requiring transparent normalization of keys, or asking the user to simply 'exercise caution', the spec could recommend that parsers implement configurable warnings ('diagnostics' in the UTS55 lexicon) to flag potential security issues. That would be a more broadly defined and open-ended measure of irregularity, confusability, and potential spoofing than simply requiring a strict normalization form.

UTS55 mentions the Rust compiler as an example, which emits warnings about the use of identifiers outside the General Security Profile, as well as confusable identifiers more generally.

As another industry example implementing this approach, the GCC compiler includes a -Wnormalized warning that flags non-NFC identifiers by default:

In ISO C and ISO C++, two identifiers are different if they are different sequences of characters. However, sometimes when characters outside the basic ASCII character set are used, you can have two different character sequences that look the same. To avoid confusion, the ISO 10646 standard sets out some normalization rules which when applied ensure that two sequences that look the same are turned into the same sequence. GCC can warn you if you are using identifiers that have not been normalized; this option controls that warning.

There are four levels of warning supported by GCC. The default is -Wnormalized=nfc, which warns about any identifier that is not in the ISO 10646 “C” normalized form, NFC. NFC is the recommended form for most uses. It is equivalent to -Wnormalized. [...]
You can switch the warning off for all characters by writing -Wnormalized=none or -Wno-normalized. You should only do this if you are using some other normalization scheme (like “D”), because otherwise you can easily create bugs that are literally impossible to see.
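
As a rough sketch of that diagnostics approach (Python here for
brevity; the function name and message format are invented, and a real
parser would attach line/column information):

```python
import unicodedata

def check_keys_nfc(keys):
    """Return GCC-style '-Wnormalized=nfc' warnings for keys not in NFC."""
    diags = []
    for key in keys:
        if not unicodedata.is_normalized("NFC", key):
            nfc = unicodedata.normalize("NFC", key)
            diags.append(f"warning: key {key!r} is not in NFC; "
                         f"its NFC form is {nfc!r}")
    return diags
```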

@arp242
Contributor

arp242 commented Sep 26, 2023

Continuing from #990; @marzer said:

I disagree with you that "only 17kb of memory" is fine; for some embedded environments that's absolutely a deal-breaker

How much memory does toml++ use now? I ran example/example.toml via example/simple_parser.cpp, and it uses ~210K of memory:

% clang++ -std=c++17 -O2 simple_parser.cpp
% valgrind --tool=massif --time-unit=B ./a.out

If I reduce example.toml to just one line (the title at the top) it uses ~190K, so it's not the size of the file.

But maybe I'm using it wrong? I didn't look beyond just running valgrind like this.

tomlc99 uses a lot less memory though: ~14K on the full file, or ~5K for a single-line file (via toml_cat.c).


I'm not really concerned about security @wjordan, beyond LTR control codes and the like. I suppose someone could make it appear a key was set, or make it appear something was commented out, but applications should really reject unknown keys – typos are also a security risk.
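
A minimal sketch of that "reject unknown keys" point, assuming a Python
application validating a parsed table (the function and error text are
hypothetical):

```python
def reject_unknown_keys(config, allowed):
    """Strict validation: refuse keys outside the application's schema.

    This catches plain typos and look-alike (confusable) keys in one
    step, since neither will match the expected key set.
    """
    unknown = set(config) - set(allowed)
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
```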

@marzer
Contributor

marzer commented Sep 27, 2023

How much memory does toml++ use now? I ran example/example.toml via example/simple_parser.cpp, and it uses ~210K of memory

A parsed document doesn't use all that much, since it's just a set of nested key-value pairs with some additional metadata, which aren't particularly large (indeed you saw this yourself when you reduced the size of the parsed TOML - the overall memory usage went down by some amount, but not by much compared to the total use of the application). The main additional cost is the source region information for each key and value (start line, start column, end line, end column). If the library didn't store that then the KVPs would obviously be quite a bit leaner.

It'll use more memory than that during parsing, of course, because of all the state required during the parse, but there isn't much to be done about that. There are steps I could take, but they'd come at the expense of parse speed and/or diagnostics.

Probably worth noting that your test there is likely also demonstrating the base memory cost of C++ itself (runtime, static initializers, streams, exception handling stuff, vtables, RTTI, etc) - there's a lot of hidden memory costs that C (C99 especially) simply doesn't have.

All of this is moot, though; an additional 17KB of memory is a lot for a feature that solves a problem we aren't even sure anybody really has in TOML and has a much simpler solution available: "keys are strings, they're compared ordinally, beware". If we aren't going to do that, then your proposed solution in #990 is reasonable to me.

@marzer
Contributor

marzer commented Sep 27, 2023

@wjordan While tempting, I do not think comparisons to what Rust or GCC does are useful; a compiler for a programming language is rightly going to jump through a great many hoops to provide good diagnostics with lots of options; compilers are complex, as are programming languages. TOML is neither of those things, and TOML documents are not code. The "M" in TOML stands for minimal - putting a simple "hey, you can use some unicode in your keys, but there are some limitations" in the spec seems much more in line with that ideal to me.

Honestly if I had known when I proposed #687 that people would even think of stepping outside the realm of treating keys as anything other than just buckets of bytes, I do not think I would have bothered 😅

@arp242
Contributor

arp242 commented Sep 27, 2023

When I first voiced some concerns on the original Unicode PR I was basically told "we discussed this already, shut up". I was not very impressed by this, as I felt (and still feel) my concerns were valid, and were not really discussed before. But at the time I was much newer to commenting here, and I didn't want to be a nuisance, so I just unsubscribed from the PR.

It should come as no surprise to you that I've since come to regret just giving up, and to be honest I rather feel I was badgered in to shutting up by a clique who didn't accept outside commentary.

All that was 2 years ago, and my involvement has been frequent enough that I've become part of the "clique", I suppose. However I see the same thing happening here again.

If we want to have an open process to develop TOML then you also need to accept that sometimes new people come in and make (good-faith) arguments on matters that have already been discussed before. I see no real problem with that. Sure, this can perhaps be frustrating at times, but "don't say anything" is always an option.

And IMHO "it requires 17K of memory" is a new argument, because then we can have a discussion about "how much memory is too much memory?" 10K? 5K? 1K? That's also why I ran my quick and imperfect benchmarks. "Normalisation is too expensive" was just accepted as an axiom before, but how "expensive" is "too expensive", exactly?

It's not like that many people regularly contribute here, about 5 or so. The only reason I stuck around is because I'm maintaining a TOML implementation, and certainly wouldn't have come back if it wasn't for that. Just sayin'

And it's not like there's been that strong of a consensus on any of these topics; that's why we're having this discussion in the first place, right?

So in short, I'm not a fan of the implication that wjordan isn't allowed to make their case.


I do not think comparisons to what Rust or GCC does are useful or helpful; an entire compiler for an entire programming language is obviously going to jump through a great many hoops; compilers are complex, as are programming languages. TOML is neither of those things, and TOML documents are not code.

Indeed, but this actually makes the problem more pressing, rather than less. For programming languages I'm okay with assuming some minimum knowledge about character encodings and how all of this works. For TOML I'm a lot less okay with that.

@marzer
Contributor

marzer commented Sep 27, 2023

I apologise @wjordan, that was not my intent. We've been talking about this for about two years and it's been mostly the same circular discussions, that all follow the same path. There has been a frustrating inevitability to it. You just happen to be the most recent one to spark the normalization discussion, and I took my frustration out on you. I will edit my comment to remove that.

@marzer
Contributor

marzer commented Sep 27, 2023

@arp242 RE the memory stuff, that's not my main problem with normalization, but it is a problem with it. I can't answer your question though. How much is too much? Who knows? All I can tell you is that a library that used X KB one version, then X + 17 KB the next version, for the addition of a feature they are more-than-likely not using, would be a problem for a lot of embedded folks. They'd be free to stay on an older version, of course, but then they're limited to older TOML.

My main beef with it is that there's likely to be people that explicitly do not want it (e.g. they may be configuring some text parsing app, or doing regex stuff), and imposing it would be a blocker for that use-case. In that way feels to me to be a bit too "magic", too YAML-y.

@eksortso
Contributor

I apologise @wjordan, that was not my intent. [...] You just happen to be the most recent one to spark the normalization discussion, and I took my frustration out on you. I will edit my comment to remove that.

I'm sorry that this happened. Even though I'm glad new people are interested in TOML and encourage engagement here, I often forget that we do have a regular share of topics that have been talked about vigorously for years. It's one of the reasons that I've recently avoided @'ing folks here, against my instincts.

We really gotta start an RFC-like process on the wiki. It would help us summarize controversies for newbies who want to contribute but don't want to argue so much.
