Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that comments never affect the tables produced by parsers #950

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

eksortso
Copy link
Contributor

In the discussion on #603, it became clear that no restrictions are placed on parsers that could further modify the hash tables that they generate based on included comments. The purpose of this PR is to assure that parsers will never modify a TOML document's resulting hash table due to the presence or contents of the comments contained within the document.

The change to toml.md will prevent a certain class of abuses of the TOML syntax, without prohibiting comment-preserving parsers from doing their thing (as those comments are not part of the resulting table but merely reference it from locations in the source document).

(This PR replaces #948.)

@ChristianSi
Copy link
Contributor

I'm not convinced that this is needed. Could you give a specific example of how a comment could trigger changes to tables without violating the current spec?

@marzer
Copy link
Contributor

marzer commented Jan 11, 2023

My read of this is to explicitly codify that vanilla TOML does not allow for things like comment-based #pragmas (and other 'meta' directives), and a document relying on those sorts of things is no longer conforming.

Or, put differently, a conforming parser is always expected to ignore those constructs entirely, and never change the way the document is interpreted.

@eksortso
Copy link
Contributor Author

@ChristianSi Basically, what @marzer said, though it's not about the documents as much as the parsers that read them. You could write this test: if you walk through a generated table, a comment's text should not appear as a key, a table name, or a string value, assuming it wasn't introduced elsewhere via legitimate means. If a parser can preserve comments, then those comments could not be identified via a simple walk through the resulting table. They would need to be stored in a way that could not be conflated with keys, values, or table names. End-users could read comments and tweak the tables, but that would be outside the scope of parsers. And any form of extension syntax that an otherwise conformant parser would recognize, obviously, falls outside this restriction.

Here are two simple examples, which aren't all that clever, that could be read by consumers (i.e. parsers and their end-user programs) and manipulated in excessively clever ways. I talk a bit about how those clever ways could be implemented legitimately later on.

First is, how could we assure that a float is interpreted as a decimal, so that exact representation is preserved? The bad way is to comment that it's a precise decimal, and have the parser read that comment and emit a different value type than expected for floats, based on its representation in the TOML file.

# Not Recommended
one_quarter    = 0.25  ##decimal
nearly_a_third = 0.3333333333  ##decimal

Second is something that I came across at work one day. By default, keys and subtables are not ordered in TOML tables. But what if you wanted to put subtables in a sequence defined by the order in which they're described?

# Not Recommended
[report_tabs]  ##sequenced
[report_tabs.one]
[report_tabs.two]
[report_tabs.skipafew]
[report_tabs.ahundred]

The explicit [report_labs] isn't required, except that someone's clever parser will try to keep its subtable names in the stated order.


There doesn't seem to be a common way across languages or parsers to assure that decimal values are preserved. I am aware that the tomllib standard library module in Python 3.11+ (and the external tomli module for prior versions) offers a parse_float parameter in its load functions that defaults to float and can be set to Decimal (from the decimal module). It's a catch-all approach; you can't get both floats and decimals back from the same call. Whether a per-value means of representing binary64 values and decimal values in TOML remains an open question.

parsed_toml = tomllib.load(toml_file, parse_float=Decimal)

Sequenced subtables look like they'd work, but the only thing in TOML that actually guarantees a proper ordering is arrays. Something like the following isn't as pleasant-looking, but it gets the job done right. The names of the subtables from before are now values to name keys in each table element, and the order is guaranteed to be one, two, skipafew, and ahundred.

[[report_tabs]]
name = "one"
[[report_tabs]]
name = "two"
[[report_tabs]]
name = "skipafew"
[[report_tabs]]
name = "ahundred"

@ChristianSi
Copy link
Contributor

Hmm, TOML as doesn't guarantee any specific ordering of key/value pairs in a table, but neither does it prohibit it. For example, in Python (since v3.7) dict objects preserve iteration order. So a TOML parser that relies on them (as most will) will naturally allow iteration over key/value pairs in insertion order too. Is that in violation of the spec? No, of course not! Parsers aren't expected to preserve iteration order, but neither are they required to scramble the order.

Now, for "pragmas" such as ##decimal or ##sequenced. Well, of course they aren't standardized by the spec, but do we really have to forbid them? Anyone relying on them should realize that they tie their files to a specific parser. But are they a problem for us? Do they harm the TOML standard? Personally I don't think so and so would tend to consider this a "WONTFIX".

@eksortso
Copy link
Contributor Author

Hmm, TOML as doesn't guarantee any specific ordering of key/value pairs in a table, but neither does it prohibit it. For example, in Python (since v3.7) dict objects preserve iteration order. So a TOML parser that relies on them (as most will) will naturally allow iteration over key/value pairs in insertion order too. Is that in violation of the spec? No, of course not! Parsers aren't expected to preserve iteration order, but neither are they required to scramble the order.

@ChristianSi I never said that ordered subtables were in violation of the spec, just that their order is not guaranteed. Most of the examples that we discuss are Python-centric, but even if Python guarantees insertion order on dicts, there's no guarantee that the parser will insert subtables in that order that they are encountered in the TOML doc. Nobody can assume that text order will match insertion order, and anyone coding with that assumption is relying on implementation details alone, and that can be a dangerous practice.

And in those languages that provide iteration over hash tables but have no such order guarantee, there may be future demand for such a feature. Which their parsers could provide, certainly. So a suggestion to ensure such an ordering at the level of the TOML spec might never be encountered.

I just chose sequence order and decimal values as simple examples of potential comment abuse.

Now, for "pragmas" such as ##decimal or ##sequenced. Well, of course they aren't standardized by the spec, but do we really have to forbid them? Anyone relying on them should realize that they tie their files to a specific parser. But are they a problem for us? Do they harm the TOML standard? Personally I don't think so and so would tend to consider this a "WONTFIX".

Yes, they harm the TOML standard. They shouldn't be ignored.

What I'm trying to prevent is the widespread use of TOML comments as pragmas. Comments are meant to be ignored by parsers, and they shouldn't carry any more weight than that. In the long run, we must have that assurance.

If TOML parsers are going to implement extension features, then I'd much rather see them do so by introducing changes to the TOML syntax in their implementations. This would lead to two useful states, depending on the viability of the changes.

If they're unpopular changes, they'll be limited to the single parser that implemented them, and no other parser will be able to read the documents written with the nonconformant syntax. End-user programs reliant on such features will be bound to their oddball parsers, such that variations of narrowly scoped deviations from TOML will be isolated, and rightly so.

And if they're popular changes, then the extended usage of the new syntax in the wild would guide the development of future iterations of the TOML standard, which would put the widely adopted syntax into a larger perspective and prevent fragmentation of TOML's user base. Parsers may ignore the new syntax and fall out of conformance, and rightly so. But they'd see enough reasons to implement these features for themselves, and not ignore them like how they would ignore comments.

Do you see why I am proposing this change now?

toml.md Show resolved Hide resolved
@ChristianSi
Copy link
Contributor

ChristianSi commented Jan 16, 2023

I get the purpose of this PR, I but I still don't see the need for it. Are there any pragmas in actual use anywhere in the wild, with some parsers actually supporting and honoring them?

If so, I'd like to see and study them, to see whether they are actually harmful or rather harmless and maybe even suggestive about which additional features TOML should come to support.

If there aren't any, I don't see any cause for action at all. Lets not act prematurely. If there are no pragmas in real life, this whole thing is a non-issue. If there are any, they might teach us something.

Personally, I think the best attitude is "from the viewpoint TOML spec, comments have no meaning", but without forcing every TOML parser to necessarily adapt the same attitude.

@arp242
Copy link
Contributor

arp242 commented Jan 16, 2023

I get the purpose of this PR, I but I still don't see the need for it. Are there any pragmas in actual use anywhere in the wild, with some parsers actually supporting and honoring them?

I don't know if there are any parsers that include comment-based pragmas and I agree it's not needed, but I think a single short line is helpful as it clarifies that comments are always "pure comments". I'm not too worried about some implementation using wild pragma statements, it's just a clarity thing.

We could maybe also tweak the opening line a bit from:

A hash symbol marks the rest of the line as a comment, except when inside a string.

to something like:

A hash symbol outside a string marks the rest of the line as a comment, which must always be ignored for parsing purposes.

Or something along those lines (my suggestion could be improved). That would serve the same purpose.

Co-authored-by: Pradyun Gedam <3275593+pradyunsg@users.noreply.github.com>
@eksortso
Copy link
Contributor Author

I'm guessing, kadir, that your GitHub tool has bugs.

@pradyunsg Can we mark the last three messages (including this one) as off-topic?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants