Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible positional restrictions on table definitions #446

Open
alexcrichton opened this issue Feb 3, 2017 · 24 comments
Open

Possible positional restrictions on table definitions #446

alexcrichton opened this issue Feb 3, 2017 · 24 comments

Comments

@alexcrichton
Copy link
Contributor

Hello! I'm currently in the process of rewriting a TOML parser and have come
across a curious aspect of TOML. I believe that a document such as this is
valid:

[a]
key = 1

[unrelated-table]
key = 1

[a.b]
key = 1

That is, you can define a table across many portions of a document. There's
certainly nothing wrong with this per-se, but it means that a parser decoding
this document is forced to keep all intermediate tables in-memory or in some
side data structure. For example if you'd like to decode the table a from
this document you're forced to parse unrelated-table and store it in memory
elsewhere to come back to later (if requested).

This is definitely a super niche concern, of course! TOML is intended for
configuration, which should never be something like gigabytes large. On the
other hand though this does make writing a parser a little bit more difficult
in some situations, and to me at least I'd subjectively find such a definition
relatively wonky.

I wonder, what would others think about adding a clause such as this to the
spec?

Although one table may be defined across multiple sections (if its keys are
themselves tables) all sections that define a table must be contiguous. For
example this is invalid:

[a]
key = 1

[unrelated-table]
key = 1

[a.b]
key = 1

Similarly, this is also invalid

[a.sub-table]
key = 1

[a.unrelated-table]
key = 1

[a.sub-table.b]
key = 1

This does make the otherwise very simple spec somewhat more complicated, and also somewhat more arbitrarily. If this were only about parser performance I don't think I'd bother with the suggestion, but subjectively interpreting this as odd prompted me to file this report!

@BurntSushi
Copy link
Member

I think probably the biggest concern I'd have with this is that it's a breaking change. It could make an otherwise valid TOML file invalid. :-(

@alexcrichton
Copy link
Contributor Author

Ah yes good point, I forgot to mention that!

I wasn't quite sure whether the 0.4 -> 1.0 transition (assuming that happens) is just a clarification as opposed to a breaking release. If no breaking changes are being made then I think the ship has sailed!

@lmna
Copy link

lmna commented Feb 3, 2017

Suggested change would make TOML more obvious and friendly to a human reader, and more consistent as well.

  1. "Obvious". Table-followed-by-some-other-completely-unrelated-table is similar to a topic change in a talk: human would naturally conclude that her companion has nothing more to say on the first topic. Thus it is surprising that related parts of the first table are allowed to be scattered all over the document.

  2. "Friendly". How would a human make sure that she has read all contents of a specific table? The easiest way is to find the first entry of the table and then read through the document till unrelated stuff. Unfortunately, this way is unreliable now, thus one has to use search in routine and excessive manner. Users of TOML 1.0 could be free of this annoyance!

  3. "Consistent". Let’s take a look on example of invalid document from the spec:

     # Doing so is invalid
     
     [a]
     b = 1
     
     [a]
     c = 2
    

    This example could be interpreted like "reopening a table is forbidden". So it would be consistent to forbid "reopening" a table with a sub-table as well.

@pradyunsg
Copy link
Member

If TOML 1.0 shouldn't be breaking stuff that's already there in the wild, I suggest making this a post-1.0 item.

Personally, I like this idea and getting it in 1.0 might be a good idea too.

@alexcrichton
Copy link
Contributor Author

@pradyunsg yeah I'd be totally fine making this post-1.0, I wouldn't consider this critical at all!

@pradyunsg
Copy link
Member

@mojombo Your thoughts on this?

@mojombo
Copy link
Member

mojombo commented Dec 6, 2017

I'd very much like to keep 1.0 backwards compatible with 0.4. In the likely event that dotted keys go in, allowing for this type of looseness might actually be desirable in some cases. I think we need more experience with it before we should lock it down, so let's roll with the status quo for now and re-evaluate for later versions.

@mojombo mojombo closed this as completed Dec 6, 2017
@pradyunsg
Copy link
Member

pradyunsg commented May 11, 2018

(noting this here so that I don't fail to mention it; this issue can be reopened as a post-1.0 issue)

I can imagine that TOML changing table tracking logic to be something like:

  • a table can be "open", "closed" or "unexplored".
    • valid transitions: "unexplored" -> "open" -> "closed"
    • i.e. "closed" a table can not be opened again
  • BOF opens root table, EOF closes any open tables
    • root is the super-table for top level tables
    • can open a new table only if it's super-table is open
  • a.b.c.d opens the table a, a.b, a.b.c (if they aren't already open)
  • a.b = {c = 1, d = 2} opens a, a.b and then immediately closes a.b because that's defined inline.

The exact same rules apply regardless of the use of dotted keys.

A demo:

# 'root' table is opened
key = "value"
assignment.first.key = "value"   # opens assignment, assignment.first
assignment.second.key = "value"  # closes assignment.first; opens assignment.second
# INVALID here: assignment.first.another = "value"
mytable.key = "value"  # closes assignment.second, assignment; opens mytable
# INVALID here: assignment.second.another = "value"
# closes mytable and 'root' table

@eksortso
Copy link
Contributor

@pradyunsg How would "unexplored" table status be used? The only application that I could imagine it being used for are undefined supertables. That is to say, if a.b.c.d is specified before a, a.b and a.b.c are mentioned, then wouldn't it be clearer to say that a and a.b are unexplored, rather than open? (If a.b.c.d is a table, then we'd treat a.b.c the same way that we'd treat a and a.b.)

I do see the value in doing this sort of thing. Since tables can (currently) be specified in almost any order, an unexplored table status acknowledges those tables' existence without applying restrictions on their future application that an open status would suggest.

@pradyunsg
Copy link
Member

How would "unexplored" table status be used?

Refers to all tables that hasn't been mentioned in the file yet.

That is to say, if a.b.c.d is specified before a, a.b and a.b.c are mentioned, then wouldn't it be clearer to say that a and a.b are unexplored, rather than open?

That'll make assignment.second.another in the example above valid. I'm not sure if that's preferable.

@eksortso
Copy link
Contributor

@pradyunsg I don't see how. When assignment.second.key was previously defined, that implicitly opens the table assignment.second. When mytable.key is defined, then implicitly mytable is opened and assignment.second is closed. So assignment.second.another is still invalid there, according to the scoping logic.

I don't know if I've mentioned this before, but I like to think that a table's keys are always defined across a continuous range of lines (not counting parent or child tables). This logic provides for that. If it doesn't become an adopted scoping standard, then I would continue using it as a style preference.

Though I think you're proposing something that is more restrictive. You want it so that you "can open a new table only if its super-table is open." This would only allow subtables to be defined within the scope of their ancestors. I would only apply that rule to inline tables.

@pradyunsg
Copy link
Member

pradyunsg commented May 16, 2018

This is what I get for responding late at night. I meant, "this would allow for an assignment.third.key where assignment.second.another is in the example above" (which might be an incorrect statement too).

That said, I guess we can hold off this discussion until after 1.0 is done with. :)

@ChristianSi
Copy link
Contributor

ChristianSi commented May 24, 2018

I think it would be a VERY BAD idea to treat this is a "post-1.0" issue. It's a breaking change and hence it should happen as soon as possible – preferably before a new 0.x version is released, and CERTAINLY before 1.0. After 1.0 is released, more and more files might rely no such such restrictions being present, and there will be a great of deal of hesitation (and for good reasons!) about making any non-essential breaking changes.

If we want to insert such restrictions (and I DO think it's a good idea), we should do so SOON. Currently dotted keys are quite new and no tagged TOML version has them, so breakage will be low. But the longer we wait, the harder it becomes.

@sgarciac
Copy link
Contributor

sgarciac commented May 24, 2018

I think there are legitimate reasons to want to order tables in such way: you might want to prioritize access to some keys that belong to different tables by keeping them on top of the TOML file.

For example, think of the following configuration file for an hypothetical application that tests a web app:

[api_server]
hostname = 'api.dev.whatever.com'

[frontend_server]
hostname = 'www.dev.whatever.com'

[api_server.details]
auth_path = '.....'
ping_path = '.....'
# potentially very long list of details

[frontend_server.details]
homepage_path = '.....'
login_path = '.....'
# etc...

If I want to reuse the same file to test different deployment environments by changing only the hostname of the servers, it makes sense to keep that information on top of the file, separated from the rest of the server's configurations.

@guai
Copy link

guai commented May 25, 2018

I'd already create an issue about I could not rewrite an existing config into toml without reordering
I had toplevel declarations at the bottom which is also forbidden in toml right now
That's bad, my users got used to that order, switching to another syntax was not a problem at all, but reordering was.
It also is a step toward toml to be a drop-in replacement for json and yaml and a possibility to stream toml docs
So I totally pro this change

@pradyunsg
Copy link
Member

@ChristianSi These restrictions would be backwards incompatible. The intention of TOML 1.0 being backwards compatible with 0.4 is noted in the README. This is why I labelled this as a post 1.0.

Personally, it's the sort of thing where if we make this change, it'd mandate a 0.5 release (because this is the sort of change you want feedback on) and then a 1.0. I do think that if we don't do a 0.5 (which is what I'd prefer), 1.0 should mention that "hey there'll be restrictions on these in the future", possibly with some guarantee about what would stay valid.

@ChristianSi
Copy link
Contributor

@pradyunsg You're right, the proposed restrictions, when applied to [tablename]-tables ([]-tables) would break 0.4 files and so cannot go into 1.0 My intuition tells me that they won't go into 2.0 either, because the advantage they offer is just too small to justify breakage. Also, @sgarciac has given a reasonable example of why these restrictions might actually be bad idea.

What we could introduce now without introducing breakage are restrictions on the order of dotted keys, if we want to have those. And my intuition again tells me that we should introduce those now if we want to have them, for the same reason as given above.

So what we should decide now is whether to enforce such restrictions, say (informally): "All dotted keys that define a subtable must be placed together" – that would amount to what you have proposed above, but only for dotted keys, not for []-tables. Personally, I'm pretty undecided on whether we need such a restriction or whether we should just consider it a best practice – but I think anyone who wants it should speak out in favor of it now, since after the next tagged version is out, this too would become a breaking change and hence becomes considerably more unlikely to happen at all.

@eksortso
Copy link
Contributor

eksortso commented Jun 5, 2018

All of the keys defining a table should be kept in one range of lines, in principle. That's the case with tables defined with header rows. That's the case with inline tables, which in nearly all cases take a single line. And that ought to be the case with dotted keys as well. This principle doesn't apply to subtables and non-root supertables, which can be defined in any order (and dotted key paths allow subtables to be defined in the middle of their parent tables' definitions). But for scalar values, arrays, and inline tables, all of their keys should be kept in one continuous range.

@eksortso
Copy link
Contributor

eksortso commented Jun 5, 2018

Will restrictions be placed on defining subtables of tables defined using inline table values?

@pradyunsg
Copy link
Member

What we could introduce now without introducing breakage are restrictions on the order of dotted keys, if we want to have those. And my intuition again tells me that we should introduce those now if we want to have them, for the same reason as given above.

My intuition is quite the opposite -- this would mean there's an inconsistency in the ability to declare using dotted keys vs tables and simple keys.

@ChristianSi
Copy link
Contributor

@pradyunsg

this would mean there's an inconsistency in the ability to declare using dotted keys vs tables and simple keys.

Good point. And yet, consider this case:

mainkey1 = '...'
subtable.key1 = '...'
subtable.key2 = '...'
mainkey2 = '...'

That's legal with dotted keys and it will remain legal even with the restrictions you've proposed above. Indeed, it has to remain legal, since otherwise sorting all the keys in a table e.g. alphabetically would become impossible. But It's not possible using tables with simple keys.

Hence I'd say we don't have to worry about consistency all that much, since true consistency is impossible anyway.

@pradyunsg
Copy link
Member

That's indeed an interesting case and a compelling argument. In that case, let's move the discussion for dotted-keys restrictions back to #499.

@LongTengDao
Copy link
Contributor

LongTengDao commented Aug 17, 2019

  1. Generally speaking, keys out of order is not wise.
  2. @sgarciac had given a very good sample for proper use, so it should still be valid.
  3. But in any case, below should be invalid, at least warn not to use in the 1.0 spec (then things like "after 1.0 is released, more and more files might rely no such restrictions being present, and there will be a great of deal of hesitationchanges" maybe won't happen), until the users found a good practice and tell us in the issue:
[[a]]
[b]
[[a]]

@septatrix
Copy link

I am not even sure what a sensible API could be implemented to allow for a TOML parser which does not keep all tables in memory. Returning a dict, object, or whatever it is called in your language of choice as most implementations do it would not work in this case as then all data would have to be kept in memory regardless. So I can think of two possible APIs:

  1. It would be possible to implement an API which asks for specific keys and only then is the document parsed (e.g. toml.get(file, "foo.bar[1].baz"). This, however, is already possible today. One difference is that this restriction would allow for an early return once the table has been "closed" though doing so would in practice also permit parsing of invalid files such as those with keys defined twice. So no memory improvement and only a slight runtime optimization at the cost of a non-conformant implementation. And if one wants to use a non-conformant implementation they might as well already do so today.
  2. An alternative option is to stream out the configuration though even then I am not sure how exactly this would look - you cannot stream out dictionaries because then you would still have to store the inner dictionaries in memory. So the only option I can think of is streaming out the fully qualified key together with its value (e.g. (foo.bar[1].hello, 0), (foo.bar[1].world, 1), ...). Again, we have no real advantage from "closing" a table early. We could signal this to the application which might perform some tasks earlier but that again is not a memory improvement, only a slight runtime optimization while simultaneously waiving the option to warn of invalid configurations when later on duplicate keys are found.
  3. In some rather low-level languages and embedded use-cases there can be advantages from explicitly closing a dictionary, allowing for the removal of some temporary structures, freeing unused reserved space etc. However, these are quite rare and when you need to parse a config file anywhere close to a size where this would be problematic you already took a wrong turn somewhere else. Furthermore it is not uncommon in these cases to have a known schema which would already allow for optimizations in this direction - and often there are libraries in this area which consciously only implement a subset of specifications which could already be done today without hindering the rest of the ecosystem.

If someone can come up with a reasonable API which show that non-negligible memory or performance gains can be achieved with such a restriction I would be happy to hear about it. Until then this is mostly a subjective discussion about what feels more "human" and "obvious". For this aspect there is already a great argument in #446 (comment) of where out-of-order table definitions are desired and advantageous. So restricting this - at the additional cost of breaking compatibility - seems like a no-go to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests