Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[css-nesting-1] Can we relax the syntax further? #7961

Closed
LeaVerou opened this issue Oct 26, 2022 · 55 comments
Closed

[css-nesting-1] Can we relax the syntax further? #7961

LeaVerou opened this issue Oct 26, 2022 · 55 comments

Comments

@LeaVerou
Copy link
Member

LeaVerou commented Oct 26, 2022

In #7834 we resolved to relax the Nesting syntax to basically "& is only required to indicate a descendant selector only when the selector starts with an ident" (e.g. in & h1 the & is necessary, but & .foo can become just .foo and & > h1 can be just > h1).

I opened this issue to brainstorm about ways to relax the syntax even further and do away with the & for all descendants, without introducing infinite lookahead, so we can have our 🍰 and eat it too.

The problem

If we do not require a & before descendant element selectors, then when followed by a pseudo-class they can look like declarations (which are <ident> : and then anything goes, including another <ident>) to the parser. The parser cannot know if it's dealing with a declaration or a nested rule, until it sees a ; or {, which could come after an arbitrary number of tokens, hence unbounded lookahead.

Non-starters

  • No, we cannot require whitespace after : for declarations, minifiers currently remove that so there is a lot of code out there with declarations that do not include whitespace after :.
  • No, the parser cannot take the list of recognized properties or pseudo-classes into account when deciding whether it's dealing with a declaration or a rule. That would be a most unfortunate coupling, and wouldn't even completely solve the problem (e.g. font is both a valid property name and an HTML element — also with custom elements any vaid property name can also be an element selector).

Proposed algorithm

One way this problem is bounded is that there are only two distinct possibilities: either you have a declaration, or a selector.

In CSS, tokenization is context-less, i.e. parsing a declaration or rule creates the same tokens, it's only the higher-level structures that are different.

Assuming parsing a declaration takes O(M) time and parsing a rule takes O(N) time, it would theoretically solve the problem to naively parse every rule-or-declaration twice (one as declaration, one as rule), and then throw away the structure we don't need. Clearly, that's a silly idea, because that would take O(M+N) time for every rule-or-declaration.

One optimization would be to parse as a declaration (there are far more declarations than nested rules), and keep the list of raw tokens around until the ; or {. Then declarations continue to be parsed in O(M) time, and rules are parsed in O(M+N) time. The extra space needed is minimal, since we don't need to keep these tokens around after the current structure is parsed.

But also, as discussed in #7834, we can rule out the possibility of being a declaration very early for nearly every selector. The only exception is element selectors followed by a pseudo-class (e.g. strong:hover) which are fairly rare in nested stylesheets (you usually want to style the base selector as well, so it's usually & { /* ... */ &:hover {...} })

So in the end, declarations still take O(M) time, nearly all rules still take O(N) time, and some, but very few rules take O(M + N) time.
And there's probably more room for optimizations.
I'd love to hear from implementers whether this is feasible, and whether I'm missing something.

Edit: Another restriction of going that way is that we'd need to forbid {} from property values, including custom properties (having it in strings is fine). But usage of that in the wild seems very low (and a lot of that includes the braces in strings, which will always be fine). Note we can reserve property: { for future syntax, since pseudo-classes cannot be empty so there's no ambiguity there.

@tabatkins
Copy link
Member

The ambiguous case is a small subset of valid rules, but unfortunately it's all valid declarations, so any cost added to the parsing of them is going to balloon the cost of parsing virtually all stylesheets.

@tabatkins
Copy link
Member

(We have the ambiguity of invalid declarations with valid rules to worry about too, but that is a tiny percentage of declarations, so it's not nearly as bad. That is, for existing invalid decls like *color: red;, we'll now pay the cost of creating a rule that we later throw away when we realize it was a decl all along, but since they're so rare it's not a big deal iiuc.)

@LeaVerou
Copy link
Member Author

LeaVerou commented Oct 26, 2022

The ambiguous case is a small subset of valid rules, but unfortunately it's all valid declarations, so any cost added to the parsing of them is going to balloon the cost of parsing virtually all stylesheets.

That is exactly why I'm proposing to parse all ambiguous structures as declarations, so there's no cost added to them.

(We have the ambiguity of invalid declarations with valid rules to worry about too, but that is a tiny percentage of declarations, so it's not nearly as bad. That is, for existing invalid decls like *color: red;, we'll now pay the cost of creating a rule that we later throw away when we realize it was a decl all along, but since they're so rare it's not a big deal iiuc.)

But since these are invalid we throw them away anyway, no?

@tabatkins
Copy link
Member

I'm confused. If when the structure is ambiguous you definitely parse as a declaration, that means fieldset & {...} will just be an invalid declaration, rather than a rule, which doesn't seem to be what you're suggesting. That's just the existing Option 3.

If you mean "parse as a declaration but also do X to allow for reinterpreting as a rule later", then it is indeed a per-declaration cost to do X.

But since these are invalid we throw them away anyway, no?

Yeah, but today we throw them away as an immediate error-handling consequence - we see the * and just switch into "consume tokens until we see a semicolon" mode immediately without creating any additional data structures. With Nesting we'll start making a rule object on the assumption that it's gonna turn out to be a valid rule, and then later throw that out when it ends up being an invalid declaration; that's an extra cost. (Just a rare one.)

@LeaVerou
Copy link
Member Author

I'm confused. If when the structure is ambiguous you definitely parse as a declaration, that means fieldset & {...} will just be an invalid declaration, rather than a rule, which doesn't seem to be what you're suggesting. That's just the existing Option 3.

If you mean "parse as a declaration but also do X to allow for reinterpreting as a rule later", then it is indeed a per-declaration cost to do X.

I mean parse as a declaration, but keep its tokens around until the declaration finishes parsing (not forever). I suppose that's an extra cost per declaration, but it seems a very small one since, you have these tokens anyway.

@tabatkins
Copy link
Member

You have those tokens transiently today, yes. Keeping them around for the duration of the decl parse is precisely the extra cost that I've been told in the past is too much. It would be good to verify that this is indeed the case, but we can't just sweep it under the rug, as it is the precise reason we're not just doing Sass-style nesting today.

@LeaVerou
Copy link
Member Author

It would be good to get confirmation on that, but also, if keeping tokens around for the duration of parsing a declaration is such a big cost, another option could be to …not. Just keep the raw string around, or even just the index in the source. Of course in the rare cases where you have an element selector followed by a pseudo-class you have to re-tokenize, but it may be deemed a worthy tradeoff? I'm not trying to push a specific proposal, but we should all try to brainstorm about this before we rule the syntax we want out as impossible.

@dbaron
Copy link
Member

dbaron commented Oct 26, 2022

I think some of the things that at least some of today's parsers do today for error handling (e.g., have general rules for how to parse the high level style sheet constructs, and then specific parsers for values of specific properties, designed so that error handling bugs in the specific value parsers can't cause violations of the general rules) is sort of like keeping the tokens around. But I'm not sure if it's exactly like keeping the tokens around, and I think this may vary between implementations.

This is an area that I'm not super-familiar with because the CSS parser that I knew best (the old Gecko one) didn't do this sort of multi-layer parsing, but instead relied on having the error handling correct in all the specific value parsers.

I think you probably need to ask the folks who are closest to today's parsers, but perhaps with a slightly firmer write-up of what it is you'd propose to do.

@emilio
Copy link
Collaborator

emilio commented Oct 26, 2022

Gecko's CSS parser (https://github.com/servo/rust-cssparser) doesn't keep tokens around other than the very last token we tokenized (as a performance optimization):

https://github.com/servo/rust-cssparser/blob/d4784093d953c56efbb5aa79d94aed646e580e6b/src/parser.rs#L162

It has the ability to restart at arbitrary points. In fact we use them quite aggressively:

https://github.com/servo/rust-cssparser/blob/d4784093d953c56efbb5aa79d94aed646e580e6b/src/parser.rs#L514-L529

So my understanding is that implementing something like this should have no extra overhead for valid declarations, since we already have the parser state from before the start of the declaration even:

https://github.com/servo/rust-cssparser/blob/d4784093d953c56efbb5aa79d94aed646e580e6b/src/rules_and_declarations.rs#L243

It might have extra overhead for invalid declarations however, which are unsurprisingly rather common (due to vendor prefixes and other stuff...), so it'd need some measurement with different inputs to see how bad it is...

It should be feasible to measure this either inside or outside of Firefox btw, happy to help if someone wants to give that a shot.

@sesse
Copy link
Contributor

sesse commented Oct 26, 2022

It would be good to get confirmation on that, but also, if keeping tokens around for the duration of parsing a declaration is such a big cost, another option could be to …not. Just keep the raw string around, or even just the index in the source.

Blink's tokenizer would not be happy about that; we don't always keep the raw string around after tokenization, so it would not be easy for us to re-tokenize (and tokenization is a nontrivial part of the page load cost, so tokenizing things twice would also be bad). So that's both a confirmation that keeping the tokens around are bad, and a NAK on the workaround of re-tokenizing.

@lilles
Copy link
Member

lilles commented Oct 27, 2022

Even with lookahead, this is still ambiguous per core syntax since blocks are valid component values in a declaration:

div:hover span {}

For instance inside a style rule, does the semi-colon make a difference between:

span {
  div:hover span {}
  color: blue;
}

and:

span {
  div:hover span {};
  color: blue;
}

?

@tabatkins
Copy link
Member

Right, a consequence of resolving this is that curly-brace blocks are no longer a valid (top-level) component of properties. (And conversely, top-level semicolons are not valid in selectors, but that's true of the existing Option 3 already, since we have to avoid confusing invalid declarations for blocks.)

We haven't yet tried to use {} for anything in properties, and I think the loss to future extensibility is minimal.

@tabatkins
Copy link
Member

tabatkins commented Oct 27, 2022

Never mind, Lea reminded me that custom properties can have {} in them , so nope, this won't work, the ambiguity is indeed unresolvable (in some circumstances).

@LeaVerou
Copy link
Member Author

LeaVerou commented Oct 27, 2022

@sesse

Blink's tokenizer would not be happy about that; we don't always keep the raw string around after tokenization, so it would not be easy for us to re-tokenize

How difficult/expensive would it be to change it to do so? Remember we only need to keep stuff around for the current declaration being parsed, not beyond that.

(and tokenization is a nontrivial part of the page load cost, so tokenizing things twice would also be bad).

As I described in the first post, retokenization would only be needed on a tiny fraction of nested rules (namely element selectors followed by a pseudo-class).

So that's both a confirmation that keeping the tokens around are bad, and a NAK on the workaround of re-tokenizing.

I opened this issue so we could all think hard about this problem, instead of treating it as insurmountable from the get go. So far we have been designing syntax around it, axiomatically accepting it cannot possibly be solved in a performant way. I'd like us to challenge that assumption, and for that we need more details from implementers than "NAK". If all implementations have similar issues, that's a stronger argument, but I’m wary about designing syntax that will be suboptimal for authors forever, so that Nesting can be implemented without many changes to Blink's CSS parser. Could we please examine the impact more closely, and maybe do actual measurements, as @emilio is offering to do for Gecko?


@tabatkins Yeah, that's unfortunate. I could ping the HttpArchive folks to measure how many custom property declarations actually contain curly braces. I suspect it's going to be a negligible amount. It is already illegal to include an opening curly brace without a closing one, so it's not like custom properties are completely unrestricted rn and would have to become restricted — they'd just become a tiny bit more restricted.

@sesse
Copy link
Contributor

sesse commented Oct 27, 2022

How difficult/expensive would it be to change it to do so?

The best numbers I have is that it would cost us about 2% of total page load time (not just parse time; all of load time), since we'd need to remove some optimizations that have been said to yield approximately that.

So far we have been designing syntax around it, accepting it cannot possibly be solved in a performant way as an axiom. I'd like us to challenge that assumption, and for that we need more details from implementers than "NAK".

I don't share this goal; I would like us to just get the syntax (any syntax) specified, out the door and be done with it. So no, I won't be spending resources trying to bend our minds around how we tokenize CSS to this avail; I can NAK what is a non-starter for us, and that's what you'll get.

@Loirooriol
Copy link
Contributor

Regardless of whether implementations are willing accept the performance impact, this would also make it more difficult for authors to know whether something is a style declaration or a nested rule in edge cases, thus making the language more messy. I don't like this at all.

@LeaVerou
Copy link
Member Author

LeaVerou commented Oct 27, 2022

@sesse

since we'd need to remove some optimizations that have been said to yield approximately that.

More details would be helpful here. What kind of optimizations? Why would you need to remove them? For which approach of the ones discussed would you need to remove them?

I don't share this goal; I would like us to just get the syntax (any syntax) specified, out the door and be done with it. So no, I won't be spending resources trying to bend our minds around how we tokenize CSS to this avail; I can NAK what is a non-starter for us, and that's what you'll get.

That’s …not a great attitude. 😕 We definitely don’t want to get "any syntax" out the door and be done with it, this is something that millions of developers will be using all over their stylesheets. It matters to get it right.


@Loirooriol

Regardless of whether implementations are willing accept the performance impact, this would also make it more difficult for authors to know whether something is a style declaration or a nested rule in edge cases, thus making the language more messy. I don't like this at all.

This is about dropping the requirement of an explicit & in front of descendant selectors that start with an element selector. We have already resolved to make it optional in all other cases in #7834, so not sure how making it optional in all cases makes the language messier, if anything not having this exception makes it less messy. This is exactly the syntax nested rules would have in Option 4, which I see you voted you prefer over the now current (option 3) syntax. The ability to write selectors like this is the main reason people are considering option 4, so if we could get the same syntax and actually nest, it seems the best of both worlds. Also, when we polled authors, half wanted to omit the & whenever possible in a sample size of 1.6K people, which is not negligible.

@Loirooriol
Copy link
Contributor

The difference is that when restricting to "all other cases", you only need to look at the beginning in order to know if its a selector or a declaration. But with this it won't be easy at all. And yes, I prefer option 4 over 3, but then it's not such a problem because declarations and rules are separate so no confusion at all. And each person can like option 4 for different reasons, being able to omit & is not a reason for me, since my preferred option is 1 with & required.

@Loirooriol
Copy link
Contributor

I think the loss to future extensibility is minimal

I'm concerned this may not be true, e.g. I remember some proposal to use {} to set multiple longhands to different values without having to repeat all the name, like

#element {
  margin: {
    top: 10px;
    left: 20px;
  }
}

Is margin: a type selector and an hypothetical empty pseudo-class, or is it a declaration for the margin property?

@LeaVerou
Copy link
Member Author

Is margin: a type selector and an hypothetical empty pseudo-class, or is it a declaration for the margin property?

This is not a thing, pseudo-classes cannot be empty.

@romainmenke
Copy link
Member

I'm concerned this may not be true, e.g. I remember some proposal to use {} to set multiple longhands to different values without having to repeat all the name, like

Can we start a list of all the stuff we are sacrificing specifically to make this style of nesting possible?

I don't think we can really judge the impact of all this without mapping it out exactly.

@Loirooriol
Copy link
Contributor

I know pseudo-classes can't be empty, that's why I said "hypothetical". Still, the concern is that, if I'm reading the previous comments right, to avoid ambiguities when parsing, declarations couldn't have {}, so the proposal to use {} for setting longhands would be lost.

@LeaVerou
Copy link
Member Author

LeaVerou commented Oct 27, 2022

@romainmenke

I'm concerned this may not be true, e.g. I remember some proposal to use {} to set multiple longhands to different values without having to repeat all the name, like

Can we start a list of all the stuff we are sacrificing specifically to make this style of nesting possible?

I don't think we can really judge the impact of all this without mapping it out exactly.

I understand that you and a couple other people from #7834 feel very strongly about this and disagree with the recent WG resolution. However, can we please keep that discussion in #7834? I opened a separate issue specifically so we could discuss implementation challenges. Whether this syntax is possible is orthogonal to whether it's desirable (though polling authors has shown time and time again that it is).


@Loirooriol

Declaration values starting with { should be feasible (@tabatkins can correct me), since that's unambiguous, as pseudo-classes cannot be empty, so that in particular would be doable. What wouldn't be is declarations using pairs of braces somewhere in the middle.

@romainmenke
Copy link
Member

romainmenke commented Oct 27, 2022

What we are giving up is related to what we are realising. These can not be separated. And I don't like waving it away indefinitely.

These additions to nesting do have a cost and it should be recorded.

It helps everyone to make a better and more informed choice. It is weird that I have to advocate for fully exploring and describing the downsides. I am asking to do this in a separate issue.

I don't see why we should not do this.

@tabatkins
Copy link
Member

Declaration values starting with { should be feasible (@tabatkins can correct me), since that's unambiguous, as pseudo-classes cannot be empty, so that in particular would be doable.

Yes, this distinction would be possible.

@rviscomi
Copy link

Yeah, that's unfortunate but I could ping the httparchive guys to measure how many custom property declarations actually have curly braces in them, I suspect it's going to be a negligible amount.

@LeaVerou reached out to me about this analysis. I found braces in 25,804 custom property values on 2,128 sites (0.01%) in the October HTTP Archive dataset of 15.6 million sites. I'll leave it up to all of you to decide how valid/correct this usage is in a few examples:

https://gocmod.com/xingtu-viet-hoa-vip/

--fonts:  {"font-family":"Roboto","font-options":"","google":"1"}

https://www.familysearch.org/

--fs-mixin-overflow-ellipsis: {"white-space":"(nowrap)","overflow":"(hidden)","text-overflow":"(ellipsis)"}

https://www.wheeloffortune.com/watch

--vh: ${vh}px

https://www.bitrix24.net/

--ui-btn-clock-white: url("data:image/svg+xml;charset=utf-8,%3Csvg width='19' height='19' xmlns='http://www.w3.org/2000/svg'%3E%3Cstyle%3E@keyframes arrow-loader{0%25{transform:rotate(0deg)}to{transform:rotate(360deg)}}%3C/style%3E%3Cg style='animation:arrow-loader 1s infinite linear;transform-origin:center' stroke='%23fff'%3E%3Cpath fill='none' d='M.5 9.475a8.976 8.976 0 0117.95 0 8.976 8.976 0 01-17.95 0z'/%3E%3Cpath d='M9.5 4v5.5'/%3E%3C/g%3E%3Cpath stroke='%23fff' fill='transparent' d='M15 9.5H9.5'/%3E%3C/svg%3E")

Let me know if this was helpful or if there's any other follow-up analysis you need.

@emilio
Copy link
Collaborator

emilio commented May 16, 2023

@tabatkins an interesting thing that I came across while working on nesting. Presumably this changes error recovery in some cases right? E.g., a stray semicolon right now makes the following @media rule invalid, but it probably shouldn't after this.

I think that both boxes here should be the same color (and probably green):

<!doctype html>
<style>
div {
  width: 50px; height: 50px;
  background-color: red;
}

@media screen {
  #a { background-color: yellow }
  ;
  #a { background-color: green } 
}

:root {
  @media screen {
    #b { background-color: yellow }
    ;
    #b { background-color: green } 
  }
}

</style>
<div id="a"></div>
<div id="b"></div>

Does that match your intuition? There are some tests that effectively expect the yellow behavior in the top level context, but I think it'd be unfortunate if stuff was inconsistent here...

@tabatkins
Copy link
Member

Yes, this was noted as the (afaict) only non-editorial change not directly related to Nesting itself in #8834.

I agree that this is almost certainly a good change; it's definitely preferable that @media and related rules parse the same whether they're nested or not. This also makes the spec architecture in general a little better, because it means I no longer have to make the core parsing behavior of blocks depend on what kind of block it is.

At the moment I still have top-level parsing preserve the old behavior, because I suspect the compat impact is larger, but I'd be willing to change (letting the top-level context parse declarations and just throw them away) if someone's willing to put in the legwork to determine compat.

(If we could continue discussion on this topic in #8834 that would be great.)

moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Jun 5, 2023
…abled.

See w3c/csswg-drafts#7961 (comment)

In general I agree with Tab that media blocks should parse the same
regardless of whether they are nested.

We should add a test that covers this to WPT, but for now tweak the
local reftest.

MANUAL PUSH: Trivial orange fix CLOSED TREE
ErichDonGubler pushed a commit to ErichDonGubler/firefox that referenced this issue Jun 7, 2023
…abled.

See w3c/csswg-drafts#7961 (comment)

In general I agree with Tab that media blocks should parse the same
regardless of whether they are nested.

We should add a test that covers this to WPT, but for now tweak the
local reftest.

MANUAL PUSH: Trivial orange fix CLOSED TREE
moz-wptsync-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jun 28, 2023
As per w3c/csswg-drafts#7961 bare tag
selectors and so are fine now.

Differential Revision: https://phabricator.services.mozilla.com/D181799

bugzilla-url: https://bugzilla.mozilla.org/show_bug.cgi?id=1839946
gecko-commit: b271d6ad5492b424ca52dd82097b08f90cd07be5
gecko-reviewers: boris
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Jun 28, 2023
moz-wptsync-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jun 28, 2023
As per w3c/csswg-drafts#7961 bare tag
selectors and so are fine now.

Differential Revision: https://phabricator.services.mozilla.com/D181799

bugzilla-url: https://bugzilla.mozilla.org/show_bug.cgi?id=1839946
gecko-commit: b271d6ad5492b424ca52dd82097b08f90cd07be5
gecko-reviewers: boris
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue Jul 1, 2023
As per w3c/csswg-drafts#7961 bare tag
selectors and so are fine now.

Differential Revision: https://phabricator.services.mozilla.com/D181799

UltraBlame original commit: b271d6ad5492b424ca52dd82097b08f90cd07be5
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this issue Jul 1, 2023
As per w3c/csswg-drafts#7961 bare tag
selectors and so are fine now.

Differential Revision: https://phabricator.services.mozilla.com/D181799

UltraBlame original commit: b271d6ad5492b424ca52dd82097b08f90cd07be5
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this issue Jul 1, 2023
As per w3c/csswg-drafts#7961 bare tag
selectors and so are fine now.

Differential Revision: https://phabricator.services.mozilla.com/D181799

UltraBlame original commit: b271d6ad5492b424ca52dd82097b08f90cd07be5
ErichDonGubler pushed a commit to ErichDonGubler/firefox that referenced this issue Jul 5, 2023
@tabatkins
Copy link
Member

Specs should be updated to the resolutions, so closing now.

@yisibl
Copy link
Contributor

yisibl commented Aug 31, 2023

WebKit implemented here: WebKit/WebKit#17189

@AlbertMarashi
Copy link

AlbertMarashi commented Nov 27, 2023

By the way, if anyone implementing this was confused about how to know whether we're parsing a declaration or a rule this was my algorithm:

  • read_declaration_rule_or_at_rule(parser)
    • if parser.match("@")
      • return read_at_rule(parser)
    • let start_index = parser.index (current character index)
    • parser.read_until(SEMICOLON_OR_OPEN_BRACE_OR_CLOSE_BRACE) (don't consume the final char though ; { })
    • if parser.match("{")
      • parser.index = start_index (rewind to start of rule)
      • return read_rule(parser)
    • else we're reading a declaration (it ends with ; or } which are both valid endings)
    • parser.index = start_index (rewind to start of declaration)
    • return read_declaration(parser)

@tabatkins
Copy link
Member

You might want to check the "Implementation Note" in "consume a block's contents" for even more efficiency wins in this regard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests