Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSS Extra: (1) rename attr-name, attr-value (2) change pseudo-class token pattern (3) add tag, selector-list, combinator token #2373

Merged
merged 16 commits into from May 8, 2020

Conversation

dev-itsheng
Copy link
Contributor

As the title means, this PR includes three parts.

  1. rename attr-name, attr-value in attribute property in CSS extra tokenize.

    The previous property name of attr-name is attribute, which looks as same as the wrapper attribute property, so I changed it. The another reason, which is also the reason for renaming value to attr-value is I reviewed the theme CSS, found they have attr-name and attr-value. The name of token would better to keep up with the CSS selector.

    By the way, when I created the first PR, I said that id in pug is different from the CSS and CSS-extern language, but now I changed my thought, the pug's language(especially the selector) is similar like the CSS, and after I reviewed the theme CSS and extra languages' tokenize, I found that id and class name can't be conflict, so I consider to change it to previous name, which is id, class rather than attr-id, attr-class.

  2. change pseudo-class token pattern.

    This is so difficult that I try to do it.

    In my opinion, CSS pseudo class should contain the class name, (, inside content and ). so at first I change it from /:[-\w]+ to /:[-\w]+(\(.*?\))?/, but it can't pass the :where(p:not(.class)) test, the regex is not good at handling pairing and nested symbols, such as ( and ).

    To pass it, I have to add \([^\(]*\([^\)]*\)[^\) to match two brackets which are nested. It already have something incorrect, can't match three brackets, and more. After all the regex's length can't be infinity.

    I have no idea to solve it, could you have better solution?

    In my opinion, the three or more pseudo class with brackets nested is almost hard to see, it is possibly that two is enough?

    Inside the pseudo-class token, I add all the property in selectorInside except tag and pseudo-class itself. pseudo-class could be infinite loop and tag could match the wrong token (because pseudo-class isn't exist, tag will match the hover in :hover, the result is incorrect).

  3. add tag, selector-list, combinator token

    This is simple. I have add enough tests to improve their correctness.

Copy link
Member

@RunDevelopment RunDevelopment left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. That's good. In the original PR, I intentionally didn't choose these classes because they are colored by almost every theme and Lea once said that this might not be optimal.
    Thinking about this some more, I also think that we should make it attr-name and attr-value because these classes are the ones used by markup.

  2. I don't like it. Matching an expression of nested brackets should be avoided wherever possible for a few reasons. The first reason is that it's not possible for any number of nested brackets (see below) but the main problem is that you have to know what's inside the brackets. I.e. there could be a pseudo-class like this :not(span[attr=")"]). You would also have to keep track of comments and strings inside every level of nested brackets. This can be done but the resulting pattern will probably be around 300~500 characters long.

    The only advantage of your approach is that you capture the nested structure of pseudo-classes. I also wanted to implement this at some point but decided against it for the above reasons. The current solution is simple and doesn't have any of the above problems.

  3. See comments.


To pass it, I have to add \([^\(]*\([^\)]*\)[^\) to match two brackets which are nested. It already have something incorrect, can't match three brackets, and more. After all the regex's length can't be infinity.
I have no idea to solve it, could you have better solution?

There is no better solution. Matching any number of nested brackets requires a context-free grammar. The best you can do with regular expressions is to support up to a finite number of nested levels. But if you do that, then the length (and complexity) of your regex will grow in O(n) (or O(2^n), depends) with the number of nestings.

components/prism-css-extras.js Outdated Show resolved Hide resolved
components/prism-css-extras.js Outdated Show resolved Hide resolved
components/prism-css-extras.js Outdated Show resolved Hide resolved
@dev-itsheng
Copy link
Contributor Author

The || is column combinator which is a new combinator that has just appeared in the specification. At present, the compatibility of the browser is not enough for it to be used in actual projects, the description of MDN is so poor that please allow me introduce it briefly to know what it is used for.

Table layout and grid layout both have the concept of column. Sometimes we want to control the style of the whole column. There are two methods: one is by means of :nth-col() or :nth-last-col() pseudo-class, but currently browsers have not yet supported these two pseudo-classes; there is also a way to achieve the and elements in the native Table layout, this method is very compatible.

Let's take a quick look at these two elements through a simple example. For example, the HTML code of the table is as follows:

<table border="1" width="600">
    <colgroup>
        <col span="1">
        <col span="2" class="ancestor">
        <col span="2" class="brother">
    </colgroup>
    <tr>
        <td></td>
        <th scope="coi">Descendant combinator</th>
        <th scope="coi">Child combinator</th>
        <th scope="coi">Adjacent sibling combinator</th>
        <th scope="coi">General sibling combinator</th>
    </tr>
    <tr>
        <th scope="row">example</th>
        <td>.foo .bar {}</td>
        <td>.foo > .bar {}</td>
        <td>.foo + .bar {}</td>
        <td>.foo ~ .bar {}</td>
    </tr>
</table>

It can be seen that the table has 5 columns. Among them, there are 3 <col> elements in the <colgroup> element. As can be seen from the value of the span attribute, these 3 <col> elements occupy 1 column, 2 columns, and 2 columns, respectively. At this point, we set the background color for the next two <col> elements, and we can see that the background color is acting on the entire column. The CSS is as follows:

.ancestor {
    background-color: dodgerblue;
}

.brother {
    background-color: skyblue;
}

The result is:

image

But sometimes our cell does not belong to exactly which column, but cross-column, at this time, the <col> element will ignore these cross-column elements. The following is an example which modified from MDN:

<table border="1" width="200">
    <colgroup>
        <col span="2">
        <col class="selected">
    </colgroup>
    <tbody>
        <tr>
            <td>A</td>
            <td>B</td>
            <td>C</td>
        </tr>
        <tr>
            <td colspan="2">D</td>
            <td>E</td>
        </tr>
        <tr>
            <td>F</td>
            <td colspan="2">G</td>
        </tr>
    </tbody>
</table>
col.selected {
    background-color: skyblue;
}

At this time, only cell C and E have a sky blue background color. Although cell G also covers the third column, it is also ignored because it also belongs to the second column. The effect is as follows:

image

So there is a problem. In many cases, we just want cell G to have a background color. As long as the column is included, it is considered to be the target object. In response to this demand, column combinator came into being.

The column combinator is written as a double pipe (||), which is two characters, which is the same as the logical OR in the JavaScript language, but it does not mean "or" in CSS. It is more appropriate to use "belong" to explain.

With the following CSS selector, we can make cell G also have a background color:

.col.selected || td {
    background-color: skyblue;
}

The meaning of .col.selected || is to select all <td> elements belonging to col.selected, even if this <td> element spans multiple columns.

So, you can see the effect shown below(in future):

image

The above result was achieved with inline-style, because there is no browser support it now.

@dev-itsheng
Copy link
Contributor Author

It's consistent with markup but it also completely changes the look of CSS selectors.

I want to highlight the tag token in CSS selector, and found that each theme CSS contains .token.tag which can match it.

If add prefix, such as selector-id, selector-tag, or rewrite language's descendant combinator such as [class*="language-css"] .token.tag in CSS, it can make the same look.

If the CSS can't match tag token, its color is initial.

I want each word of code can find its own token, and be highlighted.

@dev-itsheng
Copy link
Contributor Author

I think this should be highlighted as punctuation.

Yes, I will change it. the selector-list is copied from CSS Level

@dev-itsheng
Copy link
Contributor Author

Matching an expression of nested brackets should be avoided wherever possible for a few reasons.

Shall we use JavaScript instead of regex to deal with this situation?

…`tag` behind the `n-th` token to avoid problem.
@dev-itsheng
Copy link
Contributor Author

The current solution is simple and doesn't have any of the above problems.

But after add tag token, something was happened. such as the en in :lang(en) matches tag. I consider to give them some special token like even and odd, they are n-th tokens.

I read the RFC docs, found its legal value are so many that it is difficult to list them, so please allow me to be lazy and use the value in this case as a tag token.

@dev-itsheng
Copy link
Contributor Author

Put the combinator and tag behind n-th can avoid the problem that + and n in :nth-child(+2n - 1) are matched to them.

@RunDevelopment
Copy link
Member

(You know that you can replay to comments directly, right?)

Matching an expression of nested brackets should be avoided wherever possible for a few reasons.

Shall we use JavaScript instead of regex to deal with this situation?

How?

But after add tag token, something was happened. such as the en in :lang(en) matches tag. I consider to give them some special token like even and odd, they are n-th tokens.

That's a problem. I don't see any easy solution for this one. Our CSS definition would have to know that argument type of every pseudo-class. Unfortunately :lang isn't the only one where the argument isn't a selector.
Maybe we could just leave out tag until we have a good solution?
This would also solve the issue that the current tag definition doesn't capture custom tags with non-ASCII characters. (See the markup tag definition.)

@dev-itsheng
Copy link
Contributor Author

(You know that you can replay to comments directly, right?)

Yes, next time.

This would also solve the issue that the current tag definition doesn't capture custom tags with non-ASCII characters. (See the markup tag definition.)

I see Web Components and know the rule of tag name.

Consider the narrow meaning of tag in selector, it contains the HTML5 tags and Web Components tags. In my opinion, There is no developer who use non-ASCII characters as tag name, so shall we keep the current changes and ignore other situations?

Matching an expression of nested brackets should be avoided wherever possible for a few reasons.

Shall we use JavaScript instead of regex to deal with this situation?

How?

In .NET Framework, the regex has a more strong function which named Balanced Constructs. It can help me match the nested charaters.

Although the regex engine of JavaScript can't support it, we can do it in pure JavaScript in its thought.

Let me describe the general idea:

We use stack as data structure, then scan the code from beginning to end, when matched (, then push it, when matched ), pop it, the similar as the ' and " which some brackets in string grammar, when ' or " in the stack, the ( or ) should be ignored.

Implementing this idea needs to consider a large number of situations and designing various tests. But if we do it, it can finally be made.

An extra question is I don't know how to replace the regex to JavaScript function in tokenize.

@RunDevelopment
Copy link
Member

Consider the narrow meaning of tag in selector, it contains the HTML5 tags and Web Components tags. In my opinion, There is no developer who use non-ASCII characters as tag name, so shall we keep the current changes and ignore other situations?

I really think it's best just to drop tag for now. You already showed that it's easy to get false positives and if it's not consistent with markup, then why have it?


The PDA proposal is really interesting. We actually have a few languages that have exactly this problem and up until now we just worked around using these regex approximations I described above.
I'll investigate a bit further.

However! I still think that CSS doesn't need it.

An extra question is I don't know how to replace the regex to JavaScript function in tokenize.

tokenize doesn't care what you give it as long as it behaves like a RegExp. So if we implement exec and lastIndex, tokenize will just use it. (This ignores the little check for the global flag but we can get around that.)

@dev-itsheng
Copy link
Contributor Author

I accept your point that do not add defective code before have found a perfect solution.

Finally I removed the tag token and relative tests, but I think the developers or users more than I will consider "Why are there no tags here?" And try to add this rule. To avoid letting them fall into this kind of conclusion we have reached an agreement, I left some comment which point to this link.

Until now, the title which contains three parts of change, the first one is kept, the second one is revoked, the last one modified some code and removed tag token.

@RunDevelopment RunDevelopment merged commit e523f5d into PrismJS:master May 8, 2020
@RunDevelopment
Copy link
Member

Thank you for contributing!

Regarding the nested matching: I opened an issue to a library for this to Prism. I'm currently trying to figure out how to implement this, so if you know something, please point me in the right direction.

@dev-itsheng
Copy link
Contributor Author

Two years ago, I used such parsimmon to make a syntax parser for a SQL-Like language, I think it can also help us to parse the nested and even recursive syntax.

Next I try to use this library to achieve the matching of complete pseudo-classes. In order to do this , first of all, I should get this library and use it, such as npm or cdn, to make it easy, I choose JSRun, which similar as JSFiddle, but this site is located in China, and the visit is relatively fast and more language advantages for me.

After importing https://unpkg.com/parsimmon@1.13.0/build/parsimmon.umd.min.js, we can get global Parsimmon object.

// These property can be used in following
const {seq, alt, regex, string, lazy, succeed} = Parsimmon;
const pseudoClass = regex(/:[\w-]+/);

It can match :hover:

pseudoClass.parse(':hover')     // {status: true, value: ':hover'}

To match many pseudoClass such as :hover:active, we can add .many():

const pseudoClass = regex(/:[\w-]+/)
                        .many();

pseudoClass.parse(':hover:active')  // {status: true, value: [':hover', ':active']}

To make the result similar to Prism, we can add .map() and set the param to a Object with type property:

(The map method is for token rather than array, so it should be set before .many())

const pseudoClass = regex(/:[\w-]+/)
                        .map(token => ({type: 'pseudo-class', content: token}))
                        .many();

pseudoClass.parse(':hover:active').value  // [{type: 'pseudo-class', content: ':hover'}, {type: 'pseudo-class', content: ':active'}]

Now considering bracket:

  • The bracket's structure is so easy that we use string rather than regex;
  • The bracket is a structure that are side by side with pseudo-class, so we use seq.
const pseudoClass = seq(
                        regex(/:[\w-]+/),
                        string('('),
                        string(')')
                    )
                        .map(token => ({type: 'pseudo-class', content: token}))
                        .many();

pseudoClass.parse(':hover()').value  // [{type: 'pseudo-class', content: [':hover', '(', ')']}]

Until now, :xxx() can be matched, but :xxx can't, we should make the bracket optional, create .opt function by .or and use it:

function opt(parser) {
    // use `null` to match a empty result
    return parser.or(succeed(null))
}

const pseudoClass = seq(
    regex(/:[\w-]+/), 
    opt(
        seq(
            string('('), 
            string(')')
        )
    )
).map(token => ({type: 'pseudo-class', content: token})).many();

pseudoClass.parse(':hover():hover').value   
// [{type: 'pseudo-class', content: [':hover', ['(', ')']}, {type: 'pseudo-class', content: [':hover', null]}]

Then match the content between brackets, at first consider :not(), includes tag, class, id. To match one of the three possible values, we use .alt:

const pseudoClass = seq(
    regex(/:[\w-]+/), 
    opt(
        seq(
            string('('),
            alt(
                regex(/#[\w-]+/), 
                regex(/\.[\w-]+/), 
                regex(/[\w-]+/)
            ),
            string(')')
        )
    )
).map(token => ({type: 'pseudo-class', content: token})).many();

pseudoClass.parse(':not(tag):not(.class):not(#id)').value
// [
//     {type: 'pseudo-class', content: [':not', ['(', 'tag', ')']]},
//     {type: 'pseudo-class', content: [':not', ['(', '.class', ')']]},
//     {type: 'pseudo-class', content: [':not', ['(', '#id', ')']]},
// ]

At last, it is time to deal with the nesting problem, like the :where(p:not(.class)). To use tokenize recursively, we should use .lazy:

const pseudoClass = lazy(() => seq(
    regex(/:[\w-]+/), 
    opt(
        seq(
            string('('),
            opt(
                alt(
                    regex(/#[\w-]+/), 
                    regex(/\.[\w-]+/), 
                    regex(/[\w-]+/),
                ).many()
            ),
            opt(pseudoClass),
            string(')')
        )
    )
).map(token => ({type: 'pseudo-class', content: token})).many());

pseudoClass.parse(':where(p:not(.class))').value
// [
//     {
//         type: 'pseudo-class', 
//         content: [
//             ':where',
//             [
//                  '(',
//                  ['p'],
//                  [
//                      type: 'pseudo-classs',
//                      content: [
//                          ':not',
//                          [
//                              '(',
//                               ['.class'],
//                               [],
//                              ')'
//                         ]
//                      ]
//                  ],
//                  ')',
//              ]
//         ]
//     }
// ]

So far we have basically implemented a pseudo-class matcher that supports recursion. Of course, there are some other unfinished parts. For example, the array of returned results should be processed by map to get better results. The selector only considers the id / class / tag, in fact, there is an attribute selector, there are even, odd, language that meets the tag syntax, the content in brackets may also have numbers(2n+1), different selectors are not separated(we can use opt for each selector and determine the array index of the returned result, or can use the regular expression or .startsWith method to determine the type) etc.

Using a similar method, we can build a more powerful matching engine than pure regular expressions, to match what regex cannot do.

@RunDevelopment
Copy link
Member

The main goal shouldn't be to replace the current matching engine. One of Prism's strongest points is that it's easy to create new languages. If you can write some regexes, then you will be able to make a Prism language. (And we certainly don't want to rewrite 200+ languages.)

But we should still be able to use parsimmon if we wrap it so that it has the same API as a RegExp.

There are also a few problems with using external libraries:

  1. Size: parsimmon is 14kB which is about the size of 5~10 languages.
  2. Speed: Creating the AST that these parsers usually build takes time and disallows some optimizations. We don't need any of that.
  3. Integration: We don't have a good way of importing external libraries. The easiest is to just copy the library and make it a prism-xxxx.js file. Not nice.

That being said, I can see why we'd want to use an external library. I quickly implemented a PDA by changing an NFA implementation of mine and it's not easy. 1k lines of code we'd have to maintain.

@dev-itsheng
Copy link
Contributor Author

dev-itsheng commented May 13, 2020

The main goal shouldn't be to replace the current matching engine.

My opinion is not replace, but supplement. Pure regex can't match nested recursion definition, but JavaScript can.

We use stack as data structure, then scan the code from beginning to end, when matched (, then push it, when matched ), pop it, the similar as the ' and " which some brackets in string grammar, when ' or " in the stack, the ( or ) should be ignored.

Before this, to match the pseudo-class, I consider using JavaScript as a supplement to regex. Parsimmon is a layer of encapsulation that matches JavaScript. We only need to modify the API calling method to support JavaScript function calls, which can gradually support more complex matching.

Just like CSS Extra provides an extension to CSS, parsimmon (and the tokenize that calls it) provides an extension to the existing language, and all of this is optional.

@RunDevelopment
Copy link
Member

One thought regarding stack-based approaches:

I know that it's technically enough to be a powerful matching engine but it's not easy to do. Trivial cases like nested brackets are easy but I really struggled to implement JS templates.

A simplified grammar for JS templates
template   = "`" ( "\\" any | "${" expression "}" | (?! "\\" | "${" | "`" ) any )* "`" ;
comment    = (* the usual comment pattern *) ;
string     = (* the usual string pattern *) ;
expression = "(" expression ")" | "[" expression "]" | "{" expression "}"
           | string | comment | template | <rest> ;

The nesting of template and expression is really hard to do.

What I want to say is: If the implementation really is stack-based, then the programmer should not responsible for managing the stack.

quentinvernot pushed a commit to TankerHQ/prismjs that referenced this pull request Sep 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants