New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSS Extra: (1) rename attr-name
, attr-value
(2) change pseudo-class
token pattern (3) add tag
, selector-list
, combinator
token
#2373
Conversation
…g detection, and added corresponding unit test.
… in `inside` object, Fix relational tests.
… `greedy` property to `true`
…ass` token (3) add `tag`, `selector-list`, `combinator` token
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
That's good. In the original PR, I intentionally didn't choose these classes because they are colored by almost every theme and Lea once said that this might not be optimal.
Thinking about this some more, I also think that we should make itattr-name
andattr-value
because these classes are the ones used by markup. -
I don't like it. Matching an expression of nested brackets should be avoided wherever possible for a few reasons. The first reason is that it's not possible for any number of nested brackets (see below) but the main problem is that you have to know what's inside the brackets. I.e. there could be a pseudo-class like this
:not(span[attr=")"])
. You would also have to keep track of comments and strings inside every level of nested brackets. This can be done but the resulting pattern will probably be around 300~500 characters long.The only advantage of your approach is that you capture the nested structure of pseudo-classes. I also wanted to implement this at some point but decided against it for the above reasons. The current solution is simple and doesn't have any of the above problems.
-
See comments.
To pass it, I have to add
\([^\(]*\([^\)]*\)[^\)
to match two brackets which are nested. It already have something incorrect, can't match three brackets, and more. After all the regex's length can't be infinity.
I have no idea to solve it, could you have better solution?
There is no better solution. Matching any number of nested brackets requires a context-free grammar. The best you can do with regular expressions is to support up to a finite number of nested levels. But if you do that, then the length (and complexity) of your regex will grow in O(n) (or O(2^n), depends) with the number of nestings.
The Table layout and grid layout both have the concept of Let's take a quick look at these two elements through a simple example. For example, the HTML code of the table is as follows: <table border="1" width="600">
<colgroup>
<col span="1">
<col span="2" class="ancestor">
<col span="2" class="brother">
</colgroup>
<tr>
<td></td>
<th scope="coi">Descendant combinator</th>
<th scope="coi">Child combinator</th>
<th scope="coi">Adjacent sibling combinator</th>
<th scope="coi">General sibling combinator</th>
</tr>
<tr>
<th scope="row">example</th>
<td>.foo .bar {}</td>
<td>.foo > .bar {}</td>
<td>.foo + .bar {}</td>
<td>.foo ~ .bar {}</td>
</tr>
</table> It can be seen that the table has 5 columns. Among them, there are 3 .ancestor {
background-color: dodgerblue;
}
.brother {
background-color: skyblue;
} The result is: But sometimes our cell does not belong to exactly which column, but cross-column, at this time, the <table border="1" width="200">
<colgroup>
<col span="2">
<col class="selected">
</colgroup>
<tbody>
<tr>
<td>A</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td colspan="2">D</td>
<td>E</td>
</tr>
<tr>
<td>F</td>
<td colspan="2">G</td>
</tr>
</tbody>
</table> col.selected {
background-color: skyblue;
} At this time, only cell C and E have a sky blue background color. Although cell G also covers the third column, it is also ignored because it also belongs to the second column. The effect is as follows: So there is a problem. In many cases, we just want cell G to have a background color. As long as the column is included, it is considered to be the target object. In response to this demand, column combinator came into being. The column combinator is written as a double pipe ( With the following CSS selector, we can make cell G also have a background color: .col.selected || td {
background-color: skyblue;
} The meaning of So, you can see the effect shown below(in future): The above result was achieved with inline-style, because there is no browser support it now. |
I want to highlight the If add prefix, such as If the CSS can't match I want each word of code can find its own token, and be highlighted. |
Yes, I will change it. the |
Shall we use JavaScript instead of regex to deal with this situation? |
…`tag` behind the `n-th` token to avoid problem.
But after add I read the RFC docs, found its legal value are so many that it is difficult to list them, so please allow me to be lazy and use the value in this case as a |
Put the |
(You know that you can replay to comments directly, right?)
How?
That's a problem. I don't see any easy solution for this one. Our CSS definition would have to know that argument type of every pseudo-class. Unfortunately |
Yes, next time.
I see Web Components and know the rule of tag name. Consider the narrow meaning of
In .NET Framework, the regex has a more strong function which named Balanced Constructs. It can help me match the nested charaters. Although the regex engine of JavaScript can't support it, we can do it in pure JavaScript in its thought. Let me describe the general idea: We use stack as data structure, then scan the code from beginning to end, when matched Implementing this idea needs to consider a large number of situations and designing various tests. But if we do it, it can finally be made. An extra question is I don't know how to replace the regex to JavaScript function in tokenize. |
…est every character.
I really think it's best just to drop The PDA proposal is really interesting. We actually have a few languages that have exactly this problem and up until now we just worked around using these regex approximations I described above. However! I still think that CSS doesn't need it.
|
I accept your point that do not add defective code before have found a perfect solution. Finally I removed the Until now, the title which contains three parts of change, the first one is kept, the second one is revoked, the last one modified some code and removed |
Thank you for contributing! Regarding the nested matching: I opened an issue to a library for this to Prism. I'm currently trying to figure out how to implement this, so if you know something, please point me in the right direction. |
Two years ago, I used such parsimmon to make a syntax parser for a SQL-Like language, I think it can also help us to parse the nested and even recursive syntax. Next I try to use this library to achieve the matching of complete pseudo-classes. In order to do this , first of all, I should get this library and use it, such as npm or cdn, to make it easy, I choose JSRun, which similar as JSFiddle, but this site is located in China, and the visit is relatively fast and more language advantages for me. After importing // These property can be used in following
const {seq, alt, regex, string, lazy, succeed} = Parsimmon; const pseudoClass = regex(/:[\w-]+/); It can match pseudoClass.parse(':hover') // {status: true, value: ':hover'} To match many pseudoClass such as const pseudoClass = regex(/:[\w-]+/)
.many();
pseudoClass.parse(':hover:active') // {status: true, value: [':hover', ':active']} To make the result similar to Prism, we can add (The const pseudoClass = regex(/:[\w-]+/)
.map(token => ({type: 'pseudo-class', content: token}))
.many();
pseudoClass.parse(':hover:active').value // [{type: 'pseudo-class', content: ':hover'}, {type: 'pseudo-class', content: ':active'}] Now considering bracket:
const pseudoClass = seq(
regex(/:[\w-]+/),
string('('),
string(')')
)
.map(token => ({type: 'pseudo-class', content: token}))
.many();
pseudoClass.parse(':hover()').value // [{type: 'pseudo-class', content: [':hover', '(', ')']}] Until now, function opt(parser) {
// use `null` to match a empty result
return parser.or(succeed(null))
}
const pseudoClass = seq(
regex(/:[\w-]+/),
opt(
seq(
string('('),
string(')')
)
)
).map(token => ({type: 'pseudo-class', content: token})).many();
pseudoClass.parse(':hover():hover').value
// [{type: 'pseudo-class', content: [':hover', ['(', ')']}, {type: 'pseudo-class', content: [':hover', null]}] Then match the content between brackets, at first consider const pseudoClass = seq(
regex(/:[\w-]+/),
opt(
seq(
string('('),
alt(
regex(/#[\w-]+/),
regex(/\.[\w-]+/),
regex(/[\w-]+/)
),
string(')')
)
)
).map(token => ({type: 'pseudo-class', content: token})).many();
pseudoClass.parse(':not(tag):not(.class):not(#id)').value
// [
// {type: 'pseudo-class', content: [':not', ['(', 'tag', ')']]},
// {type: 'pseudo-class', content: [':not', ['(', '.class', ')']]},
// {type: 'pseudo-class', content: [':not', ['(', '#id', ')']]},
// ] At last, it is time to deal with the nesting problem, like the const pseudoClass = lazy(() => seq(
regex(/:[\w-]+/),
opt(
seq(
string('('),
opt(
alt(
regex(/#[\w-]+/),
regex(/\.[\w-]+/),
regex(/[\w-]+/),
).many()
),
opt(pseudoClass),
string(')')
)
)
).map(token => ({type: 'pseudo-class', content: token})).many());
pseudoClass.parse(':where(p:not(.class))').value
// [
// {
// type: 'pseudo-class',
// content: [
// ':where',
// [
// '(',
// ['p'],
// [
// type: 'pseudo-classs',
// content: [
// ':not',
// [
// '(',
// ['.class'],
// [],
// ')'
// ]
// ]
// ],
// ')',
// ]
// ]
// }
// ] So far we have basically implemented a pseudo-class matcher that supports recursion. Of course, there are some other unfinished parts. For example, the array of returned results should be processed by map to get better results. The selector only considers the Using a similar method, we can build a more powerful matching engine than pure regular expressions, to match what regex cannot do. |
The main goal shouldn't be to replace the current matching engine. One of Prism's strongest points is that it's easy to create new languages. If you can write some regexes, then you will be able to make a Prism language. (And we certainly don't want to rewrite 200+ languages.) But we should still be able to use There are also a few problems with using external libraries:
That being said, I can see why we'd want to use an external library. I quickly implemented a PDA by changing an NFA implementation of mine and it's not easy. 1k lines of code we'd have to maintain. |
My opinion is not replace, but supplement. Pure regex can't match nested recursion definition, but JavaScript can.
Before this, to match the pseudo-class, I consider using JavaScript as a supplement to regex. Parsimmon is a layer of encapsulation that matches JavaScript. We only need to modify the API calling method to support JavaScript function calls, which can gradually support more complex matching. Just like CSS Extra provides an extension to CSS, parsimmon (and the tokenize that calls it) provides an extension to the existing language, and all of this is optional. |
One thought regarding stack-based approaches: I know that it's technically enough to be a powerful matching engine but it's not easy to do. Trivial cases like nested brackets are easy but I really struggled to implement JS templates. A simplified grammar for JS templatestemplate = "`" ( "\\" any | "${" expression "}" | (?! "\\" | "${" | "`" ) any )* "`" ;
comment = (* the usual comment pattern *) ;
string = (* the usual string pattern *) ;
expression = "(" expression ")" | "[" expression "]" | "{" expression "}"
| string | comment | template | <rest> ; The nesting of What I want to say is: If the implementation really is stack-based, then the programmer should not responsible for managing the stack. |
…ombinators and selector lists (PrismJS#2373)
As the title means, this PR includes three parts.
rename
attr-name
,attr-value
inattribute
property in CSS extra tokenize.The previous property name of
attr-name
isattribute
, which looks as same as the wrapperattribute
property, so I changed it. The another reason, which is also the reason for renamingvalue
toattr-value
is I reviewed the theme CSS, found they haveattr-name
andattr-value
. The name of token would better to keep up with the CSS selector.By the way, when I created the first PR, I said that
id
in pug is different from the CSS and CSS-extern language, but now I changed my thought, the pug's language(especially the selector) is similar like the CSS, and after I reviewed the theme CSS and extra languages' tokenize, I found thatid
andclass
name can't be conflict, so I consider to change it to previous name, which isid
,class
rather thanattr-id
,attr-class
.change
pseudo-class
token pattern.This is so difficult that I try to do it.
In my opinion, CSS pseudo class should contain the class name,
(
, inside content and)
. so at first I change it from/:[-\w]+
to/:[-\w]+(\(.*?\))?/
, but it can't pass the:where(p:not(.class))
test, the regex is not good at handling pairing and nested symbols, such as(
and)
.To pass it, I have to add
\([^\(]*\([^\)]*\)[^\)
to match two brackets which are nested. It already have something incorrect, can't match three brackets, and more. After all the regex's length can't be infinity.I have no idea to solve it, could you have better solution?
In my opinion, the three or more pseudo class with brackets nested is almost hard to see, it is possibly that two is enough?
Inside the
pseudo-class
token, I add all the property inselectorInside
excepttag
andpseudo-class
itself.pseudo-class
could be infinite loop andtag
could match the wrong token (becausepseudo-class
isn't exist,tag
will match thehover
in:hover
, the result is incorrect).add
tag
,selector-list
,combinator
tokenThis is simple. I have add enough tests to improve their correctness.