diff --git a/extending.html b/extending.html index 606c2816fb..2015bb87c9 100644 --- a/extending.html +++ b/extending.html @@ -31,71 +31,328 @@

Extending Prism

Language definitions

-

Every language is defined as a set of tokens, which are expressed as regular expressions. For example, this is the language definition for CSS:

-

+	

Every language is defined as a set of tokens, which are expressed as regular expressions. For example, this is the language definition for JSON:

+

 
-	

A regular expression literal is the simplest way to express a token. An alternative way, with more options, is by using an object literal. With that notation, the regular expression describing the token would be the pattern attribute:

-
...
-'tokenname': {
-	pattern: /regex/
-}
-...
-

So far the functionality is exactly the same between the short and extended notations. However, the extended notation allows for additional options:

+

At its core, a language definition is just a JavaScript object, and a token is just an entry of the language definition. The simplest language definition is an empty object:

+
Prism.languages['some-language'] = { };
+ +

Unfortunately, an empty language definition isn't very useful, so let's add a token. The simplest way to express a token is using a regular expression literal:

+
Prism.languages['some-language'] = {
+	'token-name': /regex/,
+};
+ +

Alternatively, an object literal can also be used. With this notation, the regular expression describing the token is the pattern property of the object:

+
Prism.languages['some-language'] = {
+	'token-name': {
+		pattern: /regex/
+	},
+};
+ +

So far, the functionality is exactly the same between the regex and object notations. However, the object notation allows for additional options. More on that later.

+ +

The name of a token can theoretically be any string that is also a valid CSS class, but there are some guidelines to follow. More on that later.

+ +

Language definitions can have any number of tokens, but the name of each token must be unique:

+
Prism.languages['some-language'] = {
+	'token-1': /I love regexes!/,
+	'token-2': /regex/,
+};
+ +

Prism will match tokens against the input text one after the other, in order, and tokens cannot overlap with the matches of previous tokens. So in the above example, token-2 will not match the substring "regex" inside of matches of token-1. More on Prism's matching algorithm later.

+ +

Lastly, in many languages, there are multiple different ways of declaring the same constructs (e.g. comments, strings, ...) and sometimes it is difficult or impractical to match all of them with a single regular expression. To add multiple regular expressions for one token name, an array can be used:

+
Prism.languages['some-language'] = {
+	'token-name': [
+		/regex 1/,
+		/regex 2/,
+		{ pattern: /regex 3/ }
+	],
+};
+ +

Note: An array cannot be used in the pattern property.

+ + +

Object notation

+ +

Instead of using just plain regular expressions, Prism also supports an object notation for tokens. This notation enables the following options:

-
inside
-
This property accepts another object literal, with tokens that are allowed to be nested in this token. - This makes it easier to define certain languages. However, keep in mind that they’re slower and if coded poorly, can even result in infinite recursion. - For an example of nested tokens, check out the Markup language definition: -
- -
lookbehind
-
This option mitigates JavaScript’s lack of lookbehind. When set to true, - the first capturing group in the regex pattern is discarded when matching this token, so it effectively behaves - as if it was lookbehind. For an example of this, check out the C-like language definition, in particular the comment and class-name tokens: -
- -
rest
-
Accepts an object literal with tokens and appends them to the end of the current object literal. Useful for referring to tokens defined elsewhere. For an example where rest is useful, check the Markup definitions above.
- -
alias
-
This option can be used to define one or more aliases for the matched token. The result will be, that - the styles of the token and its aliases are combined. This can be useful, to combine the styling of a standard - token, which is already supported by most of the themes, with a semantically correct token name. The option - can be set to a string literal or an array of string literals. In the following example the token - name latex-equation is not supported by any theme, but it will be highlighted the same as a string. -
{
+		
pattern: RegExp
+
+

This is the only required option. It holds the regular expression of the token.

+
+ +
lookbehind: boolean
+
+

This option mitigates JavaScript's poor browser support for lookbehinds. When set to true, the first capturing group in the pattern regex is discarded when matching this token, so it effectively functions as a lookbehind.

+ +

For an example of this, check out how the C-like language definition finds class-name tokens:

+
Prism.languages.clike = {
+	// ...
+	'class-name': {
+		pattern: /(\b(?:class|extends|implements|instanceof|interface|new|trait)\s+)\w+/i,
+		lookbehind: true
+	}
+};
+
+ +
greedy: boolean
+
+

This option enables greedy matching for the token. For more information, see the section about the matching algorithm.

+
+ +
alias: string | string[]
+
+

This option can be used to define one or more aliases for the token. The result will be that the styles of the token name and the alias(es) are combined. This can be useful to combine the styling of a standard token, which is already supported by most of the themes, with a more precise token name. For more information on this topic, see granular highlighting.

+ +

E.g. the token name latex-equation is not supported by most themes, but it will be highlighted the same as a string in the following example:

+
Prism.languages.latex = {
+	// ...
 	'latex-equation': {
-		pattern: /\$(\\?.)*?\$/g,
+		pattern: /\$.*?\$/,
 		alias: 'string'
 	}
-}
- -
greedy
-
This is a boolean attribute. It is intended to solve a common problem with - patterns that match long strings like comments, regex or string literals. For example, - comments are parsed first, but if the string /* foo */ - appears inside a string, you would not want it to be highlighted as a comment. - The greedy-property allows a pattern to ignore previous matches of other patterns, and - overwrite them when necessary. Use this flag with restraint, as it incurs a small performance overhead. - The following example demonstrates its usage: -
'string': {
-	pattern: /(["'])(\\(?:\r\n|[\s\S])|(?!\1)[^\\\r\n])*\1/,
-	greedy: true
-}
+};
+
+ +
inside: Grammar
+
+

This option accepts another object literal, with tokens that are allowed to be nested in this token. All tokens in the inside grammar will be encapsulated by this token. This makes it easier to define certain languages.

+ +

For an example of nested tokens, check out the url token in the CSS language definition:

+
Prism.languages.css = {
+	// ...
+	'url': {
+		// e.g. url(https://example.com)
+		pattern: /\burl\(.*?\)/i,
+		inside: {
+			'function': /^url/i,
+			'punctuation': /^\(|\)$/
+		}
+	}
+};
+ +

The inside option can also be used to create recursive languages. This is useful for languages where one token can contain arbitrary expressions, e.g. languages with a string interpolation syntax.

+ +

For example, here is how JavaScript implements template string interpolation:

+
Prism.languages.javascript = {
+	// ...
+	'template-string': {
+		pattern: /`(?:\\.|\$\{[^{}]*\}|(?!\$\{)[^\\`])*`/,
+		inside: {
+			'interpolation': {
+				pattern: /\$\{[^{}]*\}/,
+				inside: {
+					'punctuation': /^\$\{|\}$/,
+					'expression': {
+						pattern: /[\s\S]+/,
+						inside: null // see below
+					}
+				}
+			}
+		}
+	}
+};
+Prism.languages.javascript['template-string'].inside['interpolation'].inside['expression'].inside = Prism.languages.javascript;
+ +

Be careful when creating recursive grammars as they might lead to infinite recursion which will cause a stack overflow.

+
-

Unless explicitly allowed through the inside property, each token cannot contain other tokens, so their order is significant. Although per the ECMAScript specification, objects are not required to have a specific ordering of their properties, in practice they do in every modern browser.

-

In most languages there are multiple different ways of declaring the same constructs (e.g. comments, strings, ...) and sometimes it is difficult or unpractical to match all of them with one single regular expression. To add multiple regular expressions for one token name an array can be used:

+

Token names

+ +

The name of a token determines the semantic meaning of the text the token matches. Tokens can capture anything from simple language constructs, like comments, to more complex ones, like template string interpolation expressions. Token names differentiate these language constructs.

+ +

A token name can theoretically be any string that is a valid CSS class name. However, in practice, it makes sense for token names to follow some rules. In Prism's code, we enforce that all token names use kebab case (foo-bar) and contain only lower-case ASCII letters, digits, and hyphen characters. E.g. class-name is allowed but Class_name is not.

+ +

Prism also defines some standard token names that should be used for most tokens.

+ +

Themes

+ +

Prism's themes assign color (and other styles) to tokens based on their name (and aliases). This means that the language definition does not control the color of tokens, themes do.

+ +

However, themes only support a limited number of known token names. If a theme does not know a particular token name, no styles will be applied. While different themes may support different token names, all themes are guaranteed to support Prism's standard tokens. Standard tokens are special token names with specific semantic meanings. They are the common ground that all language definitions and themes agree on and must follow. Standard tokens should be preferred when choosing token names.

+ +

Granular highlighting

+ +

While standard tokens should be the preferred choice, they are also quite general. This is by design as they have to apply to a large number and variety of different languages, but sometimes more fine-grained tokenization (and subsequent highlighting) is desirable.

+ +

Granular highlighting is a method of choosing token names to enable fine control for themes, while also ensuring compatibility with all themes.

+ +

Let's look at an example. Say we had a language that supported both decimal and binary literals for numbers, and we wanted to give binary numbers special highlighting. We might implement it like this:

+
Prism.languages['my-language'] = {
+	// ...
+	'number': /\b\d+(?:\.\d+)?\b/,
+	'binary-number': /\b0b[01]+\b/,
+};
+ +

But this has a problem. binary-number is not a standard token, so almost no theme is going to give binary numbers any color.

+ +

The solution to this problem is to use an alias:

+
Prism.languages['my-language'] = {
+	// ...
+	'number': /\b\d+(?:\.\d+)?\b/,
+	'binary-number': {
+		pattern: /\b0b[01]+\b/,
+		alias: 'number'
+	},
+};
+ +

Aliases allow themes to apply the styles of multiple names to one token. This means that themes that do support the binary-number token name can assign a special color, while themes that don't support it will fall back to their usual color for numbers.

+ +

This is granular highlighting: using a non-standard token name and a standard token as an alias.

+ + +

The matching algorithm

+ +

The job of Prism's matching algorithm is to produce a token stream given a language definition and some text. A token stream is Prism's representation of (partially or fully) tokenized text and is implemented as a list of strings (representing literal text) and tokens (representing tokenized text).

+ +

Note: The word "token" is ambiguous here. We use "token" to refer to both the entry of a language definition (as described in above sections) and a Token object inside a token stream. Which type of "token" is meant can be inferred from context.

+ +

The simplified token stream notation will be used in this section. Briefly, the notation uses JSON to represent a token stream. E.g. ["foo ", ["keyword", "bar"], " baz"] is the simplified token stream notation for the token stream that starts with the string foo , is followed by a token of type keyword and text bar, and ends with the string baz.

+ +

Back to the matching algorithm: Prism's matching algorithm is a hybrid with two modes: first-come, first-served (FCFS) matching and greedy matching.

+ +

FCFS matching

+ +

This is Prism's default matching mode. All tokens are matched one after the other, in order; tokens cannot overlap, and tokens cannot match text that is already matched by previous tokens.

+ +

The algorithm itself is quite simple. Let's say we wanted to tokenize the JS code max(3, 5, exp2(7)); and that function tokens had already been processed. The current token stream would be:

+
[
+	["function", "max"],
+	"(3, 5, ",
+	["function", "exp2"],
+	"(7));"
+]
+ +

Next, we would tokenize numbers with the token 'number': /[0-9]+/.

+ +

FCFS matching will go through all strings in the current token stream to find matches for the number regex. The first string is "(3, 5, ", so the match 3 is found. A new token is created for 3 and inserted into the token stream to replace the matching text. The token stream is now:

+
[
+	["function", "max"],
+	"(",
+	["number", "3"],
+	", 5, ",
+	["function", "exp2"],
+	"(7));"
+]
+ +

Now, the algorithm goes to the next string ", 5, " and finds another match. A new token is created for 5 and the token stream is now:

+
[
+	["function", "max"],
+	"(",
+	["number", "3"],
+	", ",
+	["number", "5"],
+	", ",
+	["function", "exp2"],
+	"(7));"
+]
-
...
-'tokenname': [ /regex0/, /regex1/, { pattern: /regex2/ } ]
-...
+

The next string is ", " and no matches are found. The string after that is "(7));" and a new token is created for 7: +

[
+	["function", "max"],
+	"(",
+	["number", "3"],
+	", ",
+	["number", "5"],
+	", ",
+	["function", "exp2"],
+	"(",
+	["number", "7"],
+	"));"
+]
+ +

The last string to check is "));" and no matches are found. The number token has now been processed and the algorithm will go on to process the next token in the language definition.

+ +

Notice how FCFS matching did not find the 2 in exp2. Since FCFS matching completely ignores existing tokens in the token stream, the number regex cannot see already-tokenized text. This is a very useful property. In the above example, 2 is a part of the function name exp2, so highlighting it as a number would be incorrect.

+ +

Greedy matching

+ +

Greedy matching is very similar to FCFS matching. All tokens are matched in order and tokens cannot overlap. The defining difference is that greedy tokens can match the text of previous tokens.

+ +

Let's look at an example to see why greedy matching is useful and how it works conceptually. A very simplified version of JavaScript's comment and string syntax might be implemented like this:

+
Prism.languages.javascript = {
+	'comment': /\/\/.*/,
+	'string': /'(?:\\.|[^'\\\r\n])*'/
+};
+ +

To understand why greedy matching is useful, let's look at how FCFS matching would tokenize the text 'http://example.com':

+ +

FCFS matching starts with the token stream ["'http://example.com'"] and tries to find matches for 'comment': /\/\/.*/. The match //example.com' is found and inserted into the token stream:

+
[
+	"'http:",
+	["comment", "//example.com'"]
+]
+ +

Then FCFS matching will search for matches for 'string': /'(?:\\.|[^'\\\r\n])*'/. The first string of the token stream, "'http:", does not match the string regex, so the token stream remains unchanged. The string token has now been processed and the above token stream is the final result.

+ +

Obviously, this is bad. The code 'http://example.com' is clearly just a string containing a URL, but FCFS matching doesn't understand this.

+ +

An obvious, but incorrect, fix might be to swap the order of comment and string. This would fix 'http://example.com'. However, the problem was merely moved. Comments like // it's my co-worker's code (note the two single quotes) will now be tokenized incorrectly.

+ +

This is the problem greedy matching solves. Let's make the tokens greedy and then see how this affects the result:

+
Prism.languages.javascript = {
+	'comment': {
+		pattern: /\/\/.*/,
+		greedy: true
+	},
+	'string': {
+		pattern: /'(?:\\.|[^'\\\r\n])*'/,
+		greedy: true
+	}
+};
+ +

While the actual greedy matching algorithm is quite complex and littered with subtle edge cases, its effect is quite simple: a list of greedy tokens will behave as if they were matched by a single regex. This is how greedy matching works conceptually and how you should think about greedy tokens.

+ +

This means that the greedy comment and string tokens will behave like the following language definition, except that matches will still be given the token names of the original greedy tokens:

+
Prism.languages.javascript = {
+	'comment-or-string': /\/\/.*|'(?:\\.|[^'\\\r\n])*'/
+};
+ +

In the above example, 'http://example.com' will be matched by /\/\/.*|'(?:\\.|[^'\\\r\n])*'/ completely. Since the '(?:\\.|[^'\\\r\n])*' part of the regex caused the match, a token of type string will be created and the following token stream will be produced:

+
[
+	["string", "'http://example.com'"]
+]
+ +

Similarly, the tokenization will also be correct for the // it's my co-worker's code example.

+ +

When deciding whether a token should be greedy, use the following guidelines:

+ +
  1. Most tokens are not greedy.

     Most tokens in most languages are not greedy, because they don't need to be. Typically only the comment, string, and regex literal tokens need to be greedy. All other tokens can use FCFS matching.

     Generally, a token should only be greedy if it can contain the start of another token.

  2. All tokens before a greedy token should also be greedy.

     Greedy matching works subtly differently if there are non-greedy tokens before a greedy token. This typically leads to subtle and hard-to-catch bugs that sometimes take years to uncover.

     To make sure that greedy matching works as expected, the greedy tokens should be the first tokens of a language.

  3. Greedy tokens come in groups.

     If a language definition contains only a single greedy token, then the greedy token shouldn't be greedy. As explained above, greedy matching conceptually combines the regexes of all greedy tokens into one. If there is only one greedy token, greedy matching will behave like FCFS matching.

Helper functions

Prism also provides some useful functions for creating and modifying language definitions. Prism.languages.insertBefore can be used to modify existing language definitions. Prism.languages.extend is useful when your language is very similar to another existing language.

+ + +

The rest property

+ +

The rest property in language definitions is special. Prism expects this property to be another language definition instead of a token. The tokens of the grammar in the rest property will be appended to the end of the language definition with the rest property. It can be thought of as a built-in object spread operator.

+ +

This is useful for referring to tokens defined elsewhere. However, the rest property should be used sparingly. When referencing another language, it is typically better to encapsulate the text of the language into a token and use the inside property instead.

@@ -120,7 +377,7 @@

Creating a new language definition

"owner": "Your GitHub name" } -

If your language definition depends any other languages, you have to specify this here as well by adding a "require" property. E.g. "require": "clike", or "require" : ["markup", "css"]. For more information on dependencies read the declaring dependencies section.

+

If your language definition depends on any other languages, you have to specify this here as well by adding a "require" property. E.g. "require": "clike", or "require": ["markup", "css"]. For more information on dependencies, read the declaring dependencies section.

Note: Any changes made to components.json require a rebuild (see step 3).