Tokenizers lex their own child tokens (#2124)
BREAKING CHANGES:

- Tokenizers will create their own tokens with `this.lexer.inline(text, tokens)`. The `inline` function will queue the token creation until after all block tokens are created.
- The `nptable` tokenizer is removed and merged into the `table` tokenizer.
- The `this` object of extension tokenizers now includes the `lexer` as a property. `this.inlineTokens` becomes `this.lexer.inline`.
- The `this` object of extension renderers now includes the `parser` as a property. `this.parseInline` becomes `this.parser.parseInline`.
- `tag` and `inlineText` tokenizer function signatures have changed.
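
For illustration, a block-level extension migrates roughly like this (a minimal sketch; the `shout` extension name and `!!...!!` rule are invented for this example):

```js
// Before: the extension's tokenizer created child tokens directly.
const before = {
  name: 'shout',
  level: 'block',
  tokenizer(src) {
    const match = /^!!([^!\n]+)!!(?:\n|$)/.exec(src);
    if (match) {
      return {
        type: 'shout',
        raw: match[0],
        text: match[1].trim(),
        tokens: this.inlineTokens(match[1].trim()) // old API
      };
    }
  }
};

// After: the lexer is exposed on `this`, and child inline tokens are
// queued with this.lexer.inline until all block tokens exist.
const after = {
  name: 'shout',
  level: 'block',
  tokenizer(src) {
    const match = /^!!([^!\n]+)!!(?:\n|$)/.exec(src);
    if (match) {
      const token = {
        type: 'shout',
        raw: match[0],
        text: match[1].trim(),
        tokens: [] // populated later by the queued inline pass
      };
      this.lexer.inline(token.text, token.tokens);
      return token;
    }
  }
};
```

Renderers change in the same spirit: `this.parseInline(token.tokens)` becomes `this.parser.parseInline(token.tokens)`.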
calculuschild committed Aug 2, 2021
1 parent 20bda6e commit 288f1cb
Showing 9 changed files with 206 additions and 302 deletions.
44 changes: 25 additions & 19 deletions docs/USING_PRO.md
@@ -226,7 +226,7 @@ console.log(marked('$ latex code $\n\n` other code `'));
### Inline level tokenizer methods

- <code>**escape**(*string* src)</code>
- <code>**tag**(*string* src, *bool* inLink, *bool* inRawBlock)</code>
- <code>**tag**(*string* src)</code>
- <code>**link**(*string* src)</code>
- <code>**reflink**(*string* src, *object* links)</code>
- <code>**emStrong**(*string* src, *string* maskedSrc, *string* prevChar)</code>
@@ -235,7 +235,7 @@ console.log(marked('$ latex code $\n\n` other code `'));
- <code>**del**(*string* src)</code>
- <code>**autolink**(*string* src, *function* mangle)</code>
- <code>**url**(*string* src, *function* mangle)</code>
- <code>**inlineText**(*string* src, *bool* inRawBlock, *function* smartypants)</code>
- <code>**inlineText**(*string* src, *function* smartypants)</code>
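
Because `inLink` and `inRawBlock` are no longer passed as arguments, an overriding tokenizer reads them from the lexer's state instead. A minimal sketch, assuming an invented `:)` shortcut (returning `false` defers to the built-in tokenizer):

```js
const marked = require('marked');

marked.use({
  tokenizer: {
    inlineText(src, smartypants) {
      // The removed inRawBlock parameter now lives on this.lexer.state.
      if (this.lexer.state.inRawBlock) {
        return false; // keep default handling inside raw HTML
      }
      const match = /^:\)/.exec(src);
      if (match) {
        return { type: 'text', raw: match[0], text: '&#128578;' };
      }
      return false; // otherwise defer to the built-in tokenizer
    }
  }
});
```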

`mangle` is a method that changes text to HTML character references:
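
A sketch of one possible implementation (marked uses this to obfuscate email autolinks; the exact built-in may differ):

```js
function mangle(text) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    let ch = text.charCodeAt(i);
    // Randomly mix hex and decimal character references so the
    // address is harder for scrapers to harvest.
    if (Math.random() > 0.5) {
      ch = 'x' + ch.toString(16);
    }
    out += '&#' + ch + ';';
  }
  return out;
}
```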

@@ -331,11 +331,15 @@ The returned token can also contain any other custom parameters of your choice t
The tokenizer function has access to the lexer in the `this` object, which can be used if any internal section of the string needs to be parsed further, such as in handling any inline syntax on the text within a block token. The key functions that may be useful include:

<dl>
<dt><code><strong>this.blockTokens</strong>(<i>string</i> text)</code></dt>
<dd>Runs the block tokenizer functions (including any extensions) on the provided text, and returns an array containing a nested tree of tokens.</dd>
<dt><code><strong>this.lexer.blockTokens</strong>(<i>string</i> text, <i>array</i> tokens)</code></dt>
<dd>This runs the block tokenizer functions (including any block-level extensions) on the provided text, and appends any resulting tokens onto the <code>tokens</code> array. The <code>tokens</code> array is also returned by the function. You might use this, for example, if your extension creates a "container"-type token (such as a blockquote) that can potentially include other block-level tokens inside.</dd>

<dt><code><strong>this.inlineTokens</strong>(<i>string</i> text)</code></dt>
<dd>Runs the inline tokenizer functions (including any extensions) on the provided text, and returns an array containing a nested tree of tokens. This can be used to generate the <code>tokens</code> parameter.</dd>
<dt><code><strong>this.lexer.inline</strong>(<i>string</i> text, <i>array</i> tokens)</code></dt>
<dd>Parsing of inline-level tokens only occurs after all block-level tokens have been generated. This function adds <code>text</code> and <code>tokens</code> to a queue to be processed using inline-level tokenizers (including any inline-level extensions) at that later step. Tokens will be generated using the provided <code>text</code>, and any resulting tokens will be appended to the <code>tokens</code> array. Note that this function does <strong>NOT</strong> return anything since the inline processing cannot happen until the block-level processing is complete.</dd>

<dt><code><strong>this.lexer.inlineTokens</strong>(<i>string</i> text, <i>array</i> tokens)</code></dt>
<dd>Sometimes an inline-level token contains further nested inline tokens (such as a <code>**strong**</code> token inside of a <code>### Heading</code>). This runs the inline tokenizer functions (including any inline-level extensions) on the provided text, and appends any resulting tokens onto the <code>tokens</code> array. The <code>tokens</code> array is also returned by the function.</dd>
</dl>
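
For instance, a hypothetical container extension (the `spoiler` name and `:::` fences are invented for illustration) would use `this.lexer.blockTokens` to lex its body:

```js
const spoiler = {
  name: 'spoiler',
  level: 'block',
  tokenizer(src) {
    const match = /^:::\n([\s\S]+?)\n:::(?:\n|$)/.exec(src);
    if (match) {
      const token = {
        type: 'spoiler',
        raw: match[0],
        tokens: []
      };
      // Lex the inner text as block-level markdown; the resulting
      // tokens are appended onto token.tokens.
      this.lexer.blockTokens(match[1], token.tokens);
      return token;
    }
  }
};
```

Its renderer, sketched after the parser functions below, hands `token.tokens` back to `this.parser.parse`.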

<dt><code><strong>renderer</strong>(<i>object</i> token)</code></dt>
@@ -344,11 +348,11 @@ The tokenizer function has access to the lexer in the `this` object, which can b
The renderer function has access to the parser in the `this` object, which can be used if any part of the token needs to be parsed further, such as any child tokens. The key functions that may be useful include:

<dl>
<dt><code><strong>this.parse</strong>(<i>array</i> tokens)</code></dt>
<dd>Runs the block renderer functions (including any extensions) on the provided array of tokens, and returns the resulting HTML string output.</dd>
<dt><code><strong>this.parser.parse</strong>(<i>array</i> tokens)</code></dt>
<dd>Runs the block renderer functions (including any extensions) on the provided array of tokens, and returns the resulting HTML string output. This is used to generate the HTML from any child block-level tokens, for example if your extension is a "container"-type token (such as a blockquote) that can potentially include other block-level tokens inside.</dd>

<dt><code><strong>this.parseInline</strong>(<i>array</i> tokens)</code></dt>
<dd>Runs the inline renderer functions (including any extensions) on the provided array of tokens, and returns the resulting HTML string output. This could be used to generate text from any child tokens, for example.</dd>
<dt><code><strong>this.parser.parseInline</strong>(<i>array</i> tokens)</code></dt>
<dd>Runs the inline renderer functions (including any extensions) on the provided array of tokens, and returns the resulting HTML string output. This is used to generate the HTML from any child inline-level tokens.</dd>
</dl>
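
Completing the hypothetical `spoiler` container from the tokenizer sketch above:

```js
Object.assign(spoiler, {
  renderer(token) {
    // Hand the child block tokens back to the parser for rendering.
    return `<details>\n${this.parser.parse(token.tokens)}</details>\n`;
  }
});
```

Registered via `marked.use({ extensions: [spoiler] })`, the container can then nest arbitrary block-level markdown.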

</dd>
@@ -371,16 +375,18 @@ const descriptionlist = {
const rule = /^(?::[^:\n]+:[^:\n]*(?:\n|$))+/; // Regex for the complete token
const match = rule.exec(src);
if (match) {
return { // Token to generate
const token = { // Token to generate
type: 'descriptionList', // Should match "name" above
raw: match[0], // Text to consume from the source
text: match[0].trim(), // Additional custom properties
tokens: this.inlineTokens(match[0].trim()) // inlineTokens to process **bold**, *italics*, etc.
tokens: [] // Array where child inline tokens will be generated
};
this.lexer.inline(token.text, token.tokens); // Queue this data to be processed for inline tokens
return token;
}
},
renderer(token) {
return `<dl>${this.parseInline(token.tokens)}\n</dl>`; // parseInline to turn child tokens into HTML
return `<dl>${this.parser.parseInline(token.tokens)}\n</dl>`; // parseInline to turn child tokens into HTML
}
};

@@ -392,16 +398,16 @@ const description = {
const rule = /^:([^:\n]+):([^:\n]*)(?:\n|$)/; // Regex for the complete token
const match = rule.exec(src);
if (match) {
return { // Token to generate
type: 'description', // Should match "name" above
raw: match[0], // Text to consume from the source
dt: this.inlineTokens(match[1].trim()), // Additional custom properties
dd: this.inlineTokens(match[2].trim())
return { // Token to generate
type: 'description', // Should match "name" above
raw: match[0], // Text to consume from the source
dt: this.lexer.inlineTokens(match[1].trim()), // Additional custom properties, including
dd: this.lexer.inlineTokens(match[2].trim()) // any further-nested inline tokens
};
}
},
renderer(token) {
return `\n<dt>${this.parseInline(token.dt)}</dt><dd>${this.parseInline(token.dd)}</dd>`;
return `\n<dt>${this.parser.parseInline(token.dt)}</dt><dd>${this.parser.parseInline(token.dd)}</dd>`;
},
childTokens: ['dt', 'dd'], // Any child tokens to be visited by walkTokens
walkTokens(token) { // Post-processing on the completed token tree
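
The two extensions above are then registered together; a usage sketch:

```js
marked.use({ extensions: [descriptionlist, description] });

console.log(marked(':Topic 1: Description 1\n:**Topic 2**: *Description 2*'));
```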
134 changes: 34 additions & 100 deletions src/Lexer.js
@@ -55,6 +55,13 @@ module.exports = class Lexer {
this.options.tokenizer = this.options.tokenizer || new Tokenizer();
this.tokenizer = this.options.tokenizer;
this.tokenizer.options = this.options;
this.tokenizer.lexer = this;
this.inlineQueue = [];
this.state = {
inLink: false,
inRawBlock: false,
top: true
};

const rules = {
block: block.normal,
@@ -109,27 +116,30 @@
.replace(/\r\n|\r/g, '\n')
.replace(/\t/g, ' ');

this.blockTokens(src, this.tokens, true);
this.blockTokens(src, this.tokens);

this.inline(this.tokens);
let next;
while (next = this.inlineQueue.shift()) {
this.inlineTokens(next.src, next.tokens);
}

return this.tokens;
}

/**
* Lexing
*/
blockTokens(src, tokens = [], top = true) {
blockTokens(src, tokens = []) {
if (this.options.pedantic) {
src = src.replace(/^ +$/gm, '');
}
let token, i, l, lastToken, cutSrc, lastParagraphClipped;
let token, lastToken, cutSrc, lastParagraphClipped;

while (src) {
if (this.options.extensions
&& this.options.extensions.block
&& this.options.extensions.block.some((extTokenizer) => {
if (token = extTokenizer.call(this, src, tokens)) {
if (token = extTokenizer.call({ lexer: this }, src, tokens)) {
src = src.substring(token.raw.length);
tokens.push(token);
return true;
@@ -156,6 +166,8 @@
if (lastToken && lastToken.type === 'paragraph') {
lastToken.raw += '\n' + token.raw;
lastToken.text += '\n' + token.text;
this.inlineQueue.pop();
this.inlineQueue[this.inlineQueue.length - 1].src = lastToken.text;
} else {
tokens.push(token);
}
@@ -176,13 +188,6 @@
continue;
}

// table no leading pipe (gfm)
if (token = this.tokenizer.nptable(src)) {
src = src.substring(token.raw.length);
tokens.push(token);
continue;
}

// hr
if (token = this.tokenizer.hr(src)) {
src = src.substring(token.raw.length);
Expand All @@ -193,18 +198,13 @@ module.exports = class Lexer {
// blockquote
if (token = this.tokenizer.blockquote(src)) {
src = src.substring(token.raw.length);
token.tokens = this.blockTokens(token.text, [], top);
tokens.push(token);
continue;
}

// list
if (token = this.tokenizer.list(src)) {
src = src.substring(token.raw.length);
l = token.items.length;
for (i = 0; i < l; i++) {
token.items[i].tokens = this.blockTokens(token.items[i].text, [], false);
}
tokens.push(token);
continue;
}
@@ -217,7 +217,7 @@
}

// def
if (top && (token = this.tokenizer.def(src))) {
if (this.state.top && (token = this.tokenizer.def(src))) {
src = src.substring(token.raw.length);
if (!this.tokens.links[token.tag]) {
this.tokens.links[token.tag] = {
@@ -250,18 +250,20 @@
const tempSrc = src.slice(1);
let tempStart;
this.options.extensions.startBlock.forEach(function(getStartIndex) {
tempStart = getStartIndex.call(this, tempSrc);
tempStart = getStartIndex.call({ lexer: this }, tempSrc);
if (typeof tempStart === 'number' && tempStart >= 0) { startIndex = Math.min(startIndex, tempStart); }
});
if (startIndex < Infinity && startIndex >= 0) {
cutSrc = src.substring(0, startIndex + 1);
}
}
if (top && (token = this.tokenizer.paragraph(cutSrc))) {
if (this.state.top && (token = this.tokenizer.paragraph(cutSrc))) {
lastToken = tokens[tokens.length - 1];
if (lastParagraphClipped && lastToken.type === 'paragraph') {
lastToken.raw += '\n' + token.raw;
lastToken.text += '\n' + token.text;
this.inlineQueue.pop();
this.inlineQueue[this.inlineQueue.length - 1].src = lastToken.text;
} else {
tokens.push(token);
}
@@ -277,6 +279,8 @@
if (lastToken && lastToken.type === 'text') {
lastToken.raw += '\n' + token.raw;
lastToken.text += '\n' + token.text;
this.inlineQueue.pop();
this.inlineQueue[this.inlineQueue.length - 1].src = lastToken.text;
} else {
tokens.push(token);
}
@@ -294,78 +298,18 @@
}
}

this.state.top = true;
return tokens;
}

inline(tokens) {
let i,
j,
k,
l2,
row,
token;

const l = tokens.length;
for (i = 0; i < l; i++) {
token = tokens[i];
switch (token.type) {
case 'paragraph':
case 'text':
case 'heading': {
token.tokens = [];
this.inlineTokens(token.text, token.tokens);
break;
}
case 'table': {
token.tokens = {
header: [],
cells: []
};

// header
l2 = token.header.length;
for (j = 0; j < l2; j++) {
token.tokens.header[j] = [];
this.inlineTokens(token.header[j], token.tokens.header[j]);
}

// cells
l2 = token.cells.length;
for (j = 0; j < l2; j++) {
row = token.cells[j];
token.tokens.cells[j] = [];
for (k = 0; k < row.length; k++) {
token.tokens.cells[j][k] = [];
this.inlineTokens(row[k], token.tokens.cells[j][k]);
}
}

break;
}
case 'blockquote': {
this.inline(token.tokens);
break;
}
case 'list': {
l2 = token.items.length;
for (j = 0; j < l2; j++) {
this.inline(token.items[j].tokens);
}
break;
}
default: {
// do nothing
}
}
}

return tokens;
inline(src, tokens) {
this.inlineQueue.push({ src, tokens });
}

/**
* Lexing/Compiling
*/
inlineTokens(src, tokens = [], inLink = false, inRawBlock = false) {
inlineTokens(src, tokens = []) {
let token, lastToken, cutSrc;

// String with links masked to avoid interference with em and strong
@@ -404,7 +348,7 @@
if (this.options.extensions
&& this.options.extensions.inline
&& this.options.extensions.inline.some((extTokenizer) => {
if (token = extTokenizer.call(this, src, tokens)) {
if (token = extTokenizer.call({ lexer: this }, src, tokens)) {
src = src.substring(token.raw.length);
tokens.push(token);
return true;
@@ -422,10 +366,8 @@
}

// tag
if (token = this.tokenizer.tag(src, inLink, inRawBlock)) {
if (token = this.tokenizer.tag(src)) {
src = src.substring(token.raw.length);
inLink = token.inLink;
inRawBlock = token.inRawBlock;
lastToken = tokens[tokens.length - 1];
if (lastToken && token.type === 'text' && lastToken.type === 'text') {
lastToken.raw += token.raw;
@@ -439,9 +381,6 @@
// link
if (token = this.tokenizer.link(src)) {
src = src.substring(token.raw.length);
if (token.type === 'link') {
token.tokens = this.inlineTokens(token.text, [], true, inRawBlock);
}
tokens.push(token);
continue;
}
@@ -450,10 +389,7 @@
if (token = this.tokenizer.reflink(src, this.tokens.links)) {
src = src.substring(token.raw.length);
lastToken = tokens[tokens.length - 1];
if (token.type === 'link') {
token.tokens = this.inlineTokens(token.text, [], true, inRawBlock);
tokens.push(token);
} else if (lastToken && token.type === 'text' && lastToken.type === 'text') {
if (lastToken && token.type === 'text' && lastToken.type === 'text') {
lastToken.raw += token.raw;
lastToken.text += token.text;
} else {
Expand All @@ -465,7 +401,6 @@ module.exports = class Lexer {
// em & strong
if (token = this.tokenizer.emStrong(src, maskedSrc, prevChar)) {
src = src.substring(token.raw.length);
token.tokens = this.inlineTokens(token.text, [], inLink, inRawBlock);
tokens.push(token);
continue;
}
@@ -487,7 +422,6 @@
// del (gfm)
if (token = this.tokenizer.del(src)) {
src = src.substring(token.raw.length);
token.tokens = this.inlineTokens(token.text, [], inLink, inRawBlock);
tokens.push(token);
continue;
}
@@ -500,7 +434,7 @@
}

// url (gfm)
if (!inLink && (token = this.tokenizer.url(src, mangle))) {
if (!this.state.inLink && (token = this.tokenizer.url(src, mangle))) {
src = src.substring(token.raw.length);
tokens.push(token);
continue;
@@ -514,14 +448,14 @@
const tempSrc = src.slice(1);
let tempStart;
this.options.extensions.startInline.forEach(function(getStartIndex) {
tempStart = getStartIndex.call(this, tempSrc);
tempStart = getStartIndex.call({ lexer: this }, tempSrc);
if (typeof tempStart === 'number' && tempStart >= 0) { startIndex = Math.min(startIndex, tempStart); }
});
if (startIndex < Infinity && startIndex >= 0) {
cutSrc = src.substring(0, startIndex + 1);
}
}
if (token = this.tokenizer.inlineText(cutSrc, inRawBlock, smartypants)) {
if (token = this.tokenizer.inlineText(cutSrc, smartypants)) {
src = src.substring(token.raw.length);
if (token.raw.slice(-1) !== '_') { // Track prevChar before string of ____ started
prevChar = token.raw.slice(-1);
