[WIP] extensible Parser: separate methods for token types #2038

mitranim · 2021-05-03T08:51:54Z

This branch is not yet ready for merge; see known regressions below.

Logic for different token types has been moved from switch cases to methods. The main parse loop now simply looks up the method by token type, falling back on default().

Token types are now an open set, rather than a closed set. New token types can be supported by subclassing Parser and adding corresponding methods, or by overriding default().

Consolidated the parsing logic for "top" and "inline" tokens. Removed parser.parseInline. The top-level function parseInline uses lexer.lexInline to generate an appropriate AST; additional restrictions at the parser level appear redundant.

More flexible approach to "contextual" parsing logic, such as using a different renderer or indicating "loose" mode. Now every parse method takes an additional parameter: a "context", which contains a renderer and possibly other settings. By default, such context objects are allocated lazily and no more than once per Parser. The method signatures make it possible to use context inheritance, where child contexts inherit from parent contexts, overriding only some of their properties. This is not used by default, but can be useful for advanced cases such as one described in #2033.

Benchmarks indicate no significant performance regressions or improvements. Differences appear to be within noise levels.

Known regressions that must be fixed before merging the MR:

No support for coalescing adjacent text nodes.
List "loose" mode, with paragraph wrapping, is now a placeholder with incorrect logic.
Fewer tests pass, presumably as a result of ↑.

Addresses #2033.

Requesting feedback on:

Good ways to recover the "text coalesce" functionality, where adjacent text nodes would be joined by \n.
Good ways to recover the feature where in loose-mode lists text would sometimes be wrapped into paragraphs.
Possibly anything else I've overlooked.

The reason for the above regression is that the main parse loop has been split, is no longer "aware" of token types, and has no special cases. I really want to keep it this way.

Logic for different token types has been moved from switch cases to methods. The main parse loop now simply looks up the method by token type, falling back on `default()`. Token types are now an open set, rather than a closed set. New token types can be supported by subclassing Parser and adding corresponding methods, or by overriding `default()`. Consolidated the parsing logic for "top" and "inline" tokens. Removed `parser.parseInline`. The top-level function `parseInline` uses `lexer.lexInline` to generate an appropriate AST; additional restrictions at the parser level appear redundant. More flexible approach to "contextual" parsing logic, such as using a different renderer or indicating "loose" mode. Now every parse method takes an additional parameter: a "context", which contains a renderer and possibly other settings. By default, such context objects are allocated lazily and no more than once. The method signatures make it possible to use context inheritance, where child contexts inherit from parent contexts, overriding only some of their properties. This is not used by default, but can be useful for advanced cases such as one described in markedjs#2033. Benchmarks indicate no significant performance regressions or improvements. Differences appear to be within noise levels. Known regressions that must be fixed before merging the MR: * No support for coalescing adjacent `text` nodes. * List "loose" mode, with paragraph wrapping, is now a placeholder with incorrect logic. * Fewer tests pass, presumably as a result of ↑. Addresses [markedjs#2033](markedjs#2033).

vercel · 2021-05-03T08:51:59Z

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/B9mki6vjUCbjahZEMcRq1rDyQu1Q
✅ Preview: https://markedjs-git-fork-mitranim-master-markedjs.vercel.app

UziTech · 2021-05-03T15:17:33Z

src/Parser.js

+  }
+
+  // Regression; missing features:
+  //   * Join adjacent text tokens with '\n'.


Could we add tokens and currentIndex to the context object so the text function could combine tokens the same way it does in the switch statement?

Unless I misunderstood, sounds like it involves mutating the context. Parsing/compiling any token, even text, is potentially recursive, with nested parse loops. To avoid trampling each other's tokens and currentIndex in the context, parse loops would have to either make new context objects (slow), or stash some contextual values in the stack frame of the current parse loop, and restore them before returning from the loop (shouldn't affect performance, but looks dirty). This also creates a "communication channel" between a parse loop and the methods it's calling, though a shared mutable object. This could work, but a cleaner approach would be much appreciated.

I admit this is a self-inflicted problem. The "old" loop had the power to handle this case and worked just fine. But I also feel that making the parse loop really dumb, where it just calls one method per token, "buys" us nice extensibility through method overrides, and is at least worth exploring.

I don't entirely understand the reasoning behind this special case. Why would there be adjacent text tokens? Is this currently generated anywhere inside Marked, or supported for outside customization? Could this be handled better at the Lexer level? Could it be handled better through text.tokens?

"buys" us nice extensibility through method overrides

If the goal is just to have extensibility it seems like it would be easier to just have it check for function in the default case.

but a cleaner approach would be much appreciated

Clean code isn't really what marked is aiming for. Clean code often comes with slow performance. We try to make it as clean as possible without sacrificing performance at all. The switch statement may be ugly but it works and is as fast as it can be. Perhaps the goal of a "dumb" parse loop is the wrong goal.

UziTech · 2021-06-20T19:43:06Z

I think #2043 fixes the issue this PR would solve

vercel bot deployed to Preview May 3, 2021 08:51 View deployment

UziTech reviewed May 3, 2021

View reviewed changes

UziTech mentioned this pull request May 3, 2021

Pass whole token to renderer to increase customization possibilities #2039

Closed

UziTech closed this Jun 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] extensible Parser: separate methods for token types #2038

[WIP] extensible Parser: separate methods for token types #2038

mitranim commented May 3, 2021 •

edited

vercel bot commented May 3, 2021 •

edited

UziTech May 3, 2021

mitranim May 4, 2021

UziTech May 4, 2021

UziTech commented Jun 20, 2021

[WIP] extensible Parser: separate methods for token types #2038

[WIP] extensible Parser: separate methods for token types #2038

Conversation

mitranim commented May 3, 2021 • edited

vercel bot commented May 3, 2021 • edited

UziTech May 3, 2021

Choose a reason for hiding this comment

mitranim May 4, 2021

Choose a reason for hiding this comment

UziTech May 4, 2021

Choose a reason for hiding this comment

UziTech commented Jun 20, 2021

mitranim commented May 3, 2021 •

edited

vercel bot commented May 3, 2021 •

edited