[WIP] extensible Parser: separate methods for token types #2038
Conversation
Logic for different token types has been moved from switch cases to methods. The main parse loop now simply looks up the method by token type, falling back on `default()`.

Token types are now an open set, rather than a closed set. New token types can be supported by subclassing Parser and adding corresponding methods, or by overriding `default()`.

Consolidated the parsing logic for "top" and "inline" tokens. Removed `parser.parseInline`. The top-level function `parseInline` uses `lexer.lexInline` to generate an appropriate AST; additional restrictions at the parser level appear redundant.

More flexible approach to "contextual" parsing logic, such as using a different renderer or indicating "loose" mode. Now every parse method takes an additional parameter: a "context", which contains a renderer and possibly other settings. By default, such context objects are allocated lazily and no more than once per Parser. The method signatures make it possible to use context inheritance, where child contexts inherit from parent contexts, overriding only some of their properties. This is not used by default, but can be useful for advanced cases such as the one described in [markedjs#2033](markedjs#2033).

Benchmarks indicate no significant performance regressions or improvements. Differences appear to be within noise levels.

Known regressions that must be fixed before merging the MR:

* No support for coalescing adjacent `text` nodes.
* List "loose" mode, with paragraph wrapping, is now a placeholder with incorrect logic.
* Fewer tests pass, presumably as a result of the above.

Addresses [markedjs#2033](markedjs#2033).
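For illustration, the dispatch described above might look roughly like this. The class shape and method names here are a sketch of the idea, not the actual code in this branch:

```javascript
// Minimal sketch of a dispatch-style parse loop. Handlers are looked up
// by token type on the Parser itself, falling back on default().
class Parser {
  parse(tokens, context = this.makeContext()) {
    let out = '';
    for (const token of tokens) {
      // Look up a handler method by token type; fall back on default().
      const handler =
        typeof this[token.type] === 'function' ? this[token.type] : this.default;
      out += handler.call(this, token, context);
    }
    return out;
  }

  // Context is allocated only when the caller doesn't supply one.
  makeContext() {
    return { renderer: { text: (t) => t } };
  }

  text(token, context) {
    return context.renderer.text(token.text);
  }

  default(token) {
    throw new Error(`Unknown token type: ${token.type}`);
  }
}

// New token types can then be supported by subclassing and adding a method:
class EmParser extends Parser {
  em(token, context) {
    return '<em>' + context.renderer.text(token.text) + '</em>';
  }
}
```

Because the loop knows nothing about token types, an override of `default()` (or a new method) is all a subclass needs to handle new or unknown tokens.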
This pull request is being automatically deployed with Vercel.

🔍 Inspect: https://vercel.com/markedjs/markedjs/B9mki6vjUCbjahZEMcRq1rDyQu1Q
```js
}

// Regression; missing features:
// * Join adjacent text tokens with '\n'.
```
Could we add `tokens` and `currentIndex` to the context object so the `text` function could combine tokens the same way it does in the switch statement?
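For concreteness, the proposal might look something like the sketch below. The `tokens` and `currentIndex` context fields follow the suggestion above; the merging logic itself is hypothetical, not marked's implementation:

```javascript
// Hypothetical text() handler that coalesces adjacent `text` tokens,
// assuming the parse loop exposes its token list and position through
// `context.tokens` and `context.currentIndex`.
function text(token, context) {
  let body = token.text;
  // Consume the following text tokens, joining with '\n' the way the
  // old switch-based loop did.
  while (context.tokens[context.currentIndex + 1]?.type === 'text') {
    context.currentIndex++; // mutates the shared loop state
    body += '\n' + context.tokens[context.currentIndex].text;
  }
  return context.renderer.text(body);
}
```

Note that advancing `context.currentIndex` from inside a handler is exactly the shared-mutable-state concern raised in the reply below it.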
Unless I misunderstood, it sounds like this involves mutating the context. Parsing/compiling any token, even `text`, is potentially recursive, with nested parse loops. To avoid trampling each other's `tokens` and `currentIndex` in the context, parse loops would have to either make new context objects (slow), or stash some contextual values in the stack frame of the current parse loop and restore them before returning from the loop (shouldn't affect performance, but looks dirty). This also creates a "communication channel" between a parse loop and the methods it's calling, through a shared mutable object. This could work, but a cleaner approach would be much appreciated.
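The stash-and-restore option described above could be sketched as follows (names and shapes are illustrative, not code from this branch):

```javascript
// Sketch of the "stash in the stack frame" approach: a parse loop saves
// the caller's shared context fields and restores them before returning,
// so nested loops don't trample each other's tokens/currentIndex.
function parseLoop(parser, tokens, context) {
  const savedTokens = context.tokens;
  const savedIndex = context.currentIndex;
  context.tokens = tokens;
  let out = '';
  try {
    for (context.currentIndex = 0;
         context.currentIndex < tokens.length;
         context.currentIndex++) {
      const token = tokens[context.currentIndex];
      const handler = parser[token.type] || parser.default;
      out += handler.call(parser, token, context);
    }
  } finally {
    // Restore before returning, even if a handler throws.
    context.tokens = savedTokens;
    context.currentIndex = savedIndex;
  }
  return out;
}
```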
I admit this is a self-inflicted problem. The "old" loop had the power to handle this case and worked just fine. But I also feel that making the parse loop really dumb, where it just calls one method per token, "buys" us nice extensibility through method overrides, and is at least worth exploring.
I don't entirely understand the reasoning behind this special case. Why would there be adjacent text tokens? Is this currently generated anywhere inside Marked, or supported for outside customization? Could this be handled better at the Lexer level? Could it be handled better through `text.tokens`?
> "buys" us nice extensibility through method overrides

If the goal is just to have extensibility, it seems like it would be easier to just have it check for a function in the `default` case.
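That alternative might look roughly like this sketch, where the switch statement stays intact and only unknown token types hit a lookup (the `extensions` registry is hypothetical, not marked's actual API):

```javascript
// Sketch of extensibility through the switch's default case alone:
// unknown token types are delegated to a registry of handler functions.
const extensions = {};

function parseToken(token, renderer) {
  switch (token.type) {
    case 'text':
      return renderer.text(token.text);
    // ...the other built-in cases stay in the fast switch...
    default:
      // Only unknown token types pay the extra lookup cost.
      if (typeof extensions[token.type] === 'function') {
        return extensions[token.type](token, renderer);
      }
      throw new Error(`Unknown token type: ${token.type}`);
  }
}
```

This keeps the hot path (built-in tokens) exactly as fast as the existing switch while still giving outside code a hook for new token types.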
> but a cleaner approach would be much appreciated
Clean code isn't really what marked is aiming for. Clean code often comes with slow performance. We try to make it as clean as possible without sacrificing performance at all. The switch statement may be ugly but it works and is as fast as it can be. Perhaps the goal of a "dumb" parse loop is the wrong goal.
I think #2043 fixes the issue this PR would solve.
This branch is not yet ready for merge; see the known regressions listed in the description.
Requesting feedback on:

* Join adjacent text tokens with `\n`.

The reason for the above regression is that the main parse loop has been split, is no longer "aware" of token types, and has no special cases. I really want to keep it this way.