Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] extensible Parser: separate methods for token types #2038

Closed
wants to merge 1 commit into from

Conversation

mitranim
Copy link

@mitranim mitranim commented May 3, 2021

This branch is not yet ready for merge; see known regressions below.

Logic for different token types has been moved from switch cases to methods. The main parse loop now simply looks up the method by token type, falling back on default().

Token types are now an open set, rather than a closed set. New token types can be supported by subclassing Parser and adding corresponding methods, or by overriding default().

Consolidated the parsing logic for "top" and "inline" tokens. Removed parser.parseInline. The top-level function parseInline uses lexer.lexInline to generate an appropriate AST; additional restrictions at the parser level appear redundant.

More flexible approach to "contextual" parsing logic, such as using a different renderer or indicating "loose" mode. Now every parse method takes an additional parameter: a "context", which contains a renderer and possibly other settings. By default, such context objects are allocated lazily and no more than once per Parser. The method signatures make it possible to use context inheritance, where child contexts inherit from parent contexts, overriding only some of their properties. This is not used by default, but can be useful for advanced cases such as one described in #2033.

Benchmarks indicate no significant performance regressions or improvements. Differences appear to be within noise levels.

Known regressions that must be fixed before merging the MR:

  • No support for coalescing adjacent text nodes.
  • List "loose" mode, with paragraph wrapping, is now a placeholder with incorrect logic.
  • Fewer tests pass, presumably as a result of ↑.

Addresses #2033.


Requesting feedback on:

  • Good ways to recover the "text coalesce" functionality, where adjacent text nodes would be joined by \n.
  • Good ways to recover the feature where in loose-mode lists text would sometimes be wrapped into paragraphs.
  • Possibly anything else I've overlooked.

The reason for the above regression is that the main parse loop has been split, is no longer "aware" of token types, and has no special cases. I really want to keep it this way.

Logic for different token types has been moved from switch cases to methods.
The main parse loop now simply looks up the method by token type, falling back
on `default()`.

Token types are now an open set, rather than a closed set. New token types can
be supported by subclassing Parser and adding corresponding methods, or by
overriding `default()`.

Consolidated the parsing logic for "top" and "inline" tokens. Removed
`parser.parseInline`. The top-level function `parseInline` uses
`lexer.lexInline` to generate an appropriate AST; additional restrictions at
the parser level appear redundant.

More flexible approach to "contextual" parsing logic, such as using a different
renderer or indicating "loose" mode. Now every parse method takes an additional
parameter: a "context", which contains a renderer and possibly other settings.
By default, such context objects are allocated lazily and no more than once.
The method signatures make it possible to use context inheritance, where child
contexts inherit from parent contexts, overriding only some of their properties.
This is not used by default, but can be useful for advanced cases such as one
described in markedjs#2033.

Benchmarks indicate no significant performance regressions or improvements.
Differences appear to be within noise levels.

Known regressions that must be fixed before merging the MR:

  * No support for coalescing adjacent `text` nodes.
  * List "loose" mode, with paragraph wrapping, is now a placeholder with
    incorrect logic.
  * Fewer tests pass, presumably as a result of ↑.

Addresses [markedjs#2033](markedjs#2033).
@vercel
Copy link

vercel bot commented May 3, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/B9mki6vjUCbjahZEMcRq1rDyQu1Q
✅ Preview: https://markedjs-git-fork-mitranim-master-markedjs.vercel.app

}

// Regression; missing features:
// * Join adjacent text tokens with '\n'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add tokens and currentIndex to the context object so the text function could combine tokens the same way it does in the switch statement?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I misunderstood, sounds like it involves mutating the context. Parsing/compiling any token, even text, is potentially recursive, with nested parse loops. To avoid trampling each other's tokens and currentIndex in the context, parse loops would have to either make new context objects (slow), or stash some contextual values in the stack frame of the current parse loop, and restore them before returning from the loop (shouldn't affect performance, but looks dirty). This also creates a "communication channel" between a parse loop and the methods it's calling, though a shared mutable object. This could work, but a cleaner approach would be much appreciated.

I admit this is a self-inflicted problem. The "old" loop had the power to handle this case and worked just fine. But I also feel that making the parse loop really dumb, where it just calls one method per token, "buys" us nice extensibility through method overrides, and is at least worth exploring.

I don't entirely understand the reasoning behind this special case. Why would there be adjacent text tokens? Is this currently generated anywhere inside Marked, or supported for outside customization? Could this be handled better at the Lexer level? Could it be handled better through text.tokens?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"buys" us nice extensibility through method overrides

If the goal is just to have extensibility it seems like it would be easier to just have it check for function in the default case.

but a cleaner approach would be much appreciated

Clean code isn't really what marked is aiming for. Clean code often comes with slow performance. We try to make it as clean as possible without sacrificing performance at all. The switch statement may be ugly but it works and is as fast as it can be. Perhaps the goal of a "dumb" parse loop is the wrong goal.

@UziTech
Copy link
Member

UziTech commented Jun 20, 2021

I think #2043 fixes the issue this PR would solve

@UziTech UziTech closed this Jun 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants