Stable v2 release (API changes) #108

Open
10 of 19 tasks
alecthomas opened this issue Sep 7, 2020 · 9 comments

Comments

@alecthomas
Owner

alecthomas commented Sep 7, 2020

Now that Participle has proven its initial concept, I think it's time to clean up the API. This will be a backwards incompatible change.

Work has started in the v1 branch.

  • Consolidate on Stateful lexer (1444519)
  • Optimise performance of the lexer (Runelookup to avoid testing regexps that have no chance to match #111)
  • Make specifying filename explicit. This removes confusion and ambiguity. (cf6162a)
  • Get rid of unquoting hacks in text/scanner lexer. (4f53af9)
  • Clean up error functions. (895f942)
  • Eliminate internal unquoting and single quote munging from text/scanner based lexer. (4f53af9)
  • Extend the concept of Pos/EndPos to support capturing the full range of tokens the node matched, including Elide()ed tokens. (2ace05e)
  • Refactor Mapper to eliminate the need for DropToken. (f82f615)
  • Capture directly into fields of type lexer.Token and []lexer.Token. (3b1f151)

Maybe:

  • Extend participle.Elide() support so that elided tokens can be captured explicitly by name (but also see next point).
  • Support streaming tokens from an io.Reader - currently the full input text is read.
    • Refactor PeekingLexer so it doesn't consume all tokens up front.

Once the API is stable, some additional changes would be welcome:

  • Optimise the parser.
  • Code generation for lexing (e2b420f).
  • Code generation for parsing.
  • Improve error reporting.
  • Error tolerant parsing.
  • LSP support? Can this be generalised?
  • Generate syntax definition files for Textmate etc.?!

Regarding streaming, I'm not convinced this is worth the considerable extra complexity it will add to the implementation. For comparison, pigeon also does not support streaming.

Additionally, to support the ability to capture raw tokens into the AST, participle will need to potentially buffer all tokens anyway, effectively eliminating the usefulness of streaming. It also vastly increases the complexity of the lexers, requiring three paths (io.Reader, string and []byte), PeekingLexer, etc.

This increased complexity is mainly due to lookahead branching: the lexer would need an implementation similar to the rewinder RuneReader code (https://play.golang.org/p/uZQySClYrxR). For each branch, the state of the lexer has to be stored, and as each branch progresses it also needs to preserve any newly buffered tokens, so that the parent remains consistent if the branch is not accepted.

There's also a non-trivial amount of overhead introduced for reading each token, as opposed to the current PeekingLexer which is just an array index.
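
To illustrate why a fully buffered lexer keeps lookahead cheap, here is a minimal sketch using only the standard library. The `Token`, `peeker`, `try`, and `matchKinds` names are hypothetical and are not Participle's API; the sketch only assumes that all tokens have already been lexed into a slice, so saving and restoring lexer state for a branch is a single integer copy.

```go
package main

import "fmt"

// Token is a hypothetical minimal token: a kind plus its text.
type Token struct {
	Kind  string
	Value string
}

// peeker is an illustrative stand-in for a fully buffered peeking lexer.
// All tokens are lexed up front, so the whole lexer state is one integer.
type peeker struct {
	tokens []Token
	cursor int
}

// Peek returns the current token without consuming it.
func (p *peeker) Peek() Token {
	if p.cursor >= len(p.tokens) {
		return Token{Kind: "EOF"}
	}
	return p.tokens[p.cursor]
}

// Next consumes and returns the current token.
func (p *peeker) Next() Token {
	t := p.Peek()
	if p.cursor < len(p.tokens) {
		p.cursor++
	}
	return t
}

// try runs a lookahead branch and rewinds the cursor if it is rejected.
func (p *peeker) try(branch func(*peeker) bool) bool {
	saved := p.cursor
	if branch(p) {
		return true
	}
	p.cursor = saved
	return false
}

// matchKinds returns a branch that accepts the given sequence of kinds.
func matchKinds(kinds ...string) func(*peeker) bool {
	return func(p *peeker) bool {
		for _, k := range kinds {
			if p.Next().Kind != k {
				return false
			}
		}
		return true
	}
}

func main() {
	p := &peeker{tokens: []Token{{"Ident", "x"}, {"Op", "="}, {"Int", "1"}}}
	// The first branch is rejected and the cursor is restored.
	fmt.Println(p.try(matchKinds("Ident", "Op", "Ident")), p.cursor)
	// The second branch is accepted and the cursor advances.
	fmt.Println(p.try(matchKinds("Ident", "Op", "Int")), p.cursor)
}
```

A streaming lexer could not rewind this way: each branch would also have to retain the tokens it pulled from the reader, which is the extra bookkeeping described above.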

@ceymard
Contributor

ceymard commented Sep 7, 2020

Alright, feature request time

  • Find a way to get the original text of a match
  • Allow Tokens to be requested even though they're marked as elided
  • Have Lexer work over a Reader (with buffering) to allow for parsing huge files

@alecthomas alecthomas changed the title Stable v1 API changes Stable v1 release (API changes) Sep 7, 2020
@alecthomas alecthomas modified the milestone: v1 Sep 7, 2020
@hinshun
Sponsor Contributor

hinshun commented Sep 8, 2020

@ceymard Perhaps done better in participle, but currently we use io.TeeReader before we pass into the participle parser to keep the original text. We use this to construct error reporting and source mapping:

@ceymard
Contributor

ceymard commented Sep 8, 2020

@hinshun I'm doing something similar at the moment; I just wish there were an easy way to get a match, without having to resort to that kind of trick.

alecthomas added a commit that referenced this issue Sep 18, 2020
This speeds up parsing by 5-10%:

    benchmark                        old ns/op     new ns/op     delta
    BenchmarkEBNFParser-12           143589        129605        -9.74%
    BenchmarkParser-12               395397        375403        -5.06%
    BenchmarkParticipleThrift-12     202280        191766        -5.20%
    BenchmarkParser-12               7724639       7114586       -7.90%

See #108.
alecthomas added a commit that referenced this issue Sep 20, 2020
This includes tokens elided by Elide(), but not tokens elided by the
Lexer.

See #108.
@alecthomas
Owner Author

This functionality is now included natively. Any node with a field Tokens []lexer.Token will now be populated with the full set of tokens used to parse that node. There's an example in the tests here.

@alecthomas
Owner Author

You can also now capture directly into a field of type lexer.Token rather than string (for example).

@hinshun
Sponsor Contributor

hinshun commented Sep 21, 2020

Do the Tokens include the ones from nested structs if the nested structs also have Tokens []lexer.Token?

@alecthomas
Owner Author

Yes they do.

@ceymard
Contributor

ceymard commented Sep 21, 2020

Do they include the elided ones as well?

@alecthomas
Owner Author

Yep!

alecthomas added a commit that referenced this issue Nov 26, 2020
This speeds up parsing by 5-10%:

    benchmark                        old ns/op     new ns/op     delta
    BenchmarkEBNFParser-12           143589        129605        -9.74%
    BenchmarkParser-12               395397        375403        -5.06%
    BenchmarkParticipleThrift-12     202280        191766        -5.20%
    BenchmarkParser-12               7724639       7114586       -7.90%

See #108.
alecthomas added a commit that referenced this issue Nov 26, 2020
This includes tokens elided by Elide(), but not tokens elided by the
Lexer.

See #108.
@alecthomas alecthomas changed the title Stable v1 release (API changes) Stable v2 release (API changes) Nov 26, 2020

3 participants