
Generating parser code #213

Open
petee-d opened this issue Jan 29, 2022 · 25 comments

@petee-d
Contributor

petee-d commented Jan 29, 2022

Hi there again @alecthomas,

I'm already using this great parser in a large project (a GoLang implementation of Jinja, will be open sourced eventually) and one of the things that somewhat bothers me is the speed and GC pressure (allocations) of the parser. I've considered using the generated lexer to improve it, but it's just not enough. So what if this library could also generate code for the parser? Is this something you've already considered or even started playing with?

I actually already did and have a very ugly prototype that can generate the parser code for this subset of features:

  • string and struct fields only
  • supported nodes (at least mostly) are strct, sequence, token reference, token literal, capture (@) and group (?, *, +, not ! yet)
  • totally trash error reporting
  • max lookahead, case insensitive tokens and probably other options are not respected yet

Example grammar it can parse consistently with the native parser (except for error messages):

type Stuff struct {
	Pos    lexer.Position
	A      string   `'test' @Ident @Ident`
	Sub    SubStuff `@@+`
	B      string   `@Ident`
	EndPos lexer.Position
}

type SubStuff struct {
	Object    string `'(' @Ident 'is'`
	Adjective string `    @Ident ')'`
}

// Uses lexer.TextScannerLexer

For this grammar and with a pre-built PeekingLexer (so that I compare only parsing speed), here are the benchmarks at this moment:

Input string, pre-lexed: "test str ing (this is fast) (this is fast) (this is fast) (this is fast) (this is LAST) end"
BenchmarkNative
BenchmarkNative-8      	  282736	     18882 ns/op	   11536 B/op	     207 allocs/op
BenchmarkGenerated
BenchmarkGenerated-8   	10671034	       576.2 ns/op	       8 B/op	       1 allocs/op

The reason for being >30x faster for this particular grammar (likely a very cherry-picked example) is that the generated code:

  • avoids allocations as much as humanly possible; in fact, the only allocation it does in that example is when concatenating strings for Stuff.A
  • doesn't use Reflect, obviously - the huge benefit of generated code
  • avoids unnecessarily small functions - it generates 1 function per struct plus 1 wrapper, and uses goto (responsibly) to solve things that would otherwise need calling a function or duplicating code (a rough sketch of this shape follows the list)
  • also uses some optimizations that could be applied to the native parser (for example: allocation-free PeekingLexer.Clone alternative, avoiding string concatenation when capturing values, etc.)
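
For illustration, here is a minimal, self-contained sketch of the rough shape such generated code could take for the SubStuff rule above - the token type, function name, and error handling are simplified stand-ins, not the actual generated code:

type tok struct{ typ, val string } // simplified stand-in for lexer.Token

// One generated function per struct: match tokens directly, capture into the
// fields without reflection, and jump to a shared failure label instead of
// calling helper functions.
func parseSubStuff(toks []tok, pos int, out *SubStuff) (int, error) {
	start := pos
	if pos+4 >= len(toks) ||
		toks[pos].val != "(" ||
		toks[pos+1].typ != "Ident" ||
		toks[pos+2].val != "is" ||
		toks[pos+3].typ != "Ident" ||
		toks[pos+4].val != ")" {
		goto fail
	}
	out.Object = toks[pos+1].val    // capture @Ident straight into the field
	out.Adjective = toks[pos+3].val // capture @Ident straight into the field
	return pos + 5, nil

fail:
	return start, fmt.Errorf("subStuff: unexpected token at %d", pos)
}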

It's very possible I'll run into an issue I won't be able to overcome, but for now it seems like this should be very doable. The generated code isn't too long (160 LOC for the above grammar), is quite well isolated (adds one generic method to all nodes, plus a file for code generation utilities) and doesn't introduce any new dependencies. For now I would just like to let you know I'm working on this, so we can coordinate any similar efforts. :) I would also appreciate your support with a few questions (will post them in this issue later).

What do you think? Cheers!

@petee-d
Contributor Author

petee-d commented Jan 29, 2022

Question 1: Is my understanding correct that the nodes optional (duplicating group with ?) and repetition (duplicating group with *) are no longer used and can be removed?

@alecthomas
Owner

I'm already using this great parser in a large project (a GoLang implementation of Jinja, will be open sourced eventually) and one of the things that somewhat bothers me is the speed and GC pressure (allocations) of the parser. I've considered using the generated lexer to improve it, but it's just not enough. So what if this library could also generate code for the parser? Is this something you've already considered or even started playing with?

Hi there, glad you're finding it useful. Yes definitely. There's a ticket tracking performance here.

Those are some nice performance improvements, I like it.

My biggest concern really is supporting the parser options, particularly lookahead. How are you dealing with that?

Question 1: Is my understanding correct that the nodes optional (duplicating group with ?) and repetition (duplicating group with *) are no longer used and can be removed?

That's correct. It's probably the last remaining thing I would like to do on 2.x before moving it out of alpha.

@petee-d
Contributor Author

petee-d commented Jan 29, 2022

My biggest concern really is supporting the parser options, particularly lookahead. How are you dealing with that?

I guess I'll know once I progress far enough to try to implement that option (first I'd like to do disjunction and at least some non-string field types). But the groups (? * +), as I implemented them in the generated code, should basically be capable of utilizing infinite lookahead, and the option is just a matter of limiting that. Disjunction won't be trivial, but what worked for groups should mostly work there too.

The generated code is still a recursive descent parser, it just has the benefit of knowing the rules, types and options at compile time - this saves a TON of allocations, which were responsible for most of the overhead, along with reflection. It basically calls lexer.Next() with the same decisions as the node.Parse methods, but can afford to be much smarter about the overhead around it and parse tree construction.

That's correct. It's probably the last remaining thing I would like to do on 2.x before moving it out of alpha.

OK, so I'll just leave their methods for parser code generation unimplemented for now.

@petee-d
Contributor Author

petee-d commented Feb 6, 2022

A small update for this weekend's progress (can't really give it much time during working days):

  • successfully implemented disjunction, lookaheadGroup and group with ! (non-empty match) - only missing negation now (should be easy)
  • it should hopefully support all field types (bool, ints, uints, floats, strings, structs; pointers to and slices of all the former; plus position and tokens fields) - although for some combinations it might not capture values exactly the same way participle does, I think I will be able to fix that
  • I started working on error handling and it's a pretty deep rabbit hole - I still need to take care of using the deepest branch error and efficiently respecting lookahead - but it's definitely doable, just more mind-boggling than I had hoped :)

With the features above, the generated code is now able to parse the Jinja templating language with a quite complex grammar (the EBNF has 50 lines and 500 words) that actually uses every single node type except negation and positive lookahead group. It also uses a very representative subset of the field types. The test suite for our Jinja interpreter passes completely with it, with the exception of tests that compare parsing errors, which are currently not always consistent.

Here's the benchmark of parsing alone, with a pre-created PeekingLexer:

	BenchmarkParseOnlyNative-8   	     871	   3890325 ns/op	 2737872 B/op	   36392 allocs/op
	BenchmarkParseOnlyGen-8   	   19220	    183382 ns/op	   98744 B/op	     796 allocs/op

That's a >20x improvement in speed and an even bigger improvement in allocations.

By the way, I've also tried the experimental generated lexer, which was able to improve lexing quite a bit but has some issues - it's missing support for lazy quantifiers and even has a pretty significant bug (I will try to find it and make a PR later). I then wrote a lexer by hand; here are the results for comparison:

	BenchmarkLexNative-8   	     964	   1193749 ns/op	   88051 B/op	    1162 allocs/op
	BenchmarkLexGen-8   	    4207	    240589 ns/op	   65760 B/op	       3 allocs/op
	BenchmarkLexHand-8   	   18738	     63015 ns/op	  130592 B/op	       9 allocs/op

(the extra allocations in the hand-written version are something I will be able to optimize)

@alecthomas
Owner

Very nice!

I started working on error handling and it's a pretty deep rabbit hole, I still need to take care of using the deepest branch error and efficiently respecting lookahead - but it's definitely doable, just more mind boggling than I had hoped :)

This doesn't surprise me, and unfortunately error handling isn't really that great anyway. My main concern, and why I've held off on writing a parser generator, is that there are a lot of features and improvements I'd still like to add, and I'm concerned about having to maintain two parallel implementations.

That said, I think a generator is very important.

But I do wonder if there isn't a hybrid approach where the runtime parser is refactored to drive both the reflection-based implementation and the code-generated implementation. The high-level parsing logic - error reporting, recovery, etc. - would live in the parser runtime, while the low-level path would be accelerated by the code generation.

PS. The two big improvements I'm thinking of are:

  1. A significant improvement in error reporting.
  2. Support for error recovery so Participle can be used interactively, eg. in editors, LSPs.

@alecthomas
Owner

it's missing the support for lazy quantifiers and even has a pretty significant bug

This does not surprise me - it is definitely still experimental.

@petee-d
Contributor Author

petee-d commented Mar 1, 2022

Hey, another update after a long time - I was sick for a few of the past weeks, so my work on this slowed down quite a lot.

First of all, I only just noticed this branch with parser codegen exists 😅. I duplicated some of the effort but took a quite different approach (one function per struct, with some even inlined, rather than one per node; not returning slices of tokens but capturing them into struct fields as soon as possible). And my approach seems significantly faster and does fewer allocations, so it's probably for the better.

BenchmarkGeneratedOld-8   	  994354	      1070 ns/op	     248 B/op	      12 allocs/op
BenchmarkGeneratedNew-8   	 4264864	       277.1 ns/op	      96 B/op	       3 allocs/op

As for a status update, I'm getting close to a relatively stable implementation, so I might push something for your initial comments within a week. Error handling seems quite consistent with the reflection-based parser now, and lookahead works too. I'm just patching up an issue where a failed branch (disjunction/group) may leave some unwanted captured fields behind.

A few questions for you, if you could spare the time:

  1. Can I safely remove repetition and optional nodes in my PR? I see you did it in your parser-codegen branch, so I assume it should be fine. Or would you like to do it in a separate PR earlier?
  2. There is a slightly breaking change I would like to make to PeekingLexer for the sake of improving performance. Currently, PeekingLexer.rawCursor points to the token right after the last returned non-elided token. I would like to change it to always point to the next non-elided token instead (a small sketch of the difference follows this list). The reason is that in the current design, something like Peek(0) (which the reflection-based code calls quite often, mine less so, but still often enough) always has to use a for loop to find the next non-elided token to return, which adds considerable overhead, especially if there are many elided tokens, but even if there are none. My change makes PeekingLexer find the next non-elided token during initialization, and then that for loop only has to run in Next or Peek(n) where n > 0, not for Peek(0). It made the reflection-based parser only a tiny bit faster, but it's very significant in my generated parser. The only issue is that it will change the behavior of the Tokens []lexer.Token field in structs - previously, it could have started with an elided token and would never end with one; now it can end with an elided token but never start with one.
  3. What do you think is the best approach for writing test coverage for this generated code? Should the generated code for each test suite using some grammar be committed in the repository? The alternative is probably much nastier, but I guess it could be done...
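
To make point 2 concrete, here's a small sketch of the difference, with simplified stand-in types rather than the real PeekingLexer fields:

type token struct {
	typ    int
	elided bool
}

type peeking struct {
	tokens []token // assumed to always end with a non-elided EOF token
	cursor int
}

// Current design: the cursor may rest on an elided token, so every Peek(0)
// has to scan forward, even when there is nothing to skip.
func (p *peeking) peekOld() token {
	i := p.cursor
	for p.tokens[i].elided {
		i++
	}
	return p.tokens[i]
}

// Proposed design: the scan already happened in next (and at construction),
// so Peek(0) is a single index with no loop.
func (p *peeking) peekNew() token {
	return p.tokens[p.cursor]
}

// next returns the current token and eagerly skips elided tokens so the
// cursor again points at a non-elided one.
func (p *peeking) next() token {
	t := p.tokens[p.cursor]
	p.cursor++
	for p.cursor < len(p.tokens) && p.tokens[p.cursor].elided {
		p.cursor++
	}
	return t
}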

@alecthomas
Owner

  1. Can I safely remove repetition and optional nodes in my PR? I see you did it in your parser-codegen branch, so I assume it should be fine. Or would you like to do it in a separate PR earlier?

Yep that should be fine.

  2. There is a slightly breaking change I would like to make to PeekingLexer for the sake of improving performance. Currently, PeekingLexer.rawCursor points to the token right after the last returned non-elided token. I would like to change it to always point to the next non-elided token instead. The reason is that in the current design, something like Peek(0) (which the reflection-based code calls quite often, mine less so, but still often enough) always has to use a for loop to find the next non-elided token to return, which adds considerable overhead, especially if there are many elided tokens, but even if there are none. My change makes PeekingLexer find the next non-elided token during initialization, and then that for loop only has to run in Next or Peek(n) where n > 0, not for Peek(0). It made the reflection-based parser only a tiny bit faster, but it's very significant in my generated parser. The only issue is that it will change the behavior of the Tokens []lexer.Token field in structs - previously, it could have started with an elided token and would never end with one; now it can end with an elided token but never start with one.

From memory this is used when capturing into a Tokens []Token or EndPos Pos field. If my memory is correct, the semantics can't change.
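
For reference, these are the special fields in question, which participle fills automatically when they appear on a grammar struct (a minimal example, not from the discussed branch):

type Stuff struct {
	Pos    lexer.Position // start position of the match, filled automatically
	EndPos lexer.Position // end position of the match, filled automatically
	Tokens []lexer.Token  // raw tokens consumed by the match, filled automatically

	Value string `@Ident`
}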

  3. What do you think is the best approach for writing test coverage for this generated code? Should the generated code for each test suite using some grammar be committed in the repository? The alternative is probably much nastier, but I guess it could be done...

I've checked in code for generated lexers, but it hasn't been ideal. Maybe do one or two checked in, then we can consider what the best approach is.

@alecthomas
Owner

PS. Thanks for the update, and I'm glad you're feeling better.

@klondikedragon
Contributor

@petee-d - I'm very interested in trying out your prototype parser generator, offering some feedback on how it works for my use case, and helping if I can with any code changes to get it production-ready. Is there a branch with your work in progress I could try with my prototype?

I've got a prototype lexer/parser with participle (kudos to @alecthomas, I'm loving the interface!), and with a fairly complex grammar, I'm parsing hundreds of thousands of very small documents, so performance is critical. Parsing about 200,000 documents totaling about 256 MB of text produces about 26 GB of total memory allocated (only <100 MB allocated at any given time, but gobs of memory allocated in total over the 200,000 documents parsed, and about 600 GCs triggered during the parsing run). This is even with the experimental generated lexer. The generated lexer by itself is very fast.

This level of allocation with the reflection-based parser is causing a >20x slowdown vs just lexing the data. I think the generated parser would likely be very fast. I'd love to see if this is the case, rather than hand-rolling a parser and using the generated lexer.

I have a test suite with about 500 comprehensive tests that cover the grammar, so I'll be able to offer feedback if your experimental generated parser produces the same results as the reflection parser, etc.

@petee-d
Contributor Author

petee-d commented Aug 11, 2022

@klondikedragon Thanks for the comment! My work priorities have shifted quite a bit since the last time I wrote here and I didn't have any capacity to continue this. It's still very much on my mind and I wanted to finish it in my personal time, but wasn't able to. You've definitely renewed my interest. :) I would be very happy to see what performance improvements it gives in your use case (I'm hopeful even 20x should be achievable) and whether the generated parser passes your test suite.

Unfortunately I just tried to rebase my WIP branch to current participle master and since @alecthomas has been quite busy (great changes, BTW!), there was a massive conflict.

If you're not relying too much on the changes done in participle this year (my old base commit is Dec 30th 2021...), you could try using this newly pushed branch in my fork: master...petee-d:participle:old-master-parser-generation

I will definitely try to resolve those conflicts soon and push a branch rebased to master, but it might be worth giving the old branch a shot.

How to use it:

  • There is no CLI utility to generate the parser (but the same is true for the lexer). How you call it depends on whether you want the generated code to live in the same package as the parser, or in a separate package. A separate package offers the benefit of being able to regenerate the parser more easily after changes to the grammar structs - otherwise some manual cleanup may be needed just to make the package compilable again.
    • Same package: os.WriteFile("./generated.go", participle.GenerateCode(parser, "currentPackageName", false), 0774)
    • Different package: os.WriteFile("generatedPackageName/generated.go", GenerateCode(StuffParser, "generatedPackageName", true), 0774)
  • I plan to enable using the generated parser just by supplying a participle.ParseOption to the existing methods. For now, you need to do something like this, assuming the generated code lives in package generated, your grammar's name is Stuff and Lexer is your generated lexer:
    func TestGeneratedParser(t *testing.T) {
        myInput := "my input string"
        lex, err := Lexer.LexString("", myInput)
        require.NoError(t, err)
        peeking, err := lexer.Upgrade(lex)
        require.NoError(t, err)
        var stuff Stuff
        require.NoError(t, generated.ParseStuff(peeking, &stuff, false))
        fmt.Printf("%#v", stuff)
    }
    

Good luck, eagerly waiting to hear about your results, if you're able to use it.

@petee-d
Contributor Author

petee-d commented Aug 11, 2022

Also, please coordinate with me before contributing on top of my mess - I would definitely first like to rebase it, split it into more reasonable commits, and discuss certain breaking changes (minor, but breaking) I made to core participle.

@klondikedragon
Contributor

@petee-d -- that's awesome! I will try out this experimental branch and report back with the results. It might take me a few days. I'll share the perf/test-suite results and coordinate before attempting any PR changes.

@klondikedragon
Contributor

@petee-d - I'm working through it and have it generating the parser code from the grammar.

Note that on the older version of participle, about 40 of my test cases are failing even without using the generated parser. It looks like it has to do with lexer changes/fixes in the newer version. I haven't dug into it too deeply yet. My lexer definition has some very complex regular expressions and different states, so I haven't spent the time to completely understand it. If you have time to rebase the code against the latest, hopefully the newest code will pass cleanly. (Note that the generated lexer on the current version of participle also doesn't pass the test suite, so that also points to potential changes in the lexer since the auto-gen lexer code was crafted.)

I had it generate code in the same package, since I found that when generating in the mode for a different package, the "source" package wasn't auto-detected correctly (it had it as the participle package, instead of the application package that contained the grammar structures). No big deal - maybe that is hard to detect, and you could just ask the caller to specify the name of the package that contains the grammar structs (an empty string indicating the same package) instead of passing a bool.

One minor issue, I have in the grammar a rule like the following, where it's indicating zero or more elements, and it's stored into an array of pointers to the struct representing the captured rule:
[screenshot: grammar rule capturing zero or more elements into a slice of struct pointers]

The generated code is assuming that the array is an array of the struct, instead of an array of a pointer to the struct:
[screenshot: generated code appending struct values rather than pointers]

This generated code had to be manually tweaked a little bit to compile:
[screenshot: the generated code manually tweaked to compile]

It might be the case that it's actually more efficient to just use an array of struct and the auto-gen code was expecting that? I tried changing the grammar definition to just have arrays of structs instead of struct pointers and then the autogen code didn't have that issue.
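
To illustrate the mismatch (the screenshots above aren't reproduced here, so this is a simplified, hypothetical reconstruction with made-up names):

type aeRule struct {
	Name string `parser:"@Ident"`
}

type aeData struct {
	DataElements []*aeRule `parser:"@@*"` // slice of pointers in the grammar
}

func appendElement(d *aeData, parsed aeRule) {
	// The generated code effectively did the equivalent of:
	//   d.DataElements = append(d.DataElements, parsed) // compile error for []*aeRule
	// and had to be tweaked to take the address instead:
	d.DataElements = append(d.DataElements, &parsed)
}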

Another small issue: it complained about the "strconv" package not being used, so that import had to be removed to compile.

At this point, I got the generated parser compiling, but it was panicking while running the test suite, complaining about an out-of-bounds index in (*PeekingLexer).Current. It turns out this is because I have a very unusual rule in my grammar that is designed to match anything. For unusual reasons, I have a "capture all" garbage rule, last in the disjunction of possible rules, that will match any token. Rather than matching a list of all tokens, the rule definition was lazy and was defined to be either the whitespace token or the negation of the whitespace token (i.e. any other token). The generated parser was matching literally anything in the negation case, not considering the EOF condition.

Here is a simpler version of what the grammar looks like:

type aeData struct {
	DataElements []aeDataElement `parser:"@@*"`
}

type aeDataElement struct {
	Pos    lexer.Position
	EndPos lexer.Position
	Rule1 *aeRule1 `parser:"( @@"`
	Rule2 *aeRule2 `parser:"| @@"`
	// otherwise it's garbage, just soak up any token
	Garbage string `parser:"| @~Whitespace | @Whitespace )"`
}

In the main generator function, inside the generated for loop for group AeDataElement* it had code that looked like this for handling the Garbage disjunction case:

// capture Garbage from ~<whitespace>
// negation ~<whitespace>
{
	branchCheckpoint := p.Lex.MakeCheckpoint()
	// reference <whitespace>
	if current.Type != -42 {
		p.SetTokenError("")
		goto negation401Error
	}
	if vAeDataElement.Garbage == "" {
		vAeDataElement.Garbage = current.Value
	} else {
		vAeDataElement.Garbage += current.Value
	}
	_, _ = p.Lex.Next()
	current = p.Lex.Current()
	p.SetTokenError("")
	goto disjunction44Alt2Error
negation401Error:
	p.Lex.LoadCheckpoint(branchCheckpoint)
	current = p.Lex.Current()
	p.SetNoError(false)
	if vAeDataElement.Garbage == "" {
		vAeDataElement.Garbage = current.Value
	} else {
		vAeDataElement.Garbage += current.Value
	}
	_, _ = p.Lex.Next()
	current = p.Lex.Current()
}

Notice that in the negation401Error case it advanced the token unconditionally, and then the loop right below would go through again.

At first, I tried changing the (*PeekingLexer).Current function to just return &p.eof in the case that rawCursor was advanced beyond the end:

func (p *PeekingLexer) Current() *Token {
	if int(p.rawCursor) >= len(p.tokens) {
		return &p.eof
	} else {
		return &p.tokens[p.rawCursor]
	}
}

However, this then resulted in the for loop continuing until it had captured 1000000 garbage elements (that negation case just kept on capturing, even the EOF token).

I manually modified the auto-generated code to put a check for Lexer EOF at the top of the for loop for the group AeDataElement* case:

	// group AeDataElement*
	for matches := 0; ; matches++ {
		// =============== (manual edit begin)
		if p.Lex.Current().EOF() {
			break
		}
		// =============== (manual edit end)
		branchCheckpoint := p.Lex.MakeCheckpoint()

With this manual edit to the generated code, it's now able to run the full test suite! Interestingly, only 19 test cases failed with the auto-generated parser, instead of 40 with the reflection-based parser. The 19 failing cases all look to be related to the older lexer code producing wrong results (not related to the auto-gen parser). So the auto-generated parser somehow fixed some issues in the reflection parser for this older version of participle. (In the newer participle, all test cases pass with the reflection parser.) Now that I have this running, I'll collect benchmark results and report back!

@alecthomas
Owner

Awesome result and thanks for the update!

@petee-d
Contributor Author

petee-d commented Aug 13, 2022

@klondikedragon thanks for trying the old branch! Rebasing to master is going to take some effort as there really were quite a few changes, plus I'd like to rework some stuff I initially did in a quick & dirty way into a version I would actually dare to submit as a merge request. :) But I have already started working on it.

has to do with Lexer changes/fixes in the newer version

Very possible - I knew about one or two bugs in the generated lexer implementation in the December version of participle master I was using. It's very likely they have been fixed in the meantime. But do I understand correctly that generating the lexer with current master didn't help? Maybe I'll look into it later - I've actually been thinking of trying to improve the lexer code generation as well, but that's for much, much later.

the "source" package wasn't auto-detected correctly (it had it as the participle package)

I see it, I suppose I was trying something and forgot to revert it as the example grammar I was testing the dstPackageExternal option with was actually located in the participle package itself. :D I will definitely completely rework those options, the current interface was just a quick way to achieve what I needed for testing it during development.

generated code is assuming that the array is an array of the struct, instead of an array of a pointer to the struct

Yep, that's a bug, already see it, will fix. Wasn't intentional at all, although it's true you are better off not wrapping the slice elements in pointers anyway.

complained about "strconv" package not being used

I know about that issue actually, it just didn't bother me too much as in my tests strconv was always used. I intend to fix it by adding var _ = strconv.ParseInt, as that's easier than detecting whether strconv is actually needed. :D

out of bounds index in (*PeekingLexer).Current

Thanks for describing the problem so well, will look into it. :) So far I'm thinking of just adding p.eof to the end of p.tokens and making sure Next and other methods won't advance beyond it. That will allow Current (or even Peek in the current master) to work without any bounds checks, which is IMHO preferable as it's a huge hot spot. The second suggested fix (checking for EOF if the token type is consumed indiscriminately by the generated parser) seems like the right solution there.
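
A minimal sketch of that sentinel idea, with simplified stand-in types rather than the actual PeekingLexer fields:

type token struct {
	value string
	eof   bool
}

type peeking struct {
	tokens []token // the EOF token is appended once as the final element
	cursor int
}

func newPeeking(toks []token, eof token) *peeking {
	return &peeking{tokens: append(toks, eof)}
}

// Current needs no bounds check because the cursor can never pass the sentinel.
func (p *peeking) current() *token {
	return &p.tokens[p.cursor]
}

// next never advances beyond the EOF sentinel.
func (p *peeking) next() *token {
	t := &p.tokens[p.cursor]
	if p.cursor < len(p.tokens)-1 {
		p.cursor++
	}
	return t
}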

the auto-generated parser somehow fixed some issues in the reflection parser

No idea what's happening there, but I'll test this stuff thoroughly after the rebase to make sure there are no bugs either way.

I'll collect benchmark results and report back

Eagerly awaiting the results! :)

@klondikedragon
Contributor

@petee-d - the results are pretty incredible: big speedups in wall clock time, CPU time, and total allocations saved...

I did various test runs as well as CPU/allocation-site pprof profiling to analyze the behavior.

Here is the performance of the application parsing about 140,000 documents (totals to 250 MB in aggregate) using various combos of reflection or generated lexers/parser (all from your experimental branch):

code variant                    wall clock  GC % of CPU   Memory allocation stats
==============================================================================================================================================================
stdlexer-stdparser              39 seconds          25%   MEMORY: Alloc =  83 MiB, TotalAlloc = 24962 MiB, HeapInuse =  92 MiB, Sys = 285 MiB, NumGC = 460
stdlexer-genparser              14 seconds          20%   MEMORY: Alloc =  56 MiB, TotalAlloc =  5078 MiB, HeapInuse =  74 MiB, Sys = 336 MiB, NumGC =  90
genlexer-genparser               7 seconds          30%   MEMORY: Alloc = 115 MiB, TotalAlloc =  4923 MiB, HeapInuse = 124 MiB, Sys = 308 MiB, NumGC =  91
pooledgenlexer-genparser         5 seconds          25%   MEMORY: Alloc =  60 MiB, TotalAlloc =  3649 MiB, HeapInuse =  78 MiB, Sys = 297 MiB, NumGC =  66
pooledgenlexer-genparser-nopost  4 seconds          22%   MEMORY: Alloc =  65 MiB, TotalAlloc =  2559 MiB, HeapInuse =  67 MiB, Sys = 154 MiB, NumGC =  90
nolexer-noparser-nopost          3 seconds          22%   MEMORY: Alloc =  80 MiB, TotalAlloc =   745 MiB, HeapInuse =  82 MiB, Sys = 123 MiB, NumGC =  34

(The input was the same each time, and it's a deterministic workload. Based on CPU profiling, the wall clock time is roughly equivalent to CPU time--app is CPU-bound the whole time.)

Note the last line shows the overhead of the application just to receive/decompress the document data, so to understand lexer/parsing overhead, we need to subtract that line. Also note the difference between pooledgenlexer-genparser and pooledgenlexer-genparser-nopost... the -nopost label indicates there is no application post-processing of the parsed data. This is also overhead that should be subtracted to really understand the overhead of just the lexer/parser.

In other words, about 4 seconds (3 seconds plus (5 minus 4) seconds) of wall clock/CPU time is spent by the app doing things other than lexing/parsing, and about 1835 MB (745 MB plus (3649 - 2559) MB) of the total allocations are from the app doing other things. We can thus break down the wall/CPU time and total allocation performance of the various combinations as follows:

code variant               lexer+parser time   CPU speedup   TotalAlloc   AllocReduction
========================================================================================
stdlexer-stdparser                35 seconds           N/A    23127 MiB              N/A
stdlexer-genparser                10 seconds          3.5x     3243 MiB             7.1x
genlexer-genparser                 3 seconds         11.6x     3088 MiB             7.5x
pooledgenlexer-genparser        ~1.5 seconds         23.3x     1814 MiB            12.7x

So with the code as-is, the CPU speedup is 11.6x and a total allocated bytes reduction of 7.5x! This is already fantastic!

Profiling the genlexer-genparser case, a large portion of the time and total allocations came from the formation of the PeekingLexer struct and the formation/extension of the tokens slice inside it. As a result, I added an option to use sync.Pool in Upgrade to create the PeekingLexer, removing this overhead and producing a 23.3x CPU speedup and a 12.7x reduction in total allocated bytes. My use case is probably unusual as I'm parsing thousands of relatively small documents instead of a small number of relatively larger docs.

For reference, here is the code for the pooled peeking lexer:
var peekingLexerPool = sync.Pool{
	New: func() interface{} {
		return &PeekingLexer{
			elide: make(map[TokenType]bool, 4),
		}
	},
}

// Upgrade a Lexer to a PeekingLexer with arbitrary lookahead.
// Faster if you need to lex thousands of similar documents.
//
// "elide" is a slice of token types to elide from processing.
//
// You must call `PutBackPooledPeekingLexer` once done with the
// returned lexer in all cases (ok or error).
func UpgradePooled(lex Lexer, elide ...TokenType) (*PeekingLexer, error) {
	r := peekingLexerPool.Get().(*PeekingLexer)
	// reset the state of the PeekingLexer to empty (preserving any allocated capacity)
	r.Checkpoint.rawCursor = 0
	r.Checkpoint.cursor = 0
	// note: this preserves capacity
	r.tokens = r.tokens[:0]
	for k := range r.elide {
		delete(r.elide, k)
	}
	return fillInPeekingLexer(lex, r, elide...)
}

func PutBackPooledPeekingLexer(r *PeekingLexer) {
	if r != nil {
		peekingLexerPool.Put(r)
	}
}

// Upgrade a Lexer to a PeekingLexer with arbitrary lookahead.
//
// "elide" is a slice of token types to elide from processing.
func Upgrade(lex Lexer, elide ...TokenType) (*PeekingLexer, error) {
	r := &PeekingLexer{
		elide: make(map[TokenType]bool, len(elide)),
	}
	return fillInPeekingLexer(lex, r, elide...)
}

func fillInPeekingLexer(lex Lexer, r *PeekingLexer, elide ...TokenType) (*PeekingLexer, error) {
	for _, rn := range elide {
		r.elide[rn] = true
	}
	if batchLex, ok := lex.(BatchLexer); ok {
		for {
			batch, err := batchLex.NextBatch()
			if err != nil {
				return r, err
			}
			r.tokens = append(r.tokens, batch...)
			last := batch[len(batch)-1]
			if last.EOF() {
				r.eof = last
				break
			}
		}
	} else {
		for {
			t, err := lex.Next()
			if err != nil {
				return r, err
			}
			r.tokens = append(r.tokens, t)
			if t.EOF() {
				r.eof = t
				break
			}
		}
	}
	r.rawCursor = r.findNonElided(0) // Set rawCursor to the first non-elided token
	return r, nil
}

So the performance of your generated parser combined with the generated lexer is very impressive for my use case! It's able to parse/lex about 250 MB of text in ~1.5 seconds (excluding time to load from disk, etc.).

The performance of the generated lexer/parser using sync.Pool for the PeekingLexer is very good for this application's needs, and further optimization wouldn't be needed. It seems the generated parser is also producing accurate results; it's the generated lexer that has more accuracy issues. Once the branch is rebased I can try it against the latest reflection parser to see if the generated parser passes the full test suite without any accuracy issues. I'm definitely motivated to help as I can, and I'll try to isolate some simple cases where the generated lexer is not accurate.

Since I was curious, I profiled the genlexer-genparser case, and it looks like most of the parsing time is just spent in the Go runtime allocating objects needed for the AST. Here is a flamegraph showing just the portion of CPU time for the auto-generated parser:

[flamegraph screenshot: parser CPU time dominated by runtime allocation]

Note in the flamegraph that nearly all the orange blocks taking up CPU time are related to the runtime's allocation work, rather than CPU time for actual parser work. This could potentially be optimized by using a sync.Pool-style "arena" for allocating AST objects, similar to how fastjson does it. But it doesn't seem worth the complexity, especially since the codegen has to deal with arbitrary user structures. I only mention this to point out that the generated parser is ridiculously fast and is now only limited by the speed of the Go runtime itself!

It's much appreciated if you have time to make progress on your branch! I'll be excited to try it out and report back on accuracy in combo with the current reflection lexer. In the meantime I'll try to isolate the bugs I'm seeing in the generated lexer.

@klondikedragon
Contributor

P.S. using the pooled PeekingLexer is easy enough, and Upgrade is still available if using a pool isn't desired:

data := &aeData{}
stringLexer, _ := aelexer.Lexer.(lexer.StringDefinition)
lex, err := stringLexer.LexString("", s)
if err != nil { return nil, err }
peekingLex, err := lexer.UpgradePooled(lex)
defer lexer.PutBackPooledPeekingLexer(peekingLex)
if err != nil { return nil, err }
err = ParseAeData(peekingLex, data, true)
...

@petee-d
Contributor Author

petee-d commented Aug 13, 2022

@klondikedragon those are some really interesting results, thank you very much for sharing them. :)

Could you also do a comparison benchmark of just the parsing step on a reasonably representative single sample? Meaning you would construct a single PeekingLexer at the start of the benchmark and then run just the parsing on that lexer in a loop (restoring the Checkpoint before each iteration) and do no post-processing. That would make the difference in parsing alone really easy to see.
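
Something along these lines, reusing the names from my earlier example (Lexer, Stuff, generated.ParseStuff) and the MakeCheckpoint/LoadCheckpoint methods from the WIP branch - treat it as a sketch, the final API may differ:

func BenchmarkParseOnlyGenerated(b *testing.B) {
	lex, err := Lexer.LexString("", representativeInput) // representativeInput is your sample document
	if err != nil {
		b.Fatal(err)
	}
	peeking, err := lexer.Upgrade(lex)
	if err != nil {
		b.Fatal(err)
	}
	checkpoint := peeking.MakeCheckpoint() // state before anything is consumed
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		peeking.LoadCheckpoint(checkpoint) // rewind instead of re-lexing
		var stuff Stuff
		if err := generated.ParseStuff(peeking, &stuff, false); err != nil {
			b.Fatal(err)
		}
	}
}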

Also, it would be nice to show the number of allocations instead of the allocated volume - that's much more indicative of allocation pressure and related CPU slowdown.

Anyway, it does indeed seem like the bottleneck is the allocation of the AST itself. You could also try optimizing the AST structs to avoid pointers whenever possible - copying is generally much cheaper than an allocation.

As for pooling each pointer type used in the AST (and maybe even slices, although those are trickier), this is actually something I already considered and will maybe do in the future. It might not be that complex - creating the sync.Pools for each unique type is quite easy, as is using them when doing allocations; the only somewhat difficult part is generating a bunch of functions that can "destroy" an AST - recursively dismantle it and return the objects to the pools.
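
A very rough sketch of the idea with made-up AST types (this isn't part of any branch):

type Leaf struct{ Value string }

type Node struct {
	Name     string
	Children []*Leaf
}

// One sync.Pool per AST type, plus generated constructors that allocate from
// the pool instead of the heap.
var leafPool = sync.Pool{New: func() interface{} { return new(Leaf) }}
var nodePool = sync.Pool{New: func() interface{} { return new(Node) }}

func newLeaf() *Leaf { return leafPool.Get().(*Leaf) }
func newNode() *Node { return nodePool.Get().(*Node) }

// The tricky part: a generated "destroy" function per type that recursively
// dismantles a parsed tree and returns every object to its pool.
func destroyNode(n *Node) {
	for _, l := range n.Children {
		*l = Leaf{}
		leafPool.Put(l)
	}
	n.Name = ""
	n.Children = n.Children[:0] // keep the slice's capacity for reuse
	nodePool.Put(n)
}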

@klondikedragon
Contributor

Oops, LOL - sorry, I had the input size off by a factor of 10x in my previous post; the input size in the above tests was 25 MB, not 250 MB. Still, that means the app was doing about 16 MB/sec of lexing/parsing throughput for a mixed real-life workload, which is great.

Pooling AST types (and possibly slices) would definitely be a nice bonus. Thanks for the tip on avoiding pointers in AST structs! This app will be operating in a low-memory container, so allocation sizes might also be relevant, as they could trigger more GC if the app backs up against the available memory.

For a representative "large" doc (about 410 bytes), here are some results where the lexer/parser is generated each time:

cpu: AMD Ryzen 7 4700U with Radeon Graphics
BenchmarkReflectLexerReflectParser-8        1644            610855 ns/op          392370 B/op       5159 allocs/op
BenchmarkGenLexerReflectParser-8            2096            544034 ns/op          387531 B/op       4970 allocs/op
BenchmarkReflectLexerGenParser-8           10000            112025 ns/op           53660 B/op        331 allocs/op
BenchmarkGenLexerGenParser-8               25848             45990 ns/op           50184 B/op        143 allocs/op
BenchmarkGenPooledLexerGenParser-8         28350             36449 ns/op           39953 B/op        141 allocs/op

Max throughput of 11.6 MB/sec. Observing a 16.8x CPU speedup and 36x allocs reduction.

And again with a representative small doc (about 42 bytes):

cpu: AMD Ryzen 7 4700U with Radeon Graphics
BenchmarkReflectLexerReflectParser-8        8402            134089 ns/op           88975 B/op       1088 allocs/op
BenchmarkGenLexerReflectParser-8           10000            121771 ns/op           88236 B/op       1078 allocs/op
BenchmarkReflectLexerGenParser-8           47071             25827 ns/op           12112 B/op         88 allocs/op
BenchmarkGenLexerGenParser-8               88209             13592 ns/op           11848 B/op         78 allocs/op
BenchmarkGenPooledLexerGenParser-8         84782             12886 ns/op           11936 B/op         79 allocs/op

Max throughput of 3.6 MB/sec. 10.4x CPU speedup and 13.7x allocs reduction.

The above represents the app use case more directly where it has to construct a lexer/parser each time.

As requested, here are some benchmarks constructing the lexer/parser once and restoring the checkpoint on each loop iteration (timer reset after lexer/parser construction, before the loop):

Large doc case:

cpu: AMD Ryzen 7 4700U with Radeon Graphics
BenchmarkCheckpointReflectLexerReflectParser-8              2478            514544 ns/op          354271 B/op       4955 allocs/op
BenchmarkCheckpointGenLexerReflectParser-8                  2300            509689 ns/op          354268 B/op       4955 allocs/op
BenchmarkCheckpointReflectLexerGenParser-8                 61378             19289 ns/op           16936 B/op        128 allocs/op
BenchmarkCheckpointGenLexerGenParser-8                     61602             19233 ns/op           16936 B/op        128 allocs/op

25.2 MB/sec. 26.7x CPU speedup and 38.7x alloc reduction!

Small doc case:

cpu: AMD Ryzen 7 4700U with Radeon Graphics
BenchmarkCheckpointReflectLexerReflectParser-8              9981            114210 ns/op           85947 B/op       1069 allocs/op
BenchmarkCheckpointGenLexerReflectParser-8                 10000            115339 ns/op           85947 B/op       1069 allocs/op
BenchmarkCheckpointReflectLexerGenParser-8                142838              8117 ns/op            9560 B/op         69 allocs/op
BenchmarkCheckpointGenLexerGenParser-8                    140283              8163 ns/op            9560 B/op         69 allocs/op

5.9 MB/sec. 14.0x CPU speedup and 15.5x alloc reduction!

Very impressive!

@alecthomas
Owner

I'm loving these numbers!

I have some concerns with the long term maintenance of both the lexer and parser code generators which we should solve (or at least, not make worse) before getting these changes into master.

I've opened #264 with some initial thoughts.

alecthomas added a commit that referenced this issue Sep 4, 2022
- Add a CLI tool that can ingest the JSON and dump out the generated code.
- Lexers can now be JSON marshalled.
- Add a goreleaser step for the binary.

As discussed in #213
@klondikedragon
Contributor

klondikedragon commented Dec 10, 2022

(edit: I had an issue with the PooledLexer benchmarks before, this has been fixed!)

After #263 merged, I've tried the latest WIP parsergen code. Lots of exciting progress!!

  • This latest approach solves the ergonomics issue IMO -- being able to compile the program even when the generated code is deleted solves the chicken/egg problem and it's also very convenient to select whether or not the generated parser is used just via a participle Option.
  • There's still an infinite loop in the generated code because of my unusual grammar (see the earlier comment in this issue). I was able to solve it using the same approach as described in that comment:
	// group AeDataElement*
	for matches := 0; ; {
		// =============== (manual edit begin)
		if c.Lex.Peek().EOF() {
			break
		}
		// =============== (manual edit end)
		branchCheckpoint := c.Lex.MakeCheckpoint()

This manual edit fixes the infinite loop. So adding a check for EOF at the top of the loop for group matches in the code generator is probably the simplest/easiest way to solve that?

  • This is failing 20 unit tests in my test suite -- the test results are now the same whether it's using the reflection lexer or the gen lexer, so there looks to be a bug in the generated parser. I'll dig in to see if I can produce a small test case and hopefully a fix also. All of the failing test cases look to be related to a single part of the grammar.
  • I'm seeing an 8%-10% speedup compared to the past results above! And combined with pooled allocation for the lexer, I'm seeing a 25%-30% speedup.

Results for the "large doc" (~410 bytes) case:

BenchmarkReflectLexerReflectParser-8             	    2296	    506118 ns/op	  356109 B/op	    4082 allocs/op
BenchmarkGenLexerReflectParser-8                 	    2534	    440994 ns/op	  351278 B/op	    3894 allocs/op
BenchmarkReflectLexerGenParser-8                 	   10000	    103200 ns/op	   52864 B/op	     334 allocs/op
BenchmarkGenLexerGenParser-8                     	   28342	     43525 ns/op	   49480 B/op	     146 allocs/op
BenchmarkGenPooledLexerGenParser-8               	   36002	     35948 ns/op	   16675 B/op	     135 allocs/op
BenchmarkCheckpointReflectLexerReflectParser-8   	    2914	    419529 ns/op	  318068 B/op	    3879 allocs/op
BenchmarkCheckpointGenLexerReflectParser-8       	    2984	    413428 ns/op	  318067 B/op	    3879 allocs/op
BenchmarkCheckpointReflectLexerGenParser-8       	   65199	     17668 ns/op	   16280 B/op	     131 allocs/op
BenchmarkCheckpointGenLexerGenParser-8           	   6

Small doc (42 bytes) case:

BenchmarkReflectLexerReflectParser-8             	   10000	    108762 ns/op	   79189 B/op	     840 allocs/op
BenchmarkGenLexerReflectParser-8                 	   12331	     97234 ns/op	   78520 B/op	     830 allocs/op
BenchmarkReflectLexerGenParser-8                 	   49084	     24309 ns/op	   12214 B/op	      90 allocs/op
BenchmarkGenLexerGenParser-8                     	   93055	     13167 ns/op	   11976 B/op	      80 allocs/op
BenchmarkGenPooledLexerGenParser-8               	  100496	     11749 ns/op	    9883 B/op	      73 allocs/op
BenchmarkCheckpointReflectLexerReflectParser-8   	   13179	     90247 ns/op	   76280 B/op	     821 allocs/op
BenchmarkCheckpointGenLexerReflectParser-8       	   13228	     89081 ns/op	   76280 B/op	     821 allocs/op
BenchmarkCheckpointReflectLexerGenParser-8       	  155655	      7591 ns/op	    9736 B/op	      71 allocs/op
BenchmarkCheckpointGenLexerGenParser-8           	  159429	      7646 ns/op	    9736 B/op	      71 allocs/op

When profiling the latest code, memory allocation / GC is the largest bottleneck. The pooled lexer eliminates one of the largest sources of that bottleneck for my use case (where I'm parsing hundreds of thousands of relatively small docs). I'll submit a PR for lexer pooling for consideration. If the generated parser were to ever use pools for its allocations it could speed things up further, but it's already very speedy!

@petee-d
Contributor Author

petee-d commented Dec 10, 2022

@klondikedragon thanks for sharing those results. I'm very glad you like the approach to that chicken/egg problem; it took quite some back-and-forth to arrive at that solution. I initially also wanted to somehow make it so that generating the parser code wouldn't even import the generated code, and thus you could re-generate the code even if changes to the grammar made it invalid - but I wasn't able to find any way to do that. Deleting the file seems like a reasonable compromise, as it can be added as a first step to some Make command the developer would use to regenerate the grammar.

As for that negation EOF bug, I forgot to look into it once I fixed the first issue with consuming EOF out of the peeking lexer. Fixed now on the branch, along with a second bug. It's possible the test failures you had will be fixed by that second fix.

BTW, I actually didn't have test coverage for negation yet and my Jinja parser doesn't use it, so I basically never tried it. But now the branch includes test coverage for negation with many edge cases, so I think it should be OK. If not, some small demonstration of the remaining issue would be very appreciated. :)

Meanwhile, I'm continuing with writing a test suite for the generated parser. Since it will also test the reflective parser on the same grammar and that grammar will test just about every participle feature once it's finished, I think it will even give better test coverage than all the existing test suites, especially for error reporting.

@klondikedragon
Contributor

@petee-d -- with your latest branch, it fixes the negation EOF bug, and also the test suite is now passing 100%!

With all of the work over the past months + the pooled lexer patch, there is now an incredibly impressive 25x speedup in parsing throughput vs the original reflection lexer/parser!! (41971 docs/second vs 1644 docs/second).

Participle takes the drudgery out of creating custom lexers/parsers, and now it delivers production-level performance also!

@alecthomas
Owner

This is super exciting, awesome work :)

Meanwhile, I'm continuing with writing a test suite for the generated parser. Since it will also test the reflective parser on the same grammar and that grammar will test just about every participle feature once it's finished, I think it will even give better test coverage than all the existing test suites, especially for error reporting.

If this one is better, let's delete the old test suite once this lands.
