
Streamline long term maintenance of lexer/parser code generators #264

Open
2 of 6 tasks
alecthomas opened this issue Aug 14, 2022 · 22 comments

Comments

@alecthomas
Owner

alecthomas commented Aug 14, 2022

I have some concerns about the long-term maintenance of both the lexer and parser code generators, which we should solve (or at least not make worse) before getting these changes into master.

Problem 1: divergence

My first concern is that as the runtime code evolves, it diverges from the code generators. This has already occurred with the lexer, where the generated code does not handle lookahead. There will need to be some tests to detect this somehow. For the parser, enumerating the nodes and ensuring they're all handled by the generator (somehow) might work.

  • Conformance tests for lexer
  • Conformance tests for parser
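A conformance harness could be as simple as feeding identical inputs to both implementations and diffing the token streams. A minimal sketch of that idea follows; the `Token` and `LexFunc` types here are simplified stand-ins, not participle's actual API:

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Token is a simplified stand-in for participle's lexer.Token.
type Token struct {
	Type  string
	Value string
}

// LexFunc tokenises an input; both the runtime (reflection) lexer and
// the generated lexer would be wrapped to satisfy this shape.
type LexFunc func(input string) []Token

// conform feeds the same inputs to the reference lexer and the
// generated lexer and reports the first divergence, if any.
func conform(reference, generated LexFunc, inputs []string) error {
	for _, in := range inputs {
		want, got := reference(in), generated(in)
		if !reflect.DeepEqual(want, got) {
			return fmt.Errorf("divergence on %q: reference=%v generated=%v", in, want, got)
		}
	}
	return nil
}

func main() {
	// Two toy lexers: the "generated" one forgets to lowercase,
	// standing in for a real feature gap such as missing lookahead.
	runtime := LexFunc(func(s string) []Token {
		var toks []Token
		for _, w := range strings.Fields(s) {
			toks = append(toks, Token{"Ident", strings.ToLower(w)})
		}
		return toks
	})
	generated := LexFunc(func(s string) []Token {
		var toks []Token
		for _, w := range strings.Fields(s) {
			toks = append(toks, Token{"Ident", w})
		}
		return toks
	})
	fmt.Println(conform(runtime, generated, []string{"Hello world"}))
}
```

Running the same corpus through both lexers means any feature added only to the runtime lexer shows up as a diff rather than going unnoticed.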

Problem 2: ergonomics

There are also some ergonomic issues with generating the code. Specifically, having to have an implementation of the runtime code lying around as the "source of truth".

Proposal: serialisable form for both lexers and parse trees

  • Serialisable lexer
  • Create participle gen lexer
  • Serialisable parser
  • Create participle gen parser

One solution to this is for the code generators to be decoupled completely from the code. The lexer and parser would be extended to be JSON marshallable and the code generators would become standalone binaries that could ingest this serialised form and output code. This might be non-trivial for the parser because it is tightly coupled to reflection - TBD.

Another option would be to make standalone binaries that parse the Go code directly, making the Go AST and compile-time type system the intermediate form. This would be much more complicated though.

@alecthomas
Owner Author

This has been implemented for the lexer in bfe7c69.

@klondikedragon
Contributor

@alecthomas - This is a nice separation to have!

I've tested the update in bfe7c69; it looks like there is one more issue in the code generated from the serialized JSON definition of the lexer. Imagine that the lexer codegen was invoked using something like:

~/go/bin/participle gen lexer --name GeneratedLexer lexer < internal/lexer/lexer.json | gofmt > internal/lexer/lexer_generated.go

And imagine that the lexer definition had included some rules that looked like:

	"AllStrings": {
		{Name: `DoubleQuoteStringBegin`, Pattern: `"`, Action: lexer.Push(`DoubleQuoteString`)},
		{Name: `SingleQuoteStringBegin`, Pattern: `'`, Action: lexer.Push(`SingleQuoteString`)},
		{Name: `BacktickStringBegin`, Pattern: "`", Action: lexer.Push(`BacktickString`)},
	},

The codegen template that emits the code to push a new lexer state intends to refer to the state struct name derived from the name specified on the command line (e.g., GeneratedLexer in the above example):
bfe7c69#diff-5119a11cbfb79273b05b2942dfb351ab4bf736ca7cf91d773f360e75bcb07eb4R173

However, the template instead builds the state struct name from the name of the rule rather than from the top-level struct name. For example, it generates:

	switch state.name {
	case "AllStrings":
		if match := matchDoubleQuoteStringBegin(l.s, l.p); match[1] != 0 {
			sym = -22
			groups = match[:]
			l.states = append(l.states, lexerDoubleQuoteStringBeginState{name: "DoubleQuoteString"})
		} else if match := matchSingleQuoteStringBegin(l.s, l.p); match[1] != 0 {
			sym = -23
			groups = match[:]
			l.states = append(l.states, lexerSingleQuoteStringBeginState{name: "SingleQuoteString"})
		} else if match := matchBacktickStringBegin(l.s, l.p); match[1] != 0 {
			sym = -24
			groups = match[:]
			l.states = append(l.states, lexerBacktickStringBeginState{name: "BacktickString"})
		}

In all cases in this example, the proper name of the struct would have been lexerGeneratedLexerState.

After manually making this change, the generated code compiles as before.

I've also been debugging why the generated lexer isn't producing 100% accurate results vs the reflection lexer. It seems it may not handle the lookahead word boundary \b quite the same, and in another case it may not match as greedily in a regexp with a complex set of nested non-capturing groups. I hope to have some simple rules that highlight the gaps in the codegen lexer shortly.

@klondikedragon
Contributor

@alecthomas - here's a one-liner fix so the generated lexer code with multiple states will now compile: #268

I'm working on some tests to add that will highlight some of the issues I've found in the generated lexer and maybe fixes for them if not too complicated.

@alecthomas
Owner Author

Magic, thanks!

@alecthomas
Owner Author

Re: your last point, yes, the generated lexer has already diverged in functionality. I need to come up with a way for tests to pick up this divergence in future.

@klondikedragon
Contributor

@alecthomas - Check out #269, which is my attempt to have the current lexer test suite also cover generated lexers, and to make it easy to add new test cases that exercise both the reflection lexer and generated lexer without much duplicate code. The generated lexer is so much faster than the reflection lexer that I'm hoping this will make it easier to maintain at least a subset of features in the generated lexer. Participle has unparalleled developer ergonomics, and if we can get the generated lexer & parser production-ready, it will have leading performance as well!

@alecthomas
Owner Author

While I very much appreciate the effort, and I definitely don't want to discourage you from further contributions, I would prefer something a bit simpler to maintain, and a bit more targeted. I'm imagining something like a single "specification" lexer that checks against a spec, and cross-checks a generated lexer against the runtime lexer. I'm happy to flesh this out further if you are interested in contributing it.

@klondikedragon
Contributor

If I understand what you're envisioning correctly, it would be a single set of lexer rules that tries to exercise all of the different features of the lexer, plus unit tests that use that one generated lexer to cover all parts of the lexer? That seems possible, but to me harder to maintain than the proposed approach, as it would require a different set of unit tests than what is currently there (the current unit tests each start from a different lexer targeting one specific element).

Perhaps I'm missing how a single generated test lexer might be easier to maintain? Right now running the Makefile will immediately regenerate all of the JSON files and all of the generated test lexers, and the generated lexers also match the current unit tests, so they stay in sync as the unit tests change. If there were one specification lexer that tried to cover all parts of the lexer, it seems that would duplicate a lot of the work of the current test suite and would also have to be maintained as the lexer evolves.

Having the goal of easy maintenance makes a lot of sense to keep time burden of any changes low. I'm interested in helping and open to ideas how to help in that direction!

@alecthomas
Owner Author

Right now the lexer tests are pretty ad-hoc - it's basically arbitrary whether a particular feature is exercised or not. The new lexer would be a more systematic approach to this, and once complete, we could remove overlapping tests from the existing suite. I would envisage that there would be only one test input that tests every feature of the lexer, though that might be a bit hard to debug, so it might not be a great idea. It would need some experimentation.

I think it would also be good if the proposed test system generated the codegenned lexer on the fly for each test run, probably to a temporary directory, so that no manual steps are required when updating the lexer/tests. This would also function as a kind of integration test for the code generation itself.

@alecthomas
Owner Author

I'll have a bit of a poke around tonight.

alecthomas added a commit that referenced this issue Sep 27, 2022
The goal is to have a single lexer definition that exercises all the
functionality of the stateful lexer and generated equivalent.

See #264
@alecthomas
Owner Author

This is basically what I had in mind. Interestingly, these tests shouldn't attempt to capture validation errors (i.e. malformed stateful lexers); those should remain in the existing test suite.

@alecthomas
Owner Author

Actually, I think these would more correctly be called conformance tests.

alecthomas added a commit that referenced this issue Sep 27, 2022
The goal is to have a single lexer definition that exercises all the
functionality of the stateful lexer and generated equivalent.

See #264
@klondikedragon
Contributor

@alecthomas - Your approach that will regenerate the lexer as part of the test is certainly nice. And the use of build tags to avoid causing the main package build to break if the generated lexer is missing/broken is also very nice.

alecthomas added a commit that referenced this issue Sep 28, 2022
The goal is to have a single lexer definition that exercises all the
functionality of the stateful lexer and generated equivalent.

See #264
@klondikedragon
Contributor

@alecthomas - Anything I might be able to do to help with the conformance tests & generated lexer? In the draft PR it looks like you were able to make some substantial progress on the gen lexer supporting backreferences. If that was finished up (it looks like pretty extensive modifications), I could add some tests for \b and try to fix that, and possibly some things I've found around grouping.

@alecthomas
Owner Author

Hey sorry for the delay in responding. I did make a bit of progress on backrefs, as it turns out I had prepared for them in the generated lexer but for some reason didn't complete it. I didn't have time to completely finish support, so I've disabled the test for now in order to get that PR in.

Now that it's in, any further improvements you'd like to make would be most welcome!

@alecthomas
Owner Author

I would also really like to come up with some automated way to catch these kinds of issues in the future. That is, new features being added to the runtime lexer that aren't also added to the generated lexer. However, I can't think of a reliable way to do that, so for now we'll just have to try and remember to add new features to the conformance tests.

@klondikedragon
Contributor

@alecthomas - see what you think of the approach in #274. It just adds a real basic test as a first example to see if you are OK with the direction.

Hopefully the community will add extensively to the conformance tests over time, and as the generated lexer (and future gen parser) are incorporated into other projects, their unit tests will catch problems early too. Any problems that are found can then be contributed back as specific tests in the conformance suite. My project has hundreds of test cases over its specific lexer/parser, which are catching some specific issues I'll be adding (and trying to fix if possible) to the conformance suite, and hopefully lots more projects will do the same as the gen lexer becomes more ready.

So it feels a little bit like a chicken-and-egg problem, but since the gen lexer is getting good ergonomics now, hopefully it will see more adoption, better coverage, and so on!

@klondikedragon
Contributor

I've updated #274 to fix 3 different bugs in the generated lexer and added conformance tests that cover each of these cases. See what you think.

@petee-d
Contributor

petee-d commented Nov 8, 2022

@alecthomas can you give a more specific example of what adopting the idea of serializing the parser grammar nodes would look like? The reflection-related stuff would indeed be tricky to serialize. Trying to serialize every bit of information about the reflected types the generator might possibly need is probably not only too fragile, but may even be impossible in some cases (custom.parseFn).

I'm not sure such a large effort would actually improve usability & convenience anyway... Serializing the grammar to JSON would still be a very manual step requiring the user to code something. True, it would be something very common (JSON serialization and writing it to a file), but I feel like adding some tooling to easily generate the code & store it in a file by calling some generic function (or a method on the built grammar) would be less work for the user.

@alecthomas
Owner Author

Another option would be to make standalone binaries that parse the Go code directly, making the Go AST and compile-time type system the intermediate form. This would be much more complicated though.

Having now experimented with this some more (go/types and golang.org/x/tools/go/packages specifically), I find this more and more appealing. I have not fully thought through how this would work in practice.

Related, my idealistic perfect solution would be that there wouldn't be any runtime dependencies on Participle at all. Whether that's feasible or not, I'm not sure, but with a generated lexer and generated parser, the runtime dependencies would be fairly minimal. Probably only the PeekingLexer, Token, and Error types, at that point.

@petee-d
Contributor

petee-d commented Dec 3, 2022

I've played with go/packages too and I'm less optimistic it can be done. Maybe it can, but I'm not sure the user experience for generating code would be significantly better.

Regarding the other point, I also think it would be cool to minimize the dependencies of generated code on the participle package. On the other hand, I also quite like the approach I took, where enabling the generated code is just a matter of passing an option to Build and then being able to use the existing Parse* methods, which already take care of a lot of stuff the generated code doesn't.

@alecthomas
Owner Author

Regarding the other point, I also think it would be cool to minimize the dependencies of generated code on the participle package. On the other hand, I also quite like the approach I took, where enabling the generated code is just a matter of passing an option to Build and then being able to use the existing Parse* methods, which already take care of a lot of stuff the generated code doesn't.

I think let's just go with this approach. We can always refine it or alter approaches later.
