
Streamline long term maintenance of lexer/parser code generators #264

Open
2 of 6 tasks
alecthomas opened this issue Aug 14, 2022 · 22 comments

Comments

@alecthomas
Owner

alecthomas commented Aug 14, 2022

I have some concerns about the long-term maintenance of both the lexer and parser code generators, which we should solve (or at least not make worse) before getting these changes into master.

Problem 1: divergence

My first concern is that as the runtime code evolves, it diverges from the code generators. This has already occurred with the lexer, where the generated code does not handle lookahead. There will need to be some tests to detect this somehow. For the parser, enumerating the nodes and ensuring they're all handled by the generator (somehow) might work.

  • Conformance tests for lexer
  • Conformance tests for parser
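A conformance harness could be as simple as feeding identical inputs to both implementations and diffing the token streams. A minimal sketch of that idea follows; the `Token` and `LexFunc` types here are simplified stand-ins, not participle's actual API:

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Token is a simplified stand-in for participle's lexer.Token.
type Token struct {
	Type  string
	Value string
}

// LexFunc tokenises an input; both the runtime (reflection) lexer and
// the generated lexer would be wrapped to satisfy this shape.
type LexFunc func(input string) []Token

// conform feeds the same inputs to the reference lexer and the
// generated lexer and reports the first divergence, if any.
func conform(reference, generated LexFunc, inputs []string) error {
	for _, in := range inputs {
		want, got := reference(in), generated(in)
		if !reflect.DeepEqual(want, got) {
			return fmt.Errorf("divergence on %q: reference=%v generated=%v", in, want, got)
		}
	}
	return nil
}

func main() {
	// Two toy lexers: the "generated" one forgets to lowercase,
	// standing in for a real feature gap such as missing lookahead.
	runtime := LexFunc(func(s string) []Token {
		var toks []Token
		for _, w := range strings.Fields(s) {
			toks = append(toks, Token{"Ident", strings.ToLower(w)})
		}
		return toks
	})
	generated := LexFunc(func(s string) []Token {
		var toks []Token
		for _, w := range strings.Fields(s) {
			toks = append(toks, Token{"Ident", w})
		}
		return toks
	})
	fmt.Println(conform(runtime, generated, []string{"Hello world"}))
}
```

Running the same corpus through both lexers means any feature added only to the runtime lexer shows up as a diff rather than going unnoticed.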

Problem 2: ergonomics

There are also some ergonomic issues with generating the code. Specifically, having to have an implementation of the runtime code lying around as the "source of truth".

Proposal: serialisable form for both lexers and parse trees

  • Serialisable lexer
  • Create participle gen lexer
  • Serialisable parser
  • Create participle gen parser

One solution to this is for the code generators to be decoupled completely from the code. The lexer and parser would be extended to be JSON marshallable and the code generators would become standalone binaries that could ingest this serialised form and output code. This might be non-trivial for the parser because it is tightly coupled to reflection - TBD.

Another option would be to make standalone binaries that parse the Go code directly, making the Go AST and compile-time type system the intermediate form. This would be much more complicated though.

@alecthomas
Owner Author

This has been implemented for the lexer in bfe7c69.

@klondikedragon
Contributor

@alecthomas - This is a nice separation to have!

I've tested the update in bfe7c69; it looks like there is one more issue in the code generated from the serialized JSON definition of the lexer. Imagine that the lexer codegen was invoked using something like:

~/go/bin/participle gen lexer --name GeneratedLexer lexer < internal/lexer/lexer.json | gofmt > internal/lexer/lexer_generated.go

And imagine that the lexer definition had included some rules that looked like:

	"AllStrings": {
		{Name: `DoubleQuoteStringBegin`, Pattern: `"`, Action: lexer.Push(`DoubleQuoteString`)},
		{Name: `SingleQuoteStringBegin`, Pattern: `'`, Action: lexer.Push(`SingleQuoteString`)},
		{Name: `BacktickStringBegin`, Pattern: "`", Action: lexer.Push(`BacktickString`)},
	},

The codegen template that emits the code to push a new lexer state intends to refer to the state struct name derived from the name specified on the command line (e.g., GeneratedLexer in the above example):
bfe7c69#diff-5119a11cbfb79273b05b2942dfb351ab4bf736ca7cf91d773f360e75bcb07eb4R173

However, the template instead builds the state struct name from the name of the rule rather than from the top-level struct name. For example, it generates:

	switch state.name {
	case "AllStrings":
		if match := matchDoubleQuoteStringBegin(l.s, l.p); match[1] != 0 {
			sym = -22
			groups = match[:]
			l.states = append(l.states, lexerDoubleQuoteStringBeginState{name: "DoubleQuoteString"})
		} else if match := matchSingleQuoteStringBegin(l.s, l.p); match[1] != 0 {
			sym = -23
			groups = match[:]
			l.states = append(l.states, lexerSingleQuoteStringBeginState{name: "SingleQuoteString"})
		} else if match := matchBacktickStringBegin(l.s, l.p); match[1] != 0 {
			sym = -24
			groups = match[:]
			l.states = append(l.states, lexerBacktickStringBeginState{name: "BacktickString"})
		}

In all cases in this example, the proper name of the struct would have been lexerGeneratedLexerState.

After manually making this change, the generated code compiles as before.

I've also been debugging why the generated lexer isn't producing 100% accurate results vs the reflection lexer. It seems it may not handle the lookahead word boundary \b quite the same, and in another case it may not match as greedily in a regexp with a complex set of nested non-capturing groups. I hope to have some simple rules that highlight the gaps in the codegen lexer shortly.

@klondikedragon
Contributor

@alecthomas - here's a one-liner fix so the generated lexer code with multiple states will now compile: #268

I'm working on some tests to add that will highlight some of the issues I've found in the generated lexer and maybe fixes for them if not too complicated.

@alecthomas
Owner Author

Magic, thanks!

@alecthomas
Owner Author

Re: your last point, yes, the generated lexer has already diverged in functionality. I need to come up with a way for tests to pick up this divergence in future.

@klondikedragon
Contributor

@alecthomas - Check out #269, which is my attempt to have the current lexer test suite also cover generated lexers, and to make it easy to add new test cases that exercise both the reflection lexer and generated lexer without much duplicate code. The generated lexer is so much faster than the reflection lexer that I'm hoping this will make it easier to maintain at least a subset of features in the generated lexer. Participle has unparalleled developer ergonomics, and if we can get the generated lexer & parser production-ready, it will have leading performance as well!

@alecthomas
Owner Author

While I very much appreciate the effort, and I definitely don't want to discourage you from further contributions, I would prefer something a bit simpler to maintain, and a bit more targeted. I'm imagining something like a single "specification" lexer that checks against a spec, and cross-checks a generated lexer against the runtime lexer. I'm happy to flesh this out further if you are interested in contributing it.

@klondikedragon
Contributor

If I understand what you're envisioning correctly, it would be a single set of lexer rules that tries to exercise all of the different features of the lexer, plus unit tests that use that one generated lexer to cover all parts of the lexer? That seems possible, but to me harder to maintain than the proposed approach, as it would require a different set of unit tests than what is currently there (the current unit tests each start from a different lexer targeting one specific element).

Perhaps I'm missing how a single generated test lexer might be easier to maintain? Right now running the Makefile will immediately regenerate all of the JSON files and all of the generated test lexers, and the generated lexers also match the current unit tests, so they stay in sync as the unit tests change. If there were one specification lexer that tried to cover all parts of the lexer, it seems that would duplicate a lot of the work of the current test suite and would also have to be maintained as the lexer evolves.

Having the goal of easy maintenance makes a lot of sense to keep time burden of any changes low. I'm interested in helping and open to ideas how to help in that direction!

@alecthomas
Owner Author

Right now the lexer tests are pretty ad-hoc - it's basically arbitrary whether a particular feature is exercised or not. The new lexer would be a more systematic approach to this, and once complete, we could remove overlapping tests from the existing suite. I would envisage that there would be only one test input that tests every feature of the lexer, though that might be a bit hard to debug, so it might not be a great idea. It would need some experimentation.

I think it would also be good if the proposed test system generated the codegenned lexer on the fly for each test run, probably to a temporary directory, so that no manual steps are required when updating the lexer/tests. This would also function as a kind of integration test for the code generation itself.

@alecthomas
Owner Author

I'll have a bit of a poke around tonight.

alecthomas added a commit that referenced this issue Sep 27, 2022
The goal is to have a single lexer definition that exercises all the
functionality of the stateful lexer and generated equivalent.

See #264
@alecthomas
Owner Author

This is basically what I had in mind. Interestingly, these tests shouldn't attempt to capture validation errors (i.e. malformed stateful lexers); those should remain in the existing test suite.

@alecthomas
Owner Author

Actually, I think these would more correctly be called conformance tests.

alecthomas added a commit that referenced this issue Sep 27, 2022
The goal is to have a single lexer definition that exercises all the
functionality of the stateful lexer and generated equivalent.

See #264
@klondikedragon
Contributor

@alecthomas - Your approach that will regenerate the lexer as part of the test is certainly nice. And the use of build tags to avoid causing the main package build to break if the generated lexer is missing/broken is also very nice.

alecthomas added a commit that referenced this issue Sep 28, 2022
The goal is to have a single lexer definition that exercises all the
functionality of the stateful lexer and generated equivalent.

See #264
@klondikedragon
Contributor

@alecthomas - Anything I might be able to do to help with the conformance tests & generated lexer? In the draft PR it looks like you were able to make some substantial progress on the gen lexer supporting backreferences. If that was finished up (it looks like pretty extensive modifications), I could add some tests for \b and try to fix that, and possibly some things I've found around grouping.

@alecthomas
Owner Author

Hey sorry for the delay in responding. I did make a bit of progress on backrefs, as it turns out I had prepared for them in the generated lexer but for some reason didn't complete it. I didn't have time to completely finish support, so I've disabled the test for now in order to get that PR in.

Now that it's in, any further improvements you'd like to make would be most welcome!

@alecthomas
Owner Author

I would also really like to come up with some automated way to catch these kinds of issues in the future. That is, new features being added to the runtime lexer that aren't also added to the generated lexer. However, I can't think of a reliable way to do that, so for now we'll just have to try and remember to add new features to the conformance tests.

@klondikedragon
Contributor

@alecthomas - see what you think of the approach in #274. It just adds a real basic test as a first example to see if you are OK with the direction.

Hopefully the community will add extensively to the conformance tests over time, and as the generated lexer (and future gen parser) are incorporated into other projects, their unit tests will catch problems early too. Any problems that are found can then be contributed back as specific tests in the conformance suite. My project has hundreds of test cases over its specific lexer/parser, which are catching some specific issues I'll be adding (and trying to fix if possible) to the conformance suite, and hopefully lots more projects will do the same as the gen lexer becomes more ready.

So it feels a little bit like a chicken-and-egg problem, but since the gen lexer is getting good ergonomics now, hopefully it will see more adoption, better coverage, and so on!

@klondikedragon
Contributor

I've updated #274 to fix 3 different bugs in the generated lexer and added conformance tests that cover each of these cases. See what you think.

@petee-d
Contributor

petee-d commented Nov 8, 2022

@alecthomas can you give a more specific example of what adopting the idea of serializing the parser grammar nodes would look like? The reflection-related stuff would indeed be tricky to serialize. Trying to serialize every bit of information about the reflected types the generator might possibly need is probably not only too fragile, but may even be impossible in some cases (custom.parseFn).

I'm not sure such a large effort would actually improve usability & convenience anyway... Serializing the grammar to JSON would still be a very manual step requiring the user to code something. True, it would be something very common (JSON serialization and writing it to a file), but I feel like adding some tooling to easily generate the code & store it in a file by calling some generic function (or a method on the built grammar) would be less work for the user.

@alecthomas
Owner Author

Another option would be to make standalone binaries that parse the Go code directly, making the Go AST and compile-time type system the intermediate form. This would be much more complicated though.

Having now experimented with this some more (go/types and golang.org/x/tools/go/packages specifically), I find this more and more appealing. I have not fully thought through how this would work in practice.

Related, my idealistic perfect solution would be that there wouldn't be any runtime dependencies on Participle at all. Whether that's feasible or not, I'm not sure, but with a generated lexer and generated parser, the runtime dependencies would be fairly minimal. Probably only the PeekingLexer, Token, and Error types, at that point.

@petee-d
Contributor

petee-d commented Dec 3, 2022

I've played with go/packages too and I'm less optimistic it can be done. Maybe it can, but I'm not sure the user experience for generating code would be significantly better.

Regarding the other point, I also think it would be cool to minimize the dependencies of generated code on the participle package. On the other hand, I also quite like the approach I took, where enabling the generated code is just a matter of passing an option to Build and then being able to use the existing Parse* methods, which already take care of a lot of stuff the generated code doesn't.

@alecthomas
Owner Author

Regarding the other point, I also think it would be cool to minimize the dependencies of generated code on the participle package. On the other hand, I also quite like the approach I took, where enabling the generated code is just a matter of passing an option to Build and then being able to use the existing Parse* methods, which already take care of a lot of stuff the generated code doesn't.

I think let's just go with this approach. We can always refine it or alter approaches later.
