
Ideas for gen lexer supporting "give back" for quest #276

Open
klondikedragon opened this issue Oct 28, 2022 · 9 comments

Comments

@klondikedragon
Contributor

After #275 my lexer test suite was still failing. It looks like it's because the Quest regex operator (and related) doesn't support "giving back". For example, imagine you have the pattern for matching a three- or four-capital-letter time zone: [A-Z][A-Z][A-Z]?T -- the generated lexer currently will not match EST, because the current quest operator implementation is fully greedy (it never gives back). The first three terms consume E, S, and T, which advances the position past the trailing T, so the final literal T fails to match. The matcher doesn't backtrack to try the case where the [A-Z]? term matched nothing, which would have allowed the overall pattern to match.

The current generated code looks something like:

// [A-Z][A-Z][A-Z]?T (Concat)
l3 := func(s string, p int) int {
	if p = l0(s, p); p == -1 { return -1 }
	if p = l0(s, p); p == -1 { return -1 }
	if p = l1(s, p); p == -1 { return -1 }
	if p = l2(s, p); p == -1 { return -1 }
	return p
}
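The failure mode can be reproduced with a self-contained toy version of the generated matchers (the l0/l1/l2 names are taken from the snippet above; their bodies here are my assumptions about what the generator emits for this pattern):

```go
package main

import "fmt"

// l0 matches a single capital letter [A-Z], returning the new position or -1.
func l0(s string, p int) int {
	if p < len(s) && s[p] >= 'A' && s[p] <= 'Z' {
		return p + 1
	}
	return -1
}

// l1 is the fully greedy quest [A-Z]?: it consumes a capital letter if one is
// present and never "gives back".
func l1(s string, p int) int {
	if np := l0(s, p); np != -1 {
		return np
	}
	return p
}

// l2 matches the literal T.
func l2(s string, p int) int {
	if p < len(s) && s[p] == 'T' {
		return p + 1
	}
	return -1
}

// l3 is the concatenation [A-Z][A-Z][A-Z]?T as currently generated.
func l3(s string, p int) int {
	if p = l0(s, p); p == -1 {
		return -1
	}
	if p = l0(s, p); p == -1 {
		return -1
	}
	if p = l1(s, p); p == -1 {
		return -1
	}
	if p = l2(s, p); p == -1 {
		return -1
	}
	return p
}

func main() {
	fmt.Println(l3("ACDT", 0)) // 4: the greedy match happens to succeed
	fmt.Println(l3("EST", 0))  // -1: the quest ate the T, so the literal T fails
}
```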

It seems the implementation to support "give back" could be pretty complicated, so I wanted to discuss possible directions.

Ideas:

  1. The least efficient but simplest would be to rewrite the regexp to remove the quest operator, replacing the entire expression it was concatenated within with two (or more) variants. e.g., [A-Z][A-Z][A-Z]?T would be rewritten internally to [A-Z][A-Z][A-Z]T|[A-Z][A-Z]T before going through codegen. This could quickly lead to an exponential explosion in the number of alternates for any reasonably complex regex, but the generated code itself would never have to "backtrack". It's simple to implement but could be horribly slow, so it doesn't seem like a good option.
  2. Perhaps change the generated functions representing expression (e.g., l3 func above) to return a list of positions, rather than a single position (or an empty list instead of -1). Then the caller would explore all of the possibilities starting from that point (or if there was no match because it got an empty list, would stop and return an empty list right there). It feels like this direction might work, but would require some memory allocation with the returned lists (maybe there is a way to optimize by passing a shared list or using an allocation pool). A sketch of this might look like:
// [A-Z][A-Z][A-Z]?T (Concat)
l3 := func(s string, p int) []int {
	matchedPositions := make([]int, 0)
	for _, nextP := range l0(s, p) {
		for _, nextP := range l0(s, nextP) {
			for _, nextP := range l1(s, nextP) {
				for _, nextP := range l2(s, nextP) {
					matchedPositions = append(matchedPositions, nextP)
				}
			}
		}
	}
	return matchedPositions
}

This method could probably be optimized to recognize when a given function can only ever return a single position and, if so, avoid the allocation overhead. e.g., it could look more like:

l3 := func(s string, p int) []int {
	matchedPositions := make([]int, 0)
	if p = l0(s, p); p != -1 {
		if p = l0(s, p); p != -1 {
			for _, p := range l1(s, p) {
				if p = l2(s, p); p != -1 {
					matchedPositions = append(matchedPositions, p)
				}
			}
		}
	}
	return matchedPositions
}

This variant is nice because it can still use p regardless of whether it's in a loop (the previous function had multiple possibilities) or just an if statement (the previous function had just one possibility). With an allocation pool for the matchedPositions array, the overhead might not actually be that bad -- memory complexity would be proportional to the number of open "branches" at any given time.

  3. Perhaps some technique where the caller recognizes that the function it's calling represents a quest, and generates the different "paths" in the code (duplicating code for the entire "branch" for the remainder of that concatenation) -- this seems hard to implement, and code size could explode for complex expressions.
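Idea 2 above can be prototyped as a self-contained toy. The letter/questLetter/litT helpers are hypothetical stand-ins for the generated l functions, but they show the multi-position concatenation letting the quest give back so that EST matches:

```go
package main

import "fmt"

// letter matches [A-Z] and returns at most one continuation position.
func letter(s string, p int) []int {
	if p < len(s) && s[p] >= 'A' && s[p] <= 'Z' {
		return []int{p + 1}
	}
	return nil
}

// questLetter is [A-Z]?: it returns BOTH continuations (skip, then consume),
// so a later failure can fall back to the shorter match.
func questLetter(s string, p int) []int {
	return append([]int{p}, letter(s, p)...)
}

// litT matches the literal T.
func litT(s string, p int) []int {
	if p < len(s) && s[p] == 'T' {
		return []int{p + 1}
	}
	return nil
}

// match3or4 is [A-Z][A-Z][A-Z]?T with multi-position concatenation.
func match3or4(s string, p int) []int {
	var out []int
	for _, p0 := range letter(s, p) {
		for _, p1 := range letter(s, p0) {
			for _, p2 := range questLetter(s, p1) {
				out = append(out, litT(s, p2)...)
			}
		}
	}
	return out
}

func main() {
	fmt.Println(match3or4("EST", 0))  // [3]: the quest gave back
	fmt.Println(match3or4("ACDT", 0)) // [4]: the quest consumed a letter
	fmt.Println(match3or4("ABC", 0))  // []: no final T either way
}
```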

What do you think? Any ideas on better directions?

@alecthomas
Owner

Want to add a failing test case in a PR?

@alecthomas
Owner

I'm very tired, but instead of rolling back, perhaps if the element preceding the T matches, we could check if it was equal to T and, if so, skip the T as if it had matched. I'm not sure how this would work with capture groups.

@alecthomas
Owner

alecthomas commented Oct 28, 2022

In your example 1, the ? should only match against the previous range, not all three, I think? So it would be [A-Z]?T => T|[A-Z]T AFAICT? That doesn't seem unreasonable.

Edit: I think T would get deduplicated by the code generator too

@alecthomas
Owner

Oh hah, so interestingly the compiler optimises this away completely. This:

[A-Z][A-Z](T|[A-Z])

Is optimised to this:

[A-Z][A-Z]([A-Z])

@klondikedragon
Contributor Author

Thanks for brainstorming about this! Note that [A-Z][A-Z][A-Z]?T is not actually the same as [A-Z][A-Z](T|[A-Z]) -- the former matches both EST and ACDT (and would not match something like ABC), whereas the latter would match EST but not ACDT, and would also match things that don't end in T (e.g., ABC, which is not desired).

I was giving a simple example here, but in general it gets harder when the expression the quest operator applies to is really complex. Imagine a pattern like aaa(bbb(eee)?)?(ccc)?bbb(eee)?ddd -- this could match something like aaabbbddd (skipping all quest expressions) or aaabbbeeeddd (skipping the first three quest expressions but greedily matching the last one). A quest expression can always be expanded using |, but that starts to multiply the cases. e.g., aaa(bbb(eee)?)?(ccc)?bbb(eee)?ddd could be rewritten to aaa((((bbb(eee)|bbb))(ccc)bbb((eee)ddd|ddd))|(((bbb(eee)|bbb))bbb((eee)ddd|ddd)))|((ccc)bbb((eee)ddd|ddd)|bbb((eee)ddd|ddd)) (pretty sure I got that right) -- it gets exponentially worse as the number of quest operators increases. Maybe if common subexpressions were implemented using the same l function, there wouldn't be "code explosion" (or maybe you're saying the Go compiler is smart enough to recognize that differently named functions with identical bodies compile down to a single function? That would certainly make it much easier, since the codegen could be lazy and let the compiler do the hard work). I need to give some thought to whether it's possible to hack this method together for a quick benchmark. It would probably be a preprocessing step that transforms the parsed regex ("expanding" any quest operators) before passing it into the main regex codegen phase.
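As a sanity check of the nested-quest example above, Go's stdlib regexp package (which fully explores alternatives, unlike the generated lexer) confirms which strings the pattern accepts:

```go
package main

import (
	"fmt"
	"regexp"
)

// Anchored version of the nested-quest pattern from the example above.
var nested = regexp.MustCompile(`^aaa(bbb(eee)?)?(ccc)?bbb(eee)?ddd$`)

func main() {
	fmt.Println(nested.MatchString("aaabbbddd"))    // true: every quest skipped
	fmt.Println(nested.MatchString("aaabbbeeeddd")) // true: last quest taken
	fmt.Println(nested.MatchString("aaacccbbbddd")) // true: (ccc)? taken
	fmt.Println(nested.MatchString("aaaddd"))       // false: the plain bbb is mandatory
}
```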

I want to see if I can hack out approach number 2 just to benchmark it and see how it compares to the current code. I believe the complexity/efficiency could be very similar to the current generated lexer when there are no quest operators, because it could still use the optimized case where an l function returns a single integer position whenever it's known to have only 0 or 1 possibilities (fully greedy match). Where there are potential non-greedy matches (e.g., a quest operator), it would allocate/use memory linear in the number of potential paths explored (one integer per path). That doesn't seem too bad, especially if an allocation pool is used. On the other hand, the loop structure seems like it would not "short circuit" early if a match was found at the end of the expression, and that could make it run quite a bit slower than approach 1.

@alecthomas
Owner

Thanks for brainstorming about this! Note that [A-Z][A-Z][A-Z]?T is not actually the same as [A-Z][A-Z](T|[A-Z]) -- the former can match both EST and ACDT (and would not match something like ABC) whereas the latter would only match EST and also would match more than things that end in T (e.g., it would match ABC which is not desired).

Yeah, I'm aware, it was a typo. The comment before contained the expression in question: [A-Z]?T => T|[A-Z]T

As for the approach, in general I prefer to do the simplest thing possible unless it's clear that it won't be viable. In this case I think it's exceedingly unlikely that there will be complex expressions involving multiple ? in a tokeniser, so I'm not that concerned about combinatorial explosions. The expression aaa(bbb(eee)?)?(ccc)?bbb(eee)?ddd results in a ~260-line match function which, while not great, is pretty unlikely to occur, and even then is probably acceptable.

To be honest, the generated code is super naive. I initially wanted to generate a table-driven DFA matcher like re2dfa, but that would have been quite a bit more work than the current implementation. Even more ideally I would have just used that code directly, but unfortunately it is GPL3 :(

All that is to say, I don't want to invest a significant amount of effort on this implementation. If a lot of work is to be put in, it would be better to put it into a DFA-based implementation.

@klondikedragon
Contributor Author

In this case I think it's exceedingly unlikely that there will be complex expressions involving multiple ? in a tokeniser, so I'm not that concerned about combinatorial explosions.

That should probably be true! In my case maybe I'm doing too much in the tokenizer vs. the parser, LOL :) I have a regular expression for a Time token that has 23+ quest operators in it, some on pretty large groups (it accepts times in lots of different formats as they might be written by a human).

However, it is actually rare that in this expression the quest operator ever needs to "give back" to achieve accuracy (future parts of the pattern match on distinct literals past the quest operator, so normally being fully greedy is perfectly OK and the fast thing to do) -- it would not be good to give up the huge performance/simplicity of a "pure greedy" quest operator implementation, just in the unlikely case that one of these might need to be non-greedy. Maybe for now it's just better to document that the quest operator does not give back in the generated lexer, and let users manually adjust their expressions accordingly? I will try to adjust my expressions manually to avoid reliance on any quest operator to give back, and see if that allows my test suite to pass fully.
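One way to do that manual adjustment for the time-zone example is to expand the quest into an explicit alternation, longer branch first, so neither branch ever needs to give back. This is checked here with Go's stdlib regexp; whether the generated lexer tries alternation branches in order is an assumption worth verifying:

```go
package main

import (
	"fmt"
	"regexp"
)

// Quest expanded into an explicit alternation with the longer branch first,
// so no "give back" is ever needed: each branch is a plain greedy concat.
var tz = regexp.MustCompile(`^(?:[A-Z][A-Z][A-Z]T|[A-Z][A-Z]T)`)

func main() {
	fmt.Println(tz.FindString("ACDT")) // ACDT (first branch)
	fmt.Println(tz.FindString("EST"))  // EST (falls through to second branch)
	fmt.Println(tz.FindString("ABC"))  // "" (no match: no trailing T)
}
```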

As an aside, it looks like the license for re2dfa doesn't put a license on generated code, only on the generator itself. Now that the participle lexer generator is separated from the core of the library itself, it might be possible to use re2dfa by the generator first generating an input file and feeding that to re2dfa with all of the relevant tokens, and then using the code generated by re2dfa in the code that the participle lexer generator generates. That would avoid any participle code being GPL'd, and the output would also be free of license constraints, while still building on re2dfa. Probably re2dfa would need some enhancements (maybe adding the notion of build tags etc.). To your point though, the lexer patterns should be simple and so all this complexity would perhaps be overkill. I'm super close to my test suite fully passing using the generated lexer so I'm hoping to start contributing some energy towards the generated parser next :)

@alecthomas
Owner

I have a regular expression for a Time token that has 23+ quest operators in it

😳 wow...I would love to see that if you don't mind.

I'm super close to my test suite fully passing using the generated lexer so I'm hoping to start contributing some energy towards the generated parser next :)

Awesome! That will require some thought I think, I'm still not sure what the best approach is to be honest.

@klondikedragon
Contributor Author

Sure, I'll DM you the regex :)

With hand-made changes to the lexer regex in a couple of places to eliminate any "non-greedy" quest operator, my lexer/parser test suite is now passing (about 300 tests)! I agree that most lexers should have simple regexps, so just clearly documenting that the generated lexer is fully greedy for quest operators seems like the way to go, vs. something more complex, harder to maintain, and slower. I'll submit a PR suggesting some additions to the readme about that.

Including the optimizations you just committed today, the generated lexer is seeing an 8x speedup in lexing, and still with zero allocs! (the 4 allocs in the genlexer benchmark I believe are just to allocate the initial lexer state)

BenchmarkReflectLexer-8            15061             80648 ns/op            3407 B/op        192 allocs/op
BenchmarkGeneratedLexer-8         106869             10629 ns/op             384 B/op          4 allocs/op

This is a huge improvement and definitely good enough for my app.

The approach taken by @petee-d for a generated parser, at least for my use case, gave an overall application speedup of 10x-15x and reduced allocations 15x-30x (some details given in a comment on #213). There are ergonomic issues raised in #264, and prep work to rebase was still in progress in #263, but it seems the general algorithm there is quite fast if those can be resolved. I haven't yet taken a detailed look at the code to understand the approach.
