
Add support for backreferences #132

Open · wants to merge 6 commits into main

Conversation

@benjie commented Sep 8, 2019

I'm trying to use moo to parse PostgreSQL syntax and everything's amazing except I can't parse the dollar-quoted strings because they need backreferences.

https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING

This PR adds backreference support.

This should be a completely non-breaking change; I've even preserved the old behaviour of throwing an error for capture groups when there are no backreferences.
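
To illustrate (my example, not the PR's tests; it only works with this PR applied, since stock moo rejects capture groups), a dollar-quoted string becomes a single rule, something like:

const moo = require('moo');

const lexer = moo.compile({
  // \1 refers back to the tag captured between the opening dollars; the PR
  // renumbers backreferences when it combines all the rules into one RegExp
  dollarString: {match: /\$(\w*)\$[^]*?\$\1\$/, lineBreaks: true},
  word: /\w+/,
  semi: ';',
  ws: /[ \t]+/,
  nl: {match: /\n/, lineBreaks: true},
});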

I'm used to my code being auto-formatted by prettier and/or ESLint so apologies if I've made any formatting errors.

@nathan (Collaborator) commented Sep 8, 2019

👎 This falls afoul of our commandment not to parse RegExps. The reason we don't is that there are lots of edge cases and compatibility oddities we don't want to test or deal with; i.e., it's extremely difficult to do correctly. Off the top of my head, your code treats the following incorrectly (each is easy to check in Node; see the snippet after the list):

  • /[\1]/ (matches SOH, U+0001)
  • /\1/ (matches SOH, U+0001)
  • /()[\1]/ (matches SOH, U+0001; your code changes it to match STX, U+0002).
  • /()\9/ (matches the number 9, U+0039)
  • /\0/u (matches NUL, U+0000; doesn't use Annex B octal escapes)
  • /()()()()()()()()()(x)\10/ (matches xx, U+0078 U+0078; actually is a backreference, but your code spuriously rejects it)
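
Checking these in Node (my snippet, not from the thread; all in non-unicode mode):

/\1/.test('\x01')                      // true: no capture groups, so \1 is an Annex B octal escape (SOH)
/()\9/.test('9')                       // true: \9 can never be an octal escape, so it falls back to a literal "9"
/()()()()()()()()()(x)\10/.test('xx')  // true: ten groups exist, so \10 really is a backreference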

@benjie (Author) commented Sep 8, 2019

That's fair; I wasn't comfortable even with having to work around the \\1 situation.

Do you have a better way of handling this particular issue? At first I was going to solve it using states: the plan was to match /\$([\w_][\w\d_]*)?\$/ (the match for which I shall refer to as "the phrase", which could be for example $FooBar$) and then drop into a different state that would skip over everything until it sees the exact same phrase again, whereupon it would exit back to the normal state. However, I couldn't figure out how to exit the new state when this specific phrase was seen again, since pop is a fixed boolean and the specific phrase we're looking for isn't known until parse time. I considered submitting a PR to add function forms of pop, push and next, but I didn't fully think this idea through, as backreferences seemed simpler at the time 🤦‍♂️.

In case you're not familiar with dollar quoting, here are some examples (a single backreference RegExp that matches them all is sketched after the list):

-- Empty string
select $$$$;

-- String with a dollar in it
select $$Blah $ blah$$;

-- Dollar on its own
select $_$$$_$;
-- or
select $anythinghere$$$anythinghere$;

-- String with `$$` in it
select $_$Blah $$ blah$_$;

-- Using a different tag so we can include both $$ and $_$
select $CustomTag$Blah $$ $_$ blah$CustomTag$;
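
A rough sketch of that RegExp (mine; the tag pattern is simplified):

const dollarQuoted = /\$([A-Za-z_]\w*)?\$([^]*?)\$\1\$/;

dollarQuoted.exec('select $_$$$_$;');
// → ['$_$$$_$', '_', '$']   (the dollar-on-its-own case)
dollarQuoted.exec('select $$Blah $ blah$$;');
// → ['$$Blah $ blah$$', undefined, 'Blah $ blah']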

(I'm very new to parsers/lexers/etc)

Any advice you can give would be very helpful.

@benjie (Author) commented Sep 8, 2019

This doesn't get around the "don't parse RegExps" commandment, but I've pushed another commit to address your examples:

  • /[\1]/ (matches SOH, U+0001)

NO LONGER MATCHED: we no longer look for backreferences if there aren't any capture groups; this should bring this back to being a non-breaking change.

  • /\1/ (matches SOH, U+0001)

NO LONGER MATCHED: I've made it so we only look for backreferences when there is a capture group, so we're not matching this any more.

ALSO ALREADY INVALID: this isn't valid in moo currently anyway, because it references the first capture group, which moo creates via reCapture. No octal escape whose number is at or below the number of rules you have is safe. I suggest a global "no octal escapes" rule for this reason; otherwise, adding a new rule might break existing octal escapes.

  • /()[\1]/ (matches SOH, U+0001; your code changes it to match STX, U+0002).

VALID: this will still cause issues, but the proposed "no octal escapes" rule should address it.

  • /()\9/ (matches the number 9, U+0039)

ALREADY INVALID: this won't match the number 9 if you have more than 8 rules; in that case it matches the moo-generated 9th capture group instead.

  • /\0/u (matches NUL, U+0000; doesn't use Annex B octal escapes)

NO LONGER MATCHED: we no longer match references that start with a zero (i.e. /\\0[0-9]*/)

  • /()()()()()()()()()(x)\10/ (matches xx, U+0078 U+0078; actually is a backreference, but your code spuriously rejects it)

FIXED: oops, yeah I wrote \d rather than \d+; fixed and added a test for this.


I'm not expecting you to merge this; the commandment is very sensible. Independent of this PR, though, I think it would be wise to advise users in the README not to use octal escapes.

@damoclark commented

Hi @benjie

This is a cosmic coincidence, because I too am writing a parser for PostgreSQL using Moo and nearley.

And I have run afoul of the exact same issue: dollar-quoted strings.

You wrote: "At first I was going to solve it using states: the plan was to match /\$([\w_][\w\d_]*)?\$/ (the match for which I shall refer to as "the phrase", which could be for example $FooBar$) and then drop into a different state that would skip over everything until it sees the exact same phrase again, whereupon it would exit back to the normal state."

I've been thinking about this problem, and I propose this type of approach, but done in a way that doesn't break the "do not parse RegExps" rule: omit, for such cases, the RegExp-merging optimisation ("Moo will compile 'em down to a single RegExp for performance").

Presently, you can match tokens using either literal strings or RegExps. What if we allowed the match property to accept a Function as well? Then, using closures, the function could return either a RegExp or a literal string computed from values in the closure's scope, which the Moo lexer would then use.

To capture values during lexing and make them available to the closure for subsequent matching, we could introduce a new class to the Moo API, e.g.:

class capture {
  constructor() {
    this.val = '';
  }

  set value(val) {
    this.val = val;
  }

  get value() {
    return this.val;
  }

  // The captured value with RegExp metacharacters escaped, safe to
  // interpolate into a new RegExp
  get escaped() {
    return this.val.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
  }

  toString() {
    return this.value;
  }
}

Moo could then be used something like the following (though it may not be strictly syntactically correct):

const moo = require('moo');

const dolq = '$';
const dolqlabel = /[A-Za-z\200-\377_]+?/;

const c = new moo.capture();

const lexer = moo.states({
  start: {
    dolqstart: {match: dolq, push: 'state_dolq'},
  },
  state_dolq: {
    // 'capture' passes the moo.capture instance so moo knows where to store
    // what 'match' matched, so it can be recalled later
    dolqlabel: {match: dolqlabel, capture: c},
    // once the label is read, move on to scan the string body
    dolqstop: {match: dolq, next: 'state_dolq_literal'},
  },
  state_dolq_literal: {
    dolq_stop: {match: () => `$${c}$`, pop: true},
    // RegExp match of any string up to, but not including, our captured
    // delimiter (its 'escaped' getter has RegExp-escaped it for us)
    dolq_literal: {match: () => new RegExp(`.*?(?=\\$${c.escaped}\\$)`), lineBreaks: true},
    // If there is no closing dollar quote, match everything left in the buffer
    dolq_literal_rest: {match: () => new RegExp(`.*?(?!\\$${c.escaped}\\$)`), lineBreaks: true},
  },
});

Obviously, with this implementation there will be a performance impact, because the RegExp-merging optimisation won't be possible.

From the source:

      var pat = reUnion(match.map(regexpOrLiteral))

and this source:

      parts.push(reCapture(pat))

These lines appear to implement the consolidation of all the matches, given as either RegExps or literal strings, into a single RegExp, which is then wrapped in a capture so the lexer can split up the results.

What I propose is, for the case of dynamic token matching, not to optimise the lexer into a single RegExp, but to leave it as an array of RegExps and execute them in turn, in a loop, until a match is found. If a rule's match is a Function, call it and use whatever it returns, either a literal string or a RegExp, to perform the next match.

This unoptimised algorithm would only be in play when dynamic lexing is to be performed and only for the scope of the state in which it occurs. In all other cases, the existing algorithm prevails.

The impact is probably quite small where it is used, because the dynamic matching would happen in a state higher up the stack than the main one, so the optimisation would only be lost while the lexer is in that state.

And a slower implementation of dynamic lexing is better than no implementation of dynamic lexing.
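
A minimal sketch of that fallback loop, with illustrative names (this is my sketch, not moo's actual internals):

function matchDynamic(rules, buffer, offset) {
  for (const rule of rules) {
    // Resolve function matchers at match time; strings and RegExps are used as-is
    const m = typeof rule.match === 'function' ? rule.match() : rule.match;
    const [source, flags] = m instanceof RegExp
      ? [m.source, m.flags.replace(/[gy]/g, '')]
      : [m.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&'), ''];
    const re = new RegExp(source, flags + 'y'); // sticky, so the match is anchored at offset
    re.lastIndex = offset;
    const result = re.exec(buffer);
    if (result) return {rule, text: result[0]};
  }
  return null; // no rule in this state matched at this offset
}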

Thoughts, @benjie and @nathan?

Damo.

@benjie (Author) commented Sep 10, 2019

I'm far too inexperienced at lexers to pass comment, sorry.

@nathan (Collaborator) commented Sep 11, 2019

@tjvr thoughts? I really don't think we should use multiple RegExps at runtime, but parsing RegExps to find and alter backreferences still seems error-prone. However, it might be worthwhile here, because this type of literal is quite common in programming languages.

All the same, if I remember correctly, we dropped support for capture groups because the use case didn't justify the additional complexity, and this adds even more complexity.

@tjvr (Collaborator) commented Sep 11, 2019

All good points @nathan.

You're right that we dropped support for capture groups because they added complexity, but also because we added value transforms, which solve the same problem. I can't think of an obvious workaround for backreferences.
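
(For reference, a value transform looks something like this:

const moo = require('moo');

const lexer = moo.compile({
  // 'value' transforms the token's text as it is produced; here it strips the quotes
  string: {match: /"(?:\\["\\]|[^\n"\\])*"/, value: s => s.slice(1, -1)},
  ws: /[ \t]+/,
});

)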

Warning against using \9 etc in Moo rules sounds like a good idea.

I think you're right that we won't merge this, since it requires parsing RegExps. I'm afraid I can't think of a good workaround right now, but I might think about it. :)


@damoclark commented

Hi @nathan and @tjvr

"I really don't think we should use multiple RegExps at runtime"

Are there other reasons apart from the performance impact, @nathan?

The lexer could fall back to multiple RegExps only for the states that require it, and it could determine this at compile time.

Or have I underestimated the performance impact of this approach? Like @benjie, I'm no expert on lexing.

"I'm afraid I can't think of a good workaround right now, but I might think about it. :)"

Glad to be challenging great minds. :)

D.

@benjie (Author) commented Oct 15, 2019

Hey @tjvr; I'd be interested to hear whether you've managed to think of a better solution to this. I'm keen to use moo for tokenising PostgreSQL queries, but I can't see an easy way to achieve that without this feature.
