Remove $unsupported tokens in the regex lexers #269

Open
katef opened this issue Oct 10, 2020 · 1 comment

Owner

katef commented Oct 10, 2020

Currently we produce $unsupported for various things (e.g. lookahead and lookbehind in PCRE) and then error about it.

My suggestion instead is that we do lex these correctly, and then error about them in the parser instead. This moves the concept of unsupportedness along a layer.
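To make the suggestion concrete, here is a minimal sketch of the split. It is illustrative only: the token names, `lex` and `first_unsupported` are hypothetical and are not libfsm's actual lexer or parser interfaces, and only positive lookahead and lookbehind are recognised.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for illustration; not libfsm's real tokens. */
enum token {
    TOK_EOF, TOK_CHAR, TOK_OPEN, TOK_CLOSE,
    TOK_LOOKAHEAD, TOK_LOOKBEHIND
};

/* Lex one token at *p and advance past it. Lookaround openers get
 * their own token rather than a catch-all "unsupported" token. */
static enum token lex(const char **p)
{
    const char *s = *p;

    if (*s == '\0') return TOK_EOF;
    if (strncmp(s, "(?<=", 4) == 0) { *p += 4; return TOK_LOOKBEHIND; }
    if (strncmp(s, "(?=", 3) == 0)  { *p += 3; return TOK_LOOKAHEAD; }
    if (*s == '(') { *p += 1; return TOK_OPEN; }
    if (*s == ')') { *p += 1; return TOK_CLOSE; }

    *p += 1;
    return TOK_CHAR;
}

/* Parser-side policy: the lexer accepted everything, and the decision
 * about what is unsupported is made here. Returns NULL when nothing
 * unsupported was seen, else the name of the first such construct. */
static const char *first_unsupported(const char *re)
{
    for (;;) {
        switch (lex(&re)) {
        case TOK_EOF:        return NULL;
        case TOK_LOOKAHEAD:  return "lookahead";
        case TOK_LOOKBEHIND: return "lookbehind";
        default:             break; /* leaves the switch, not the loop */
        }
    }
}
```

The lexer no longer needs to know what the rest of the pipeline can handle; moving the check to a later stage only means moving `first_unsupported`.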

Eventually I'd like to also construct AST nodes for these, and only error about the unsupportedness when we come to do the AST->NFA conversion. This way we'd also have support for these features for e.g. AST -> regexp rendering (where FSMs are not involved), and perhaps also opportunities to deal with them by AST rewriting.
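A rough sketch of what that could look like, assuming a toy AST. The `struct node` layout and the `render` and `nfa_convertible` functions are hypothetical, not libfsm's real AST API, and literals are rendered without any escaping.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical AST for illustration; not libfsm's real node types. */
enum node_kind { NODE_LITERAL, NODE_CONCAT, NODE_LOOKAHEAD };

struct node {
    enum node_kind kind;
    char c;                    /* NODE_LITERAL only */
    const struct node *a, *b;  /* children; b is unused for lookahead */
};

/* Append s to buf at *pos. Sketch only: the caller must supply a
 * buffer that is large enough. */
static void emit(char *buf, size_t len, size_t *pos, const char *s)
{
    size_t n = strlen(s);

    assert(*pos + n < len);
    memcpy(buf + *pos, s, n + 1);
    *pos += n;
}

/* Rendering does not involve an FSM, so every node kind is handled. */
static void render(const struct node *n, char *buf, size_t len, size_t *pos)
{
    char lit[2] = { 0, 0 };

    switch (n->kind) {
    case NODE_LITERAL:
        lit[0] = n->c;
        emit(buf, len, pos, lit);
        break;
    case NODE_CONCAT:
        render(n->a, buf, len, pos);
        render(n->b, buf, len, pos);
        break;
    case NODE_LOOKAHEAD:
        emit(buf, len, pos, "(?=");
        render(n->a, buf, len, pos);
        emit(buf, len, pos, ")");
        break;
    }
}

/* Only the AST->NFA step needs to reject lookahead.
 * Returns 1 if the tree could be converted, 0 otherwise. */
static int nfa_convertible(const struct node *n)
{
    switch (n->kind) {
    case NODE_LITERAL:   return 1;
    case NODE_CONCAT:    return nfa_convertible(n->a) && nfa_convertible(n->b);
    case NODE_LOOKAHEAD: return 0;
    }
    return 0;
}
```

With this shape, a tree for `a(?=b)` can still be rendered back to text even though conversion to an NFA would be refused.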

Collaborator

sfstewman commented Oct 10, 2020

I think this makes a lot of sense.

For the PCRE dialect, $unsupported currently falls into four buckets:

  • Word boundary, capture groups, and multiline things that libfsm could potentially support: \b, \B, \K, \Z, and \G.
  • Back references. It makes sense to include these in the AST; simple forms like (foo)\1 can be transformed into (foo)(?:foo) which is compatible with DFAs and linear scanning.
  • Positive/negative look-ahead and look-behind assertions. We may be able to transform these into something compatible with linear scanning.
  • Ways to control backtracking: atomic qualifiers and (*VERB) forms like (*COMMIT) and (*PRUNE). These are so specific to backtracking matchers (and PCRE) that I'm not sure if we want to include them. On the other hand, there aren't that many forms of this, so it may make sense.
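As an illustration of the back-reference bucket, the `(foo)\1` rewrite could be sketched like this. It works on the pattern text rather than an AST to stay short, and `expand_backref1` is a hypothetical helper, not part of libfsm. It is deliberately conservative: it handles only `\1`, only when group 1 is the first group and its body is purely alphanumeric, and it gives up on alternation, character classes, or a group that might not participate in the match. The literal-only restriction matters because `(a|b)\1` must repeat whatever text was matched, which `(a|b)(?:a|b)` does not.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Rewrite "\1" as "(?:body)" when group 1 has a literal body.
 * Returns 1 and fills out on success, 0 if the pattern is outside
 * the simple form handled here or out is too small. */
static int expand_backref1(const char *re, char *out, size_t outlen)
{
    size_t i = 0, start, blen, o;

    /* Conservative: alternation and character classes are not handled. */
    if (strchr(re, '|') != NULL || strchr(re, '[') != NULL) return 0;

    /* Find the first unescaped '(' by skipping escape pairs. */
    while (re[i] != '\0' && re[i] != '(') {
        if (re[i] == '\\' && re[i + 1] != '\0') i++;
        i++;
    }
    if (re[i] != '(' || re[i + 1] == '?') return 0;

    /* The group body must be a non-empty run of alphanumerics. */
    start = i + 1;
    for (blen = 0; isalnum((unsigned char) re[start + blen]); blen++)
        ;
    if (blen == 0 || re[start + blen] != ')') return 0;

    /* An optional group could be unset when \1 is reached. */
    i = start + blen + 1;
    if (re[i] == '?' || re[i] == '*' || re[i] == '{') return 0;

    /* Copy everything up to and including the group unchanged. */
    if (i >= outlen) return 0;
    memcpy(out, re, i);
    o = i;

    while (re[i] != '\0') {
        if (re[i] == '\\' && re[i + 1] == '1'
            && !isdigit((unsigned char) re[i + 2])) {
            /* Replace \1 with a non-capturing copy of the body. */
            if (o + blen + 4 >= outlen) return 0;
            memcpy(out + o, "(?:", 3);         o += 3;
            memcpy(out + o, re + start, blen); o += blen;
            out[o++] = ')';
            i += 2;
        } else {
            /* Copy one character, or one escape pair, verbatim. */
            size_t n = (re[i] == '\\' && re[i + 1] != '\0') ? 2 : 1;
            if (o + n >= outlen) return 0;
            memcpy(out + o, re + i, n);
            o += n;
            i += n;
        }
    }

    out[o] = '\0';
    return 1;
}
```

A real implementation would presumably do this on the AST, where "the group body matches exactly one string" can be checked structurally instead of by scanning characters.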
