Fuzzy Parsing Example #993

Open
bd82 opened this issue Aug 7, 2019 · 9 comments

@bd82
Member

bd82 commented Aug 7, 2019

See:

This could probably be done by using Token Categories to define a token type that matches "all" kinds of tokens,
combined with an alternation that also includes the one option we care about.

e.g:

      $.RULE("fuzzyConstantFinder", () => {
        $.OR([
          // We may need a GATE here, e.g. using backtracking, to ensure we never
          // incorrectly enter the constant alternative.
          { ALT: () => $.SUBRULE($.constant) },
          // AnyToken should be defined using Token Categories.
          { ALT: () => $.CONSUME(AnyToken) },
        ]);
      });

      $.RULE("constant", () => {
        $.CONSUME(Public);
        $.CONSUME(Static);
        $.CONSUME(Final);
        $.CONSUME(Int);
        $.CONSUME(Ident);
      });

@matthew-dean
Contributor

The documentation for Token Categories is rather sparse. What's a practical application?

@bd82
Member Author

bd82 commented Aug 8, 2019

Basically, you can specify that multiple tokens belong to a category X
and then match against X in the Parser. If you think of the CONSUME(X) method as:

  • Eat next token iff next token is instanceof X.

Then token categories allow you to define multiple "inheritance" between tokens.

Example:

So basically this can be extended to define a "MatchALL" token, which could be used as part of a fuzzy parsing solution.

@Sciumo

Sciumo commented Aug 8, 2019

A good example of fuzzy parsing would be able to return a list of commands and whatever text is between commands for further parsing.

machine:chevrotain user$ ls
CONTRIBUTING.md		greenkeeper.json	readme.md
LICENSE.txt		lerna.json		tslint.json
NOTICE.txt		package.json		yarn.lock
examples		packages
machine:chevrotain user$ cat NOTICE.txt 
Copyright (c) 2015-2019 SAP SE or an SAP affiliate company.
machine:chevrotain user$ 

@bd82
Member Author

bd82 commented Aug 8, 2019

I am not sure this can even be represented as a context free grammar.
Can you assume some delimiters between commands? For example, the "$" sign never being present in the command output? Or perhaps the "machine:chevrotain user$" prefix appearing before every command?

@Sciumo

Sciumo commented Aug 8, 2019

This task is accomplished using regexp scanners. IMHO a good use of Chevrotain is the construction and management of domain-specific tokenizers at run time, then scanning the resulting tokens for re-tokenizing and possibly complete parsing. My goal is to learn tokens at run time and restart the process. Fuzzy parsing is a means of implementing a learning parsing system.

@bd82
Member Author

bd82 commented Aug 8, 2019

> IMHO a good use of Chevrotain is the construction and management of domain specific tokenizers at run time.

Interesting. You could use some heuristics to identify the delimiter "machine:chevrotain user$" and then dynamically build a lexer that would be able to fuzzy scan those commands.
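The heuristic idea could be sketched roughly as follows, in plain JavaScript without Chevrotain. The heuristic itself is an assumption made for illustration: the prompt is taken to be the most frequent line prefix ending in "$", seen at least twice. The helper names `findPrompt`, `escapeRegExp` and `splitTranscript` are hypothetical.

```javascript
// Guess the prompt: the most common line prefix up to and including the first "$".
function findPrompt(lines) {
  const counts = new Map();
  for (const line of lines) {
    const idx = line.indexOf("$");
    if (idx === -1) continue;
    const prefix = line.slice(0, idx + 1);
    counts.set(prefix, (counts.get(prefix) || 0) + 1);
  }
  let best = null;
  let bestCount = 1; // a prefix must occur at least twice to count as a prompt
  for (const [prefix, count] of counts) {
    if (count > bestCount) {
      best = prefix;
      bestCount = count;
    }
  }
  return best;
}

// Escape regexp metacharacters so the prompt is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Split a transcript into { command, output } entries using the learned prompt.
function splitTranscript(text) {
  const lines = text.split("\n");
  const prompt = findPrompt(lines);
  if (prompt === null) return [];
  const promptRe = new RegExp("^" + escapeRegExp(prompt) + " ?(.*)$");
  const entries = [];
  for (const line of lines) {
    const match = promptRe.exec(line);
    if (match) {
      entries.push({ command: match[1].trim(), output: [] });
    } else if (entries.length > 0) {
      entries[entries.length - 1].output.push(line);
    }
  }
  return entries;
}
```

The escaped prompt could also serve as the `pattern` of a token created at run time with `createToken`, which would be one way to build the lexer dynamically.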

In your case the "grammar" itself seems trivial so I am not sure there would be any need for a Chevrotain Parser part.

@bd82
Member Author

bd82 commented Aug 8, 2019

BTW you can dynamically create Chevrotain Parsers as well as Lexers using the custom APIs feature

Granted this has limitations:

And it also requires the use of eval (not supported when there is a Content Security Policy in place, e.g. on many websites).

But could still be interesting...

@bd82
Member Author

bd82 commented Aug 10, 2019

I've started playing around with a similar scenario, which requires consuming any kind of token between a "--" and a semicolon.

Basically the CSS 3 custom property syntax:

You can inspect the current state of the example here:

@bd82
Member Author

bd82 commented Sep 5, 2019

Consider expanding the fuzzy parsing example to a scenario in which the fuzzy matching acts as a default fallback.

e.g. three alternatives, with the third being the fuzzy one, which could conflict with the first two.
