Dynamically changing how tokens are recognized #296

Open
chuckcscccl opened this issue Apr 14, 2023 · 1 comment

chuckcscccl commented Apr 14, 2023

I am writing parser generators and would like to experiment with using logos to create lexical scanners, which should be faster than the ones that I've coded. Before I proceed, however, I'd like to know if it's possible to change the way in which tokens are recognized at runtime. I have two examples of where I would need this kind of ability:

  1. When parsing C using the published ANSI C grammar, if you have:
      typedef unsigned int uint;
      uint x = 2;

The first occurrence of "uint" should be recognized as an IDENTIFIER token, but the second should be recognized as a TYPE_NAME. Confusing the two would lead to conflicts within the grammar. This can be handled by constructing a symbol table as we parse and writing "semantic actions" that affect how the lexer recognizes certain alphanumeric sequences.

  2. Another example is when parsing a generic type expression such as

    HashMap<Keytype,HashSet<Valuetype>>

The problem here is that the ">>" symbol also represents the right-shift operator in many languages. But here, it should be recognized as two separate tokens. Most lexers, including logos as far as I can tell, will give priority to the longer match, so unless there's a space in between the two ">" symbols the expression would not be parsed properly. Here again, the parser needs to be able to instruct the lexer on how to recognize tokens dynamically, maybe even telling it to "back up a bit".

Are there solutions to these kinds of problems within logos?

@maciejhirsz (Owner)

Generally any time you want "x but context aware" the answer almost always is "do it in the parser".

Couple things that can help here:

  1. For things like changing the token variant depending on the previous token, you could try to hack something using extras.
  2. For things like recognizing >> as either two separate angle brackets or a right shift, you can define a separate token enum just for generic types, and then use the morph method to switch between the two lexers depending on whether you are parsing generics. This is currently a bit unwieldy, since morph requires you to take the previous lexer by owned value; once the rewrite is done it should be much nicer (see the last release notes and #291, "I'd like to volunteer!").
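
To make the first idea concrete, here is a minimal sketch of the typedef case using only the Rust standard library. It does not use logos: the `SymbolTable` struct only stands in for the kind of user state that logos exposes through the lexer's `extras` field, and `Token`, `classify`, and `lex_with_typedefs` are hypothetical names invented for this illustration. Statements are split naively on `;` and whitespace, which is enough to show the idea but is not a real C lexer.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Typedef,
    Identifier(String),
    TypeName(String),
    Other(String),
}

/// Mutable state consulted while classifying words. In a logos lexer this
/// role would be played by user-defined `extras` (assumption: this struct
/// is illustrative and not part of any library).
#[derive(Default)]
struct SymbolTable {
    type_names: HashSet<String>,
}

/// Decide which token a word becomes, given the names registered so far.
fn classify(word: &str, symbols: &SymbolTable) -> Token {
    match word {
        "typedef" => Token::Typedef,
        "unsigned" | "int" => Token::Other(word.to_string()),
        w if symbols.type_names.contains(w) => Token::TypeName(w.to_string()),
        w if w
            .chars()
            .next()
            .map_or(false, |c| c.is_ascii_alphabetic() || c == '_') =>
        {
            Token::Identifier(w.to_string())
        }
        w => Token::Other(w.to_string()),
    }
}

/// Tokenize `;`-separated statements, registering typedef names as we go.
fn lex_with_typedefs(src: &str) -> Vec<Token> {
    let mut symbols = SymbolTable::default();
    let mut out = Vec::new();
    for stmt in src.split(';') {
        let start = out.len();
        for word in stmt.split_whitespace() {
            out.push(classify(word, &symbols));
        }
        // "Semantic action": after a typedef statement, remember the name it
        // introduced so that later occurrences are classified as TypeName.
        if out.get(start) == Some(&Token::Typedef) {
            if let Some(Token::Identifier(name)) = out.last() {
                symbols.type_names.insert(name.clone());
            }
        }
    }
    out
}
```

Calling `lex_with_typedefs("typedef unsigned int uint; uint x = 2;")` classifies the first `uint` as `Identifier` and the second as `TypeName`, because the symbol table is updated between the two statements.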

That said, for 2 I don't see why a parser building the AST for generics can't just accept both > and >> as closing brackets, and for 1 you could also make the parser accept certain keywords as identifiers (or vice versa) depending on context (AFAIK in rustc all keywords are just idents, and it's the parser that rejects idents that are keywords based on their value).
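
As a rough illustration of the "accept > and >> as closing brackets" suggestion, the sketch below keeps an ordinary longest-match lexer, so `>>` is always one `Shr` token, and lets the parser treat `Shr` as two closers by remembering that one `>` is still pending. It uses only the standard library; `Tok`, `Type`, `Parser`, and `parse_generic` are hypothetical names for this example and are not logos APIs.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Ident(String),
    Lt,
    Gt,
    Shr,
    Comma,
}

#[derive(Debug, PartialEq)]
struct Type {
    name: String,
    args: Vec<Type>,
}

/// Longest-match lexer: ">>" always becomes a single `Shr` token.
fn lex(src: &str) -> Vec<Tok> {
    let chars: Vec<char> = src.chars().collect();
    let (mut i, mut out) = (0, Vec::new());
    while i < chars.len() {
        match chars[i] {
            '<' => { out.push(Tok::Lt); i += 1; }
            ',' => { out.push(Tok::Comma); i += 1; }
            '>' if chars.get(i + 1) == Some(&'>') => { out.push(Tok::Shr); i += 2; }
            '>' => { out.push(Tok::Gt); i += 1; }
            c if c.is_alphanumeric() || c == '_' => {
                let start = i;
                while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                    i += 1;
                }
                out.push(Tok::Ident(chars[start..i].iter().collect()));
            }
            _ => i += 1, // skip whitespace and anything unrecognized
        }
    }
    out
}

struct Parser {
    toks: Vec<Tok>,
    pos: usize,
    /// True when a `Shr` was consumed and its second '>' is still unused.
    pending_gt: bool,
}

impl Parser {
    fn parse_type(&mut self) -> Option<Type> {
        let name = match self.toks.get(self.pos)? {
            Tok::Ident(n) => n.clone(),
            _ => return None,
        };
        self.pos += 1;
        let mut args = Vec::new();
        if self.toks.get(self.pos) == Some(&Tok::Lt) {
            self.pos += 1;
            loop {
                args.push(self.parse_type()?);
                if !self.pending_gt && self.toks.get(self.pos) == Some(&Tok::Comma) {
                    self.pos += 1;
                } else {
                    break;
                }
            }
            self.expect_close()?;
        }
        Some(Type { name, args })
    }

    /// Accept `Gt`, the leftover half of a `Shr`, or the first half of a `Shr`.
    fn expect_close(&mut self) -> Option<()> {
        if self.pending_gt {
            self.pending_gt = false;
            return Some(());
        }
        match self.toks.get(self.pos)? {
            Tok::Gt => { self.pos += 1; Some(()) }
            Tok::Shr => { self.pos += 1; self.pending_gt = true; Some(()) }
            _ => None,
        }
    }
}

/// Parse a whole generic type expression; fails on leftover input or a stray '>'.
fn parse_generic(src: &str) -> Option<Type> {
    let mut p = Parser { toks: lex(src), pos: 0, pending_gt: false };
    let ty = p.parse_type()?;
    if p.pos == p.toks.len() && !p.pending_gt { Some(ty) } else { None }
}
```

With this approach the lexer never needs to be told about context: `parse_generic("HashMap<Keytype,HashSet<Valuetype>>")` succeeds, while an expression parser built on the same token stream could still treat `Shr` as the right-shift operator.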
