
Best way of handling languages that have unicode identifiers? #72

Open
ethindp opened this issue Mar 17, 2024 · 10 comments

ethindp commented Mar 17, 2024

I'm considering writing a PEG grammar for Ada, which requires Unicode identifiers. (The grammar is going to be another huge one, so we'll see if npeg can handle it; I saw the issue about Nim choking on an SQL grammar, so...)

I see a couple ways to get this to work:

  • I could take all the Unicode code points making up each general category, per the Ada specification (Sec. 2.3), generate rules matching them, and then split each multi-byte code point into its individual bytes.
  • Write a lexer to generate tokens, and parse on those.

The first one would be very, very ugly. Doable, but very ugly indeed. Granted, if I go top-down instead of the order the reference manual uses (bottom-up), those rules would sit way at the bottom, so probably nobody would actually have to read them. But the second one is a bit challenging. One of the closed issues mentioned being able to keep state in the peg macro, but I didn't see anything about that in the README, so is that still a thing?

I suppose a third option would be some kind of hack where I figure out how to match identifiers, but hijack the parsing to do that in the code blocks (this might actually be a good feature to add if it isn't already possible, for issues just like this one). If I could do that, I could just import nim-unicodedb and check against its version of the UCD.
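For what it's worth, the category check such a code block would perform is cheap once an identifier candidate has been captured. A rough sketch in Python, using the stdlib `unicodedata` as a stand-in for nim-unicodedb; the category sets below are a simplification of the RM 2.3 rules, which additionally constrain underscore placement:

```python
import unicodedata

# Simplified Ada-style identifier categories (not the full RM 2.3 rules).
START_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONT_CATS = START_CATS | {"Mn", "Mc", "Nd", "Pc"}

def is_identifier(s: str) -> bool:
    """Validate an identifier by per-code-point general category lookup."""
    if not s or unicodedata.category(s[0]) not in START_CATS:
        return False
    return all(unicodedata.category(c) in CONT_CATS for c in s[1:])

print(is_identifier("π"))         # True: π has category Ll
print(is_identifier("Offset_1"))  # True: '_' is Pc, '1' is Nd
print(is_identifier("1x"))        # False: starts with a digit
```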


zevv commented Mar 18, 2024

> I'm considering writing a PEG grammar for Ada, which requires Unicode identifiers.

Interesting task you set yourself! Will this be limited to UTF-8 encoding only?

> (The grammar is going to be another huge one, so we'll see if npeg can handle it; I saw the issue about Nim choking on an SQL grammar, so...)

Well, size-wise I don't expect any big problems, I've generated some humongous parsers in the past without running into any issues, although compilation time might increase a bit. Run time performance and memory usage of NPeg parsers should not suffer from grammar size.

>   • I could take all the unicode codepoints making up a general category, [...]
>   • Write a lexer to generate tokens, and parse on those.

> The first one would be very, very ugly. Doable but very ugly indeed. Granted, if I go top-down instead of the order the reference manual goes in (bottom-up) they'd be way at the bottom so nobody would actually probably have to read those.

I need more details to get a clear mental picture of what you're up to, but I guess one important question is how strict you need your parser to be - do you want to accept valid Ada, and only valid Ada, or is it acceptable to make the parser more lenient and accept things that might violate the spec? The latter case would probably massively simplify the efforts you need to put into this.

> But the second one is a bit challenging. One of the closed issues mentioned being able to keep state in the peg macro, but I didn't see anything about that in the README, so is that still a thing?

Which issue are you referring to? Would https://github.com/zevv/npeg?tab=readme-ov-file#passing-state be what you are looking for?

> I suppose a third option would be doing some kind of hack where I figure out how to match identifiers, but hijack the parsing to do that in the code blocks (this might actually be a good feature to add if it isn't already possible for issues just like this one)

That's where code blocks can usually help: there is no reason to try to solve everything inside the NPeg parser grammar itself; you are free to delegate parts to Nim code if that helps you. If you find the current API lacking, I'd be more than happy to see if we can mold NPeg to support your use case.


ethindp commented Mar 19, 2024

> Interesting task you set yourself! Will this be limited to UTF-8 encoding only?

Yes; I could expand it to UTF-16 and UTF-32 later, and that wouldn't be very hard, since UTF-8, UTF-16, and UTF-32 all encode the same code point space, so text converts losslessly between them. Either way I'd still need to support all Unicode code points. (IMO, converting the input to UTF-32 would make it easier to work with, if only because then I wouldn't need to worry about problems like "is this byte in the middle of a code point?".) However, since Nim mainly supports UTF-8, that might be a good place to start.
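Incidentally, the "middle of a code point" question has a cheap answer in UTF-8, because continuation bytes are self-identifying: they always carry the bit pattern 10xxxxxx. A small Python illustration:

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx, so a byte
    # is mid-code-point iff its top two bits are 10.
    return byte & 0xC0 == 0x80

data = "π".encode("utf-8")       # b'\xcf\x80': one code point, two bytes
print(is_continuation(data[0]))  # False: 0xCF is a lead byte (110xxxxx)
print(is_continuation(data[1]))  # True: 0x80 continues the sequence
```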

> I need more details to get a clear mental picture of what you're up to, but I guess one important question is how strict you need your parser to be - do you want to accept valid Ada, and only valid Ada, or is it acceptable to make the parser more lenient and accept things that might violate the spec? The latter case would probably massively simplify the efforts you need to put into this.

Can you define "things that might violate the spec"? Do you mean accepting a subset of the overall Unicode codepoint space (like extended ASCII)? (For context, I think being more lenient might bite me later because the spec's math library defines both a constant pi and a constant π; see here.)

As for making it more complicated, I don't think it necessarily would. I can generate the Unicode parsing rules using ucd-generate and then turn the Rust tables it emits into a set of Unicode general-category rules, since I'm too lazy to try to parse the UCD myself lol.

> Which issue are you referring to? Would https://github.com/zevv/npeg?tab=readme-ov-file#passing-state be what you are looking for?

I was referring to the ability to change the token type used, where someone suggested you could use square brackets in the grammar... I think passing state might be it but I might be wrong.

> That's where code blocks can usually help: there is no reason to try to solve everything inside the NPeg parser grammar itself, you are free to delegate parts to Nim code if that helps you. If you find the current API lacking, I'd be more than happy to see if we can mold NPeg to support your use case.

Can you explain this more? How could I overwrite the parser like that in parts if I need to specify the rule to match first?


zevv commented Mar 19, 2024

> Can you define "things that might violate the spec"? Do you mean accepting a subset of the overall Unicode codepoint space (like extended ASCII)?

If you want to only accept identifiers as specified, you will need a strict definition of all the individual parts (letter_uppercase, letter_lowercase, etc.). For some parsers I made in the past I also got away with gross simplifications, like handling "every sequence of non-whitespace characters not matching a known keyword or operator" as an identifier, without strictly verifying that what is parsed actually matches the identifier rules.
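That lenient scheme can be sketched as follows (Python for illustration; the keyword and operator sets are hypothetical stand-ins, not Ada's real ones):

```python
import re

KEYWORDS = {"begin", "end", "if", "then"}  # hypothetical subset, not Ada's full list
OPERATORS = {":=", "+", ";"}

def classify(token: str) -> str:
    # Lenient scheme: any non-whitespace run that is neither a known
    # keyword nor an operator counts as an identifier, with no
    # per-character validation at all.
    if token in KEYWORDS:
        return "keyword"
    if token in OPERATORS:
        return "operator"
    return "identifier"

tokens = re.findall(r"\S+", "if π then x")
print([classify(t) for t in tokens])
# ['keyword', 'identifier', 'keyword', 'identifier']
```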

But of course this depends on your requirements, and indeed it might bite you later.

> As for making it more complicated, I don't think it necessarily would. I can generate the unicode parsing rules using ucd-generate and then taking that Rust code and dumping a set of Unicode general category rules from that, since I'm too lazy to try to parse the UCD myself lol.

Sounds like a solid plan.

> That's where code blocks can usually help: there is no reason to try to solve everything inside the NPeg parser grammar itself, you are free to delegate parts to Nim code if that helps you. If you find the current API lacking, I'd be more than happy to see if we can mold NPeg to support your use case.

> Can you explain this more? How could I overwrite the parser like that in parts if I need to specify the rule to match first?

Aah, right, I just realized I lied.

In an earlier version of NPeg the internal parser state was accessible to the code blocks (although not documented), so it was possible to inspect and modify the subject string and parser index from Nim code, basically allowing you to say: I'll take over from here, parse a bit with Nim code, update the NPeg state, and then give control back to NPeg. See the bottom example in the first message at #14.

This is no longer possible in the current implementation, although I once planned to bring this back as a proper feature when the need arises.
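The general shape of that hand-off, independent of NPeg's actual API, is a host-language scanner that consumes input from the current index and hands the new index back to the parser. A Python sketch (the function name and the category test are illustrative, not NPeg's interface):

```python
import unicodedata

def scan_identifier(subject: str, pos: int) -> int:
    """Consume an identifier starting at pos and return the new index
    (pos itself if nothing matched) -- the host side of the hand-off."""
    end = pos
    while end < len(subject) and (
        unicodedata.category(subject[end])[0] in "LN" or subject[end] == "_"
    ):
        end += 1
    return end

src = "π_value := 3"
new_pos = scan_identifier(src, 0)
print(src[:new_pos])  # prints: π_value
```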


ethindp commented Mar 22, 2024

Yeah, so I don't know how well this is going to work. Is there a way I can include rules from other parsers? Or write rules in a separate file? The size of the UCD means that if I include all general categories, the resulting list of rules is about 70 MB, and I don't know if it'll even parse lol. (No, I didn't think it'd be that big, but that's what you get when you have to split multi-byte characters into their individual bytes...)
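The blow-up is easy to reproduce: almost every code point in a letter category is multi-byte in UTF-8, so a byte-oriented grammar needs a separate byte-sequence alternative for nearly all of them. A quick Python count for one category (exact numbers depend on the Unicode version your runtime ships; only a-z are single-byte in Ll):

```python
import unicodedata

# Count the code points in category Ll (lowercase letter) and how many
# of them are multi-byte in UTF-8 -- each multi-byte one needs its own
# byte-sequence alternative in a byte-oriented grammar.
points = [cp for cp in range(0x110000)
          if unicodedata.category(chr(cp)) == "Ll"]
multibyte = sum(1 for cp in points if len(chr(cp).encode("utf-8")) > 1)
print(len(points), multibyte)  # all but the 26 ASCII letters are multi-byte
```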


ethindp commented Mar 22, 2024

This is particularly problematic since the Ada specification uses the graphic character alias category for character literals, which, according to Wikipedia, is "those with General Category Letter, Mark, Number, Punctuation, Symbol or Zs=space."
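The predicate itself is a one-liner over general categories; here it is in Python, with the stdlib `unicodedata` standing in for nim-unicodedb:

```python
import unicodedata

def is_graphic(ch: str) -> bool:
    # Graphic characters: General Category Letter (L*), Mark (M*),
    # Number (N*), Punctuation (P*), Symbol (S*), or Zs (space separator).
    cat = unicodedata.category(ch)
    return cat[0] in "LMNPS" or cat == "Zs"

print(is_graphic("A"), is_graphic("π"), is_graphic(" "), is_graphic("\n"))
# True True True False
```

The hard part is not the predicate but expressing the same set as byte-level PEG rules, which is where the size explosion comes from.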


zevv commented Mar 22, 2024

> Yeah so I don't know how well this is going to work. Is there a way I can include rules from other parsers? Or write rules in a separate file?

Yes, see https://github.com/zevv/npeg?tab=readme-ov-file#composing-grammars-with-libraries


ethindp commented Mar 22, 2024

@zevv Oh good. Though figuring out how to debug this is going to be an utter pain... Especially if it doesn't compile. I really hope I got the syntax right on the first go.


zevv commented Mar 22, 2024

> @zevv Oh good. Though figuring out how to debug this is going to be an utter pain... Especially if it doesn't compile. I really hope I got the syntax right on the first go.

My advice would be to build your grammar bottom up; first write the parsers for your basic atoms like identifiers, keywords and operators. Write little tests and keep these in place. From there slowly build up the grammar itself, parsing larger and larger fragments.
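As an illustration of the idea (in Python, with a regex standing in for an NPeg atom rule): write the smallest rule first, pin its behavior with tests, then compose upward:

```python
import re

# Smallest atom first: a rough Unicode-aware identifier pattern
# ([^\W\d] = a word character that is not a digit).
IDENT = re.compile(r"[^\W\d]\w*")

# Pin its behavior before composing it into anything larger.
assert IDENT.fullmatch("π")
assert IDENT.fullmatch("Offset_1")
assert not IDENT.fullmatch("1x")
print("identifier atom ok")
```

Keeping these micro-tests in place makes it much easier to tell whether a later failure is in the atom or in the composition.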

So, what is the end product of this parser for you? Will you be generating an AST of the code?


ethindp commented Mar 22, 2024

Most likely, yes. I've written a compiler before (not the cleanest, and for university), but we didn't cover ASTs/semantic analysis/code generation as it is done now. I don't know how far I'll get, but I'm willing to give it a go anyway lol


ethindp commented Mar 22, 2024

@zevv Well we might have an issue: when I build this small parser I get this error in debug builds of the compiler, and in release builds it crashes:

Error: call depth limit reached in a debug build (2000 function calls). You can change it with -d:nimCallDepthLimit=<int> but really try to avoid deep recursions instead.

I have no idea where the recursion is happening, but it may be happening in the peg macro. Or maybe the grammar one... I pinged you on Matrix.
