
Document support for preprocessing code #108

Open · Victorious3 opened this issue Apr 22, 2019 · 12 comments

@Victorious3 (Collaborator) commented Apr 22, 2019

Handling comments inside my grammar has been such a performance drag that I decided to strip them in a preprocessing step (I have /* nested /* comments */ */). There's a method called _preprocess in Buffer, which I ended up overriding for this purpose.

Sadly, that completely messes up the line numbers. I found no provision in TatSu for this, so I ended up generating my own LineCache in much the same way TatSu does, and converting the "wrong" line numbers to my "real" line numbers before emitting diagnostics. This worked, but it's obviously not ideal.

I have no idea how to generalize my solution, but I still think TatSu could support this in some way, so I'm leaving it open for discussion.

Here are my thoughts on it:

  • While built-in support for nested comments would fix my issue for the time being, adding more features to my language will probably bring this up again in the future.
  • Parsing languages that rely on indentation would be another example where this could prove useful.
  • I was thinking about running a different grammar first instead of rolling my own ugly (but fast) lexer, but that wouldn't fix the line numbers either. Maybe this could be built in? A "preprocessing grammar"?
  • Maybe I could have my preprocessor insert #line directives (see C). If those were supported by TatSu, it could make for a simple but powerful solution. (Such a feature would have to be customizable to avoid clashes.) Cons: I'd have to do math to figure out which directives to generate. Math is annoying.
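
For reference, the kind of nested-comment stripper I mean looks roughly like this (a simplified sketch, not my actual code):

    def strip_nested_comments(text: str) -> str:
        # Remove /* ... */ comments, which may nest; a single space is
        # left where a comment was so adjacent tokens don't fuse.
        out = []
        depth = 0
        i = 0
        while i < len(text):
            if text.startswith('/*', i):
                depth += 1
                i += 2
            elif depth and text.startswith('*/', i):
                depth -= 1
                i += 2
                if depth == 0:
                    out.append(' ')
            elif depth:
                i += 1  # dropping comment text is what shifts the line numbers
            else:
                out.append(text[i])
                i += 1
        return ''.join(out)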
@apalala (Collaborator) commented Apr 22, 2019

It should be much easier to strip complex comments by overriding eat_comments().
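
Something along these lines, for example (a rough sketch only; it assumes Buffer's match(), next(), and atend() helpers, and the exact hook names may differ between TatSu versions):

    from tatsu.buffering import Buffer

    class NestedCommentBuffer(Buffer):  # hypothetical subclass
        def eat_comments(self):
            # Consume nested /* ... */ comments at the current position
            # instead of the default regex-based comment matching.
            while self.match('/*'):
                depth = 1
                while depth and not self.atend():
                    if self.match('/*'):
                        depth += 1
                    elif self.match('*/'):
                        depth -= 1
                    else:
                        self.next()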

@Victorious3 (Collaborator, Author) commented Apr 22, 2019

True, I didn't think about that.
My _preprocess override does some other work, though, such as collapsing runs of newlines to simplify my grammar further. I could maybe do that with eat_comments() as well, but it wouldn't feel like any less of a hack, to be honest.

I found that TatSu does its own preprocessing by making the directive lines appear like comments, so it doesn't matter that they're left in. Nice shortcut on that one xD

@apalala (Collaborator) commented Apr 22, 2019

You can also override eat_whitespace() (or whatever it's called) to make things clearer.

@Victorious3 (Collaborator, Author) commented:

... fine, but what if I find something that requires actual preprocessing?

@apalala (Collaborator) commented Apr 22, 2019

A preprocessor is at most a macro interpreter. For actual preprocessing, you chain PEG parsers...
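
In outline, a chain could look like this (purely illustrative; the toy grammars and the ExpandMacros semantics are hypothetical, and positions reported by the second pass refer to the rewritten text, not the original source):

    import tatsu

    PRE_GRAMMAR = r'''
        @@grammar::PRE
        start = words:{ word } $ ;
        word = 'DOUBLE' | /\S+/ ;
    '''

    MAIN_GRAMMAR = r'''
        @@grammar::MAIN
        start = { /\S+/ } $ ;
    '''

    class ExpandMacros:
        # Methods named after rules follow TatSu's semantic-action protocol.
        def word(self, ast):
            return 'twice twice' if ast == 'DOUBLE' else ast

        def start(self, ast):
            return ' '.join(ast.words)

    pre = tatsu.compile(PRE_GRAMMAR)
    main = tatsu.compile(MAIN_GRAMMAR)

    expanded = pre.parse('say DOUBLE please', semantics=ExpandMacros())
    ast = main.parse(expanded)  # parses 'say twice twice please'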

@Victorious3 (Collaborator, Author) commented:

And can I do that while also keeping the line numbers intact?

@apalala (Collaborator) commented Apr 23, 2019

> And can I do that while also keeping the line numbers intact?

It would take some work. The output of the first pass would not be plain text, but something that contains the line number information (which is something you mention in your original post).

The current implementation of Buffer supports includes while preserving the line numbers across files. I've used it for accurate error reporting in languages with includes.

Take a look at buffering.Buffer.process_block(). It allows any transformation of the input text, as long as lines and index remain consistent: index holds the references to the original source code lines, and lines is the resulting text. A complete preprocessor parse may happen within process_block(). The key algorithm is that, for a preprocessed expression at lines[i:j] in the original source code, you do:

    lines[i:j] = preprocessed_lines  # may be []
    index[i:j] = LineIndexInfo.block_index('such changes, or filename', len(preprocessed_lines))
    return lines, index
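
As a concrete sketch of that protocol (hypothetical code, not from TatSu itself; the import location of LineIndexInfo may vary between versions), here is a Buffer subclass that drops directive lines while keeping the index in step:

    from tatsu.buffering import Buffer
    from tatsu.infos import LineIndexInfo

    class DirectiveStrippingBuffer(Buffer):  # hypothetical subclass
        def process_block(self, name, lines, index, **kwargs):
            n = 0
            while n < len(lines):
                if lines[n].lstrip().startswith('#pragma'):
                    # The replacement is empty, so both the text and its
                    # index entry are removed; later lines still map back
                    # to their original source positions.
                    lines[n:n + 1] = []
                    index[n:n + 1] = LineIndexInfo.block_index(name, 0)
                else:
                    n += 1
            return lines, index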

Preprocessing may be the least reviewed part of TatSu, as I wrote all of it in a hurry for COBOL and NaturalAG. There's probably room for improvement.

@apalala (Collaborator) commented Nov 28, 2021

This is solved in my last comment.

This is an example from the actual COBOL parser:

    def _preprocess_block(self, name, block, **kwargs):
        # Strip embedded EXEC SQL sections, run the default block
        # preprocessing, then resolve COBOL line continuations.
        block = uncomment_exec_sql(block)
        lines, index = super()._preprocess_block(name, block, **kwargs)
        continuations.preprocess_lines(lines, index)
        return lines, index

    def process_block(self, name, lines, index, **kwargs):
        lines = [self.normalize_cobol_line(i, c) for i, c in enumerate(lines)]

        # Splice COPY'd files into the text in place; resolve_copy()
        # updates lines and index together and returns the position
        # from which to continue scanning.
        n = 0
        while n < len(lines):
            if COPYRE.match(lines[n]):
                n = self.resolve_copy(n, lines, index, **kwargs)
            else:
                n += 1

        return lines, index

@apalala (Collaborator) commented Nov 28, 2021

I'm changing the title to leave the issue open and make it a documentation request.

@apalala changed the title from "Support for preprocessing code" to "Document support for preprocessing code" on Nov 28, 2021
@apalala (Collaborator) commented Aug 20, 2023

Just a PING to myself, because this is now a documentation request.

@apalala (Collaborator) commented Aug 20, 2023

@Victorious3, at some point Buffer became a particular implementation of Tokenizer (that happened while working on the new Python parser, which has its own tokenizer).

As for writing good documentation: couldn't your case be solved by a chain of Tokenizers?

@apalala (Collaborator) commented Aug 24, 2023

For your issue to be solved with a chain of Tokenizers, I think we need a way to keep the original line numbers.

PEG doesn't traditionally use a tokenizer, because the parser can drill down to comments and tokens, but the work on the Python PEG parser was much easier because there was a tokenizer.

The introduction of Tokenizer in TatSu was to allow strategies different from the text-based Buffer protocol.

Let's leave this open and think more about it.
