
Document support for preprocessing code #108

Open · Victorious3 opened this issue Apr 22, 2019 · 12 comments

@Victorious3 (Collaborator) commented Apr 22, 2019

Handling comments inside my grammar has been such a performance drag that I decided to strip them in a preprocessing step (I have /* nested /* comments */ */). There's a method called _preprocess in Buffer, which I ended up overriding for this purpose.

Sadly, that completely messes up the line numbers. I found no provision in TatSu for this, so I ended up generating my own LineCache in much the same way TatSu does, and converting the "wrong" line numbers to my "real" line numbers before emitting diagnostics. This worked, but it's obviously not ideal.

I have no idea how to generalize my solution, but I still think TatSu could support this in some way, so I'm leaving it open for discussion.

Here are my thoughts on it:

  • While built-in support for nested comments would fix my issue for the time being, adding more features to my language will probably bring this up again in the future.
  • Parsing languages that rely on indentation would be another example where this could prove useful.
  • I was thinking about running a different grammar first instead of rolling my own ugly (but fast) lexer, but that wouldn't fix the line numbers either. Maybe this could be built in? A "preprocessing grammar"?
  • Maybe I could have my preprocessor insert #line directives (see C). If those were supported by TatSu, it could make for a simple but powerful solution. (Such a feature would have to be customizable to avoid clashes.) Cons: I'd have to do math to figure out which directives to generate. Math is annoying.
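
For reference, the kind of nested-comment stripper I mean looks roughly like this (a simplified sketch, not my actual code):

    def strip_nested_comments(text: str) -> str:
        # Remove /* ... */ comments, which may nest; a single space is
        # left where a comment was so adjacent tokens don't fuse.
        out = []
        depth = 0
        i = 0
        while i < len(text):
            if text.startswith('/*', i):
                depth += 1
                i += 2
            elif depth and text.startswith('*/', i):
                depth -= 1
                i += 2
                if depth == 0:
                    out.append(' ')
            elif depth:
                i += 1  # dropping comment text is what shifts the line numbers
            else:
                out.append(text[i])
                i += 1
        return ''.join(out)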
@apalala (Collaborator) commented Apr 22, 2019

It should be much easier to strip complex comments by overriding eat_comments().
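
Something along these lines, for example (a rough sketch only; it assumes Buffer's match(), next(), and atend() helpers, and the exact hook names may differ between TatSu versions):

    from tatsu.buffering import Buffer

    class NestedCommentBuffer(Buffer):  # hypothetical subclass
        def eat_comments(self):
            # Consume nested /* ... */ comments at the current position
            # instead of the default regex-based comment matching.
            while self.match('/*'):
                depth = 1
                while depth and not self.atend():
                    if self.match('/*'):
                        depth += 1
                    elif self.match('*/'):
                        depth -= 1
                    else:
                        self.next()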

@Victorious3 (Collaborator, Author) commented Apr 22, 2019

True, I didn't think about that.
My _preprocess override does some other work, though, such as collapsing runs of newlines to simplify my grammar further. I could maybe do that with eat_comments() as well, but it wouldn't feel like any less of a hack, to be honest.

I found that TatSu does its own preprocessing by making the directive lines appear like comments, so it doesn't matter that they're left in. Nice shortcut on that one xD

@apalala (Collaborator) commented Apr 22, 2019

You can also override eat_whitespace() (or whatever it's called) to make things clearer.

@Victorious3 (Collaborator, Author) commented:

... fine, but what if I find something that requires actual preprocessing?

@apalala (Collaborator) commented Apr 22, 2019

A preprocessor is at most a macro interpreter. For actual preprocessing, you chain PEG parsers...
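
In outline, a chain could look like this (purely illustrative; the toy grammars and the ExpandMacros semantics are hypothetical, and positions reported by the second pass refer to the rewritten text, not the original source):

    import tatsu

    PRE_GRAMMAR = r'''
        @@grammar::PRE
        start = words:{ word } $ ;
        word = 'DOUBLE' | /\S+/ ;
    '''

    MAIN_GRAMMAR = r'''
        @@grammar::MAIN
        start = { /\S+/ } $ ;
    '''

    class ExpandMacros:
        # Methods named after rules follow TatSu's semantic-action protocol.
        def word(self, ast):
            return 'twice twice' if ast == 'DOUBLE' else ast

        def start(self, ast):
            return ' '.join(ast.words)

    pre = tatsu.compile(PRE_GRAMMAR)
    main = tatsu.compile(MAIN_GRAMMAR)

    expanded = pre.parse('say DOUBLE please', semantics=ExpandMacros())
    ast = main.parse(expanded)  # parses 'say twice twice please'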

@Victorious3 (Collaborator, Author) commented:

And can I do that while also keeping the line numbers intact?

@apalala (Collaborator) commented Apr 23, 2019

> And can I do that while also keeping the line numbers intact?

It would take some work. The output of the first pass would not be plain text, but something that contains the line number information (which is something you mention in your original post).

The current implementation of Buffer supports includes while preserving the line numbers across files. I've used it for accurate error reporting in languages with includes.

Take a look at buffering.Buffer.process_block(). It allows any transformation of the input text, as long as lines and index remain consistent: index holds the references to the original source code lines, and lines is the resulting text. A complete preprocessor parse may happen within process_block(). The key algorithm is that, for a preprocessed expression at lines[i:j] in the original source code, you do:

    lines[i:j] = preprocessed_lines  # may be []
    index[i:j] = LineIndexInfo.block_index('such changes, or filename', len(preprocessed_lines))
    return lines, index
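
As a concrete sketch of that protocol (hypothetical code, not from TatSu itself; the import location of LineIndexInfo may vary between versions), here is a Buffer subclass that drops directive lines while keeping the index in step:

    from tatsu.buffering import Buffer
    from tatsu.infos import LineIndexInfo

    class DirectiveStrippingBuffer(Buffer):  # hypothetical subclass
        def process_block(self, name, lines, index, **kwargs):
            n = 0
            while n < len(lines):
                if lines[n].lstrip().startswith('#pragma'):
                    # The replacement is empty, so both the text and its
                    # index entry are removed; later lines still map back
                    # to their original source positions.
                    lines[n:n + 1] = []
                    index[n:n + 1] = LineIndexInfo.block_index(name, 0)
                else:
                    n += 1
            return lines, index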

Preprocessing may be the least reviewed part of TatSu, as I wrote all of it in a hurry for COBOL and NaturalAG. There's probably room for improvement.

@apalala (Collaborator) commented Nov 28, 2021

This is solved in my last comment.

This is an example from the actual COBOL parser:

    def _preprocess_block(self, name, block, **kwargs):
        # Strip embedded EXEC SQL sections, run the default block
        # preprocessing, then resolve COBOL line continuations.
        block = uncomment_exec_sql(block)
        lines, index = super()._preprocess_block(name, block, **kwargs)
        continuations.preprocess_lines(lines, index)
        return lines, index

    def process_block(self, name, lines, index, **kwargs):
        lines = [self.normalize_cobol_line(i, c) for i, c in enumerate(lines)]

        # Splice COPY'd files into the text in place; resolve_copy()
        # updates lines and index together and returns the position
        # from which to continue scanning.
        n = 0
        while n < len(lines):
            if COPYRE.match(lines[n]):
                n = self.resolve_copy(n, lines, index, **kwargs)
            else:
                n += 1

        return lines, index

@apalala (Collaborator) commented Nov 28, 2021

I'm changing the title to leave the issue open and make it a documentation request.

@apalala changed the title from "Support for preprocessing code" to "Document support for preprocessing code" on Nov 28, 2021
@apalala (Collaborator) commented Aug 20, 2023

Just a PING to myself, because this is now a documentation request.

@apalala (Collaborator) commented Aug 20, 2023

@Victorious3, at some point Buffer became a particular implementation of Tokenizer (that happened while working on the new Python parser, which has its own tokenizer).

As for writing good documentation: couldn't your case be solved by a chain of Tokenizers?

@apalala (Collaborator) commented Aug 24, 2023

For your issue to be solved with a chain of Tokenizers, I think we need a way to keep the original line numbers.

PEG doesn't traditionally use a tokenizer, because the parser can drill down to comments and tokens, but the work on the Python PEG parser was much easier because there was a tokenizer.

The introduction of Tokenizer in TatSu was to allow strategies different from the text-based Buffer protocol.

Let's leave this open and think more about it.
