Format preprocessed token stream in multiple passes #1898

kbieganski · 2023-04-28T15:32:08Z

Currently, the formatter doesn't handle many scenarios involving preprocessor ifdefs/endifs interleaved with begins, module headers, etc. (#228, #241, #267) This patch attempts to solve this problem by performing multiple passes of the formatter on preprocessed variants of the source. Each of these variants has a different set of preprocessor branches enabled. Together, they should cover the entire source (though that doesn't work in all cases yet). After several formatting passes for different variants of the AST, a correct and properly formatted file is produced.

This is still work in progress, so not everything works, and the code isn't very clean. I'd love to get some early feedback on this.

kbieganski · 2023-05-17T10:22:59Z

@hzeller Pinging in case you missed it

kbieganski · 2023-05-23T13:51:48Z

@hzeller I looked at the other perprocessor PRs you mentioned, and I think most of it is in master, except for +define+ handling (even though Verible claims to support it? I may be missing something) and possibly other command line options. That's not really relevant to the formatter though.

But I did decide to reuse FlowTree, which I didn't even realize was available in master, so thanks for pointing me to those PRs.

hzeller · 2023-05-23T15:34:37Z

I think the other pull request has been mostly added to Verible in separate sub-pull requests, so they are probably mostly added; mostly my remark was to figure out if there are things still missing.

Sorry, I did not have a look at this PR yet - I was busy, then vacation and then forgot about it. I'll have a look later today.

kbieganski · 2023-06-07T15:42:44Z

The biggest challenges right now, from looking at the tests, are:

Matching column alignment across preprocessor control flow. Unfortunately, while formatting a single variant of the file, the formatter is completely unaware of any of the 'disabled' code. It is unable to infer alignment that takes all variants into account. I'm trying to do some sort of a pre-pass that would gather alignment info and then apply that to all formatting passes, but it's not going great so far. Perhaps it would be a good idea to clean this up and merge it behind a feature flag, and work on alignment in the scope of another PR, as I fear it will require an extensive gutting of the formatter code. That is provided the Verible maintainers like the approach presented in this PR :)
There are some problems with convergence. However, I think this is mostly caused by the previous point.

hzeller · 2023-06-17T02:15:31Z

I can imagine that filtering out branches can result in the potential problems you mentioned.

Maybe we can mostly mitigate that by trying to not include preprocessing branches that don't do any damage, and don't include them in the finding of MinCoverDefineVariants() ?

The following (intentionally unaligned) input for instance will be properly aligned in the current state of the formatter:

module foo(
input a,
`ifdef WITH_B
   input      [9999:0]  b,
`endif
  input wire [99:0] c
);
endmodule

So here, here we don't need to do anything special in the brancing of WITH_B as it does not create unbalanced syntax elements the parser would stumble upon. As a result, the formater sees all the parts necessary to generate pleasent alignment; however, if that same WITH_B is later used to create an unbalanced begin ... end block, then we're in trouble if we had not included it in the MinCoverDefineVariants()...

So maybe we need to not identify all the macros that create variants but drill down to only the preprocessing branches that cause invalid syntax trees if parsed without extra treatment ?

Detection of these might be a bit more complicated. Maybe we can have a very simple manual parser just looking at the token stream that e.g. counts if a block contains a balanced set of begin/end in each preprocessing branch (counter pluse/minus) ? There might be other common pairs, e.g. begin/end, module/endmodule ... so this might get a bit unwieldy if there are too many of these potentially nesting pairs to worry about...

If we have such a thing, it could also reduce the number of separate passes we need to do through the file and less chance of coming into a situation where we mess up alignment due to that.

Maybe we should have a video conference to discuss some ideas.

And yes, having the feature behind a flag can help (and if we're confident enough and it works well, we can switch it on by default, but allow to switch off).

(Sorry it took so long - I was distracted with other projects I am part of)

hzeller · 2023-06-17T00:42:05Z

verilog/preprocessor/verilog_preprocess.h

@@ -115,7 +118,9 @@ class VerilogPreprocess {

    // Expand macro definition bodies, this will relexes the macro body.
    bool expand_macros = false;
-    // TODO(hzeller): Provide a map of command-line provided +define+'s
+
+    MacroDefinitionRegistry macro_definitions;


This might not be the right place for this registry, but I suspect this is here currently mostly for the experimenting.

The FileList struct has such information as well, which should probably be consolidated with what we have here.

The final location might probably be yet different; there are a bunch of TODO's around FileList anyway.

I suspect you just have it here for now to make quick progress, but once that is nearing completion, we probably need to address that 'where to put things'.

hzeller · 2023-06-17T01:10:20Z

verilog/formatting/formatter.cc

@@ -199,10 +213,10 @@ static Status ReformatVerilog(absl::string_view original_text,
 }

 static absl::StatusOr<std::unique_ptr<VerilogAnalyzer>> ParseWithStatus(
-    absl::string_view text, absl::string_view filename) {
+    absl::string_view text, absl::string_view filename,
+    const verilog::VerilogPreprocess::Config& preprocess_config = {}) {


We could use the filter_branches config from the preprocess_config to enable the 'formatting with preprocessing' feature (so that we can test it with a flag)

hzeller · 2023-06-17T01:15:08Z

verilog/formatting/formatter.cc

+    } else if (IsPreprocessorControlFlow(verilog_tokentype(
+                   uwline.TokensRange().begin()->TokenEnum()))) {
+      UnwrappedLine formatted_line = uwline;
+      formatted_line.SetIndentationSpaces(0);


This zero is the indentation of the macro-line ? Maybe this can be made a constexpr somewhere so that it is easy to find in case someone wants to indent them in a particular way and wants to find the place
(not sure how common that is in verilog, but it could be some feature that people might request later)

hzeller · 2023-06-17T01:17:00Z

verilog/formatting/formatter.cc

-      if (!formatted_lines_.back().CompletedFormatting()) {
-        // Copy over any lines that did not finish wrap searching.
-        partially_formatted_lines.push_back(&uwline);
+      if (!uwline.TokensRange().front().token->visible) {


At some point we should worry about the cyclomatic complexity of this function, but good enough for this first work.

verilog/formatting/token_annotator.cc

hzeller · 2023-06-17T01:25:34Z

verilog/formatting/tree_unwrapper.cc

  State new_state = old_state;
  switch (old_state) {
+    case kPreprocessorControlFlow: {
+      if (token_type != PP_Identifier) {


is that so that we don't start a new partition between the ifdef and the identifier ?

codecov-commenter · 2023-07-09T20:21:21Z

Codecov Report

Patch coverage: 94.07% and project coverage change: -0.01 ⚠️

Comparison is base (92dc626) 92.84% compared to head (1319cd3) 92.84%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1898      +/-   ##
==========================================
- Coverage   92.84%   92.84%   -0.01%     
==========================================
  Files         355      355              
  Lines       26249    26538     +289     
==========================================
+ Hits        24372    24640     +268     
- Misses       1877     1898      +21

Impacted Files	Coverage Δ
common/text/text_structure.h	`100.00% <ø> (ø)`
common/text/token_info.h	`100.00% <ø> (ø)`
common/util/vector_tree.h	`98.66% <ø> (ø)`
verilog/analysis/flow_tree.h	`100.00% <ø> (ø)`
verilog/analysis/verilog_project.cc	`85.47% <ø> (ø)`
verilog/formatting/formatter.h	`100.00% <ø> (ø)`
verilog/formatting/tree_unwrapper.h	`66.66% <ø> (ø)`
verilog/preprocessor/verilog_preprocess.h	`100.00% <ø> (ø)`
common/formatting/tree_unwrapper.cc	`70.94% <9.09%> (-5.15%)`	⬇️
verilog/formatting/tree_unwrapper.cc	`94.46% <90.00%> (-0.33%)`	⬇️
... and 13 more

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

hzeller · 2023-07-13T22:51:36Z

Interesting that there are issues on Mac and Windows. I suspect that either newline confusion (on Windows) or some uninitialized value that randomly defaults to something else on these platforms.

Currently, the formatter doesn't handle many scenarios involving preprocessor `ifdef`s/`endif`s interleaved with `begin`s, module headers, etc. This patch attempts to solve this problem by performing multiple passes of the formatting on preprocessed variants of the source. Each of these variants has a different set of preprocessor branches enabled. Together, they should cover the entire source (though that doesn't work in all cases yet). After several formatting passes for different variants of the AST, a correct and properly formatted file is produced. Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>

kbieganski · 2023-07-14T10:55:46Z

The Windows and MacOS failures were basically due to some incorrect/UB pointer arithmetic. It works now. There's also the Read the Docs failure, not sure what that's about.

The multi-pass formatter is able to deal with all unit tests now, with the exception of two cases. First one:

module pd (
`ifdef FAA
    input  baaaz_t foo,
`else
    output reg     bar
`endif
);
endmodule : pd

(there's one more similar to this)
It sees a syntax error in the first port declaration. If FAA is defined, there's a dangling comma. I would argue Verible should either not accept this code, as it really is a syntax error, or just accept such dangling commas.

Second case:

`J(D(`ifdef e))

The multi-pass formatter doesn't change this, but according to the test, it should be formatted into:

`J(D(
   `ifdef e))

I don't really understand this test case, so I can't comment on the validity, but I think it's okay to not support this for now.

The multi-pass formatter can also format all of VeeR, except for two files where Verible's parser cannot deal with some preprocessor identifiers (but that doesn't work with the default formatter anyway). I haven't checked the result thoroughly, as it's a lot of code, but at least it doesn't crash and there is no divergence.

The PR is basically ready for review; the one architectural thing I don't like is that I had to put a new field in TokenInfo to distinguish tokens from different preprocessor passes. It's kind of weird, as it is only meaningful for the Verilog formatter. The fact that Verible is split into the common and verilog worlds does not help here. I don't have a great idea how to solve this for now.

hzeller · 2024-02-02T07:46:36Z

Just looking through pull requests that didn't finish up to the last development cycle. This one certainly should not get lost.

Since this needs rebase, I might rebase it into a new pull request. Probably collecting more comments here, then addressing them in the new PR. I am doing this in my free time, so probably only progress in evenings and weekends.

The use-after-move situations (clang-tidy 17 points them out nicely) is something I'd address, because we really can't assume that it would be safe to access afterwards (even though it typically works).

kbieganski requested a review from hzeller April 28, 2023 15:32

kbieganski force-pushed the formatter-preprocess branch from 0ece5aa to 10487a4 Compare May 23, 2023 13:44

kbieganski force-pushed the formatter-preprocess branch from 10487a4 to ffba497 Compare May 23, 2023 15:26

kbieganski force-pushed the formatter-preprocess branch 2 times, most recently from af3df1c to da412ef Compare May 24, 2023 10:36

kbieganski force-pushed the formatter-preprocess branch from da412ef to 598c025 Compare June 7, 2023 15:33

hzeller reviewed Jun 17, 2023

View reviewed changes

ddppt-yy mentioned this pull request Jul 3, 2023

how to parser a file with macro？ #1952

Open

kbieganski force-pushed the formatter-preprocess branch from 598c025 to c90b681 Compare July 7, 2023 16:38

kbieganski marked this pull request as ready for review July 7, 2023 16:40

kbieganski force-pushed the formatter-preprocess branch 2 times, most recently from 7669ef9 to 2324a37 Compare July 9, 2023 20:02

kbieganski force-pushed the formatter-preprocess branch 2 times, most recently from 457da90 to ffe1581 Compare July 13, 2023 16:44

kbieganski force-pushed the formatter-preprocess branch from ffe1581 to 1319cd3 Compare July 14, 2023 10:01

hzeller mentioned this pull request Oct 28, 2023

define is not properly considering #2033

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Format preprocessed token stream in multiple passes #1898

Format preprocessed token stream in multiple passes #1898

kbieganski commented Apr 28, 2023 •

edited

kbieganski commented May 17, 2023

kbieganski commented May 23, 2023

hzeller commented May 23, 2023

kbieganski commented Jun 7, 2023

hzeller commented Jun 17, 2023

hzeller Jun 17, 2023

hzeller Jun 17, 2023

hzeller Jun 17, 2023

hzeller Jun 17, 2023

hzeller Jun 17, 2023

kbieganski Jul 7, 2023

codecov-commenter commented Jul 9, 2023 •

edited

hzeller commented Jul 13, 2023

kbieganski commented Jul 14, 2023 •

edited

hzeller commented Feb 2, 2024

Format preprocessed token stream in multiple passes #1898

Are you sure you want to change the base?

Format preprocessed token stream in multiple passes #1898

Conversation

kbieganski commented Apr 28, 2023 • edited

kbieganski commented May 17, 2023

kbieganski commented May 23, 2023

hzeller commented May 23, 2023

kbieganski commented Jun 7, 2023

hzeller commented Jun 17, 2023

hzeller Jun 17, 2023

Choose a reason for hiding this comment

hzeller Jun 17, 2023

Choose a reason for hiding this comment

hzeller Jun 17, 2023

Choose a reason for hiding this comment

hzeller Jun 17, 2023

Choose a reason for hiding this comment

hzeller Jun 17, 2023

Choose a reason for hiding this comment

kbieganski Jul 7, 2023

Choose a reason for hiding this comment

codecov-commenter commented Jul 9, 2023 • edited

Codecov Report

hzeller commented Jul 13, 2023

kbieganski commented Jul 14, 2023 • edited

hzeller commented Feb 2, 2024

kbieganski commented Apr 28, 2023 •

edited

codecov-commenter commented Jul 9, 2023 •

edited

kbieganski commented Jul 14, 2023 •

edited