Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FUTURE: Gather ideas for column strategy #336

Open
rdiachenko opened this issue Aug 30, 2020 · 4 comments
Open

FUTURE: Gather ideas for column strategy #336

rdiachenko opened this issue Aug 30, 2020 · 4 comments

Comments

@rdiachenko
Copy link
Contributor

Column strategy should be able to distinguish between existing and new violations on the changed code lines and show new violations only. Current patchedline strategy can't understand what exactly was changed in the line (e.g. a whitespace was added between words), so it may show old violations as well.

Investigate ways to implement a column strategy. One of them can be usage of https://github.com/google/diff-match-patch to compare an old and a new line and get the columns within those lines with a new text.

@romani romani changed the title Gather ideas for column strategy FUTURE: Gather ideas for column strategy Aug 30, 2020
@rdiachenko
Copy link
Contributor Author

rdiachenko commented Mar 8, 2023

Here're few more unverified ideas to check and play with.

Try to detect location of changed code more accurate

Given a patch we can try to obtain previous code version by reverting that patch. In git it can be done via git apply -R <patch>. Having old and new code we can apply Meyer's or Histogram diff algorithms (either from the currently used org.eclipse.jgit library or java-diff-utils and find start/end position of changed code within a line of code. Those positions are columnNo in Checkstyle terminology which can be use for more accurate violation filtration.

In the org.eclipse.jgit changes are represented as a list of org.eclipse.jgit.diff.Edit instances. Each Edit instance contains start / end lines of code blocks which were updated:

// start / end lines in code block before patch
int beginA;
int endA;

// start / end lines in code block after patch
int beginB;
int endB;

This information can potentially be used in detecting start / end columnNo within a line marking new code or replaced one.

We can also try to configure diff algorithms mentioned above to skip whitespace characters. However, whitespace related checks may be affected. We can also consider a size of changed code block, if it's one line or few character change it probably makes sense to just skip such outliers and ignore new violations.

Use AST trees to accept only new violations

Given AST tree of the current code we can try to obtain an AST tree for the code before patch. Having old and new trees we can try to accept only new violations by comparing both trees.

Another option to investigate is to run Checkstyle second time on the old AST tree and just get a diff between two runs.

This option won't work on input files which are not represented as AST trees.

Do we need this in Checkstyle?

Considering Checkstyle is just a library one may write a plugin to gradle / maven / etc. or just a CI script to run Checkstyle twice (on code before and after patch) and fail if diff is not empty. May not be a viable option if codebase is huge, however, it may still be faster than running all the tests.

Another option is to cache Checkstyle execution on master / main and compare with new execution on pathed code in a different branch.

@rdiachenko
Copy link
Contributor Author

Existing attempts to solve the problem:

@rdiachenko
Copy link
Contributor Author

Checking the following idea.

Do we need this in Checkstyle?

  1. Find a diff between main and current branch
[main]$ git checkout -b feature

# Apply changes to `feature` and commit them

[feature]$ git diff $(git merge-base main feature) feature > feature.patch

git merge-base gives the most recent common commit between the branches.

Using origin/main should help in case local main is outdated.

  1. List all modified files
[feature]$ git diff --name-only $(git merge-base main feature) \
  feature > modified-files.txt
  1. Apply the patch in reverse
[feature]$ git apply -R feature.patch
  1. Run Checkstyle against modified files
[feature]$ java -jar checkstyle/checkstyle-10.9.3-all.jar \
  -c checkstyle/config.xml $(cat modified-files.txt) > checkstyle-main.txt
  1. Apply the patch
[feature]$ git apply feature.patch
  1. Run Checkstyle against modified files
[feature]$ java -jar checkstyle/checkstyle-10.9.3-all.jar \
  -c checkstyle/config.xml $(cat modified-files.txt) > checkstyle-feature.txt
  1. Compare Checkstyle executions
[feature]$ comm -13 <(sort checkstyle-main.txt) <(sort checkstyle-feature.txt)

Todo

  1. Need to compare Checkstyle runs in a more clever way. Some existing violations may change their position which gives false postitives.
  2. How to generate Checkstyle report for new violations only?

@romani
Copy link
Member

romani commented Apr 3, 2023

Only comparison of xpath on violation , can help with detection of new or very different by structure same violation ( ok to keep as new).
We have feature in CLI to print xpath on violations, it can help.

Comparison by line and column will be too unreliable.

We might end up in state to figure out algorithm how to compare two xpath and consider them the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants