
Similar check: Passe min_lines to recombined. #4175

Merged
merged 1 commit into from Jul 28, 2021

Conversation

JulienPalard
Contributor

@JulienPalard JulienPalard commented Mar 2, 2021

Steps

  • Add yourself to CONTRIBUTORS if you are a new contributor.
  • Add a ChangeLog entry describing what your PR does.
  • If it's a new feature or an important bug fix, add a What's New entry in doc/whatsnew/<current release.rst>.
  • Write a good description of what the PR does.

Description

It closes #4118 by passing the option to the merger instance (the recombined.min_lines = self.min_lines part).

While working on it, @doublethefish spotted that the score was wrong: errors raised during the merge were not accounted for; they now are (the linter.stats = _merge_stats([linter.stats] + all_stats) part).

I tried to implement a test to avoid regressions. I'm not fluent with pylint internals yet, so reviews are welcome ♥

Type of Changes

Type
🐛 Bug fix

Related Issue

Closes #4118

@coveralls

coveralls commented Mar 2, 2021

Coverage Status

Coverage remained the same at 92.037% when pulling bf7282a on JulienPalard:mdk/similar-min-lines into 655a9bf on PyCQA:master.

@doublethefish
Contributor

This does solve the issue at hand, but it's a bit specific to this case.

A couple of points:

  • a test would be nice
  • it only fixes min_similar_lines and not the other configs for the SimilarChecker
  • other map/reduce checkers would need to parse out their parameters in a similar way

I’m not 100% sure, but it seems to me that the checker needs to be constructed in a way that lets the linter set the config on it.

@JulienPalard
Contributor Author

* it only fixes min_similar_lines and not the other configs for the SimilarChecker

I don't think other parameters are used in the reduce step, but again, it's really specific to SimilarChecker.

* other map/reduce checkers would need to parse out their parameters in a similar way

Which I would not recommend, it's not that readable...

I’m not 100% sure, but it seems to me that the checker needs to be constructed in a way that lets the linter set the config on it.

Or should reduce_map_data be made an instance method instead of a classmethod, so it can directly access the config?

@doublethefish
Contributor

Or should reduce_map_data be made an instance method instead of a classmethod, so it can directly access the config?

I thought the same thing, revisiting the code. But it, unfortunately, makes no difference and I do not mind admitting that that baffles me. I must be missing something (?).

I don't think other parameters are used in the reduce step, but again, it's really specific to SimilarChecker.

You are correct, the other settings are (currently) used when the lines are collected, not when analysed. 👍

This patch does fix this specific issue, but it masks the deeper issue. That said, I strongly suspect that it might be more useful to users of pylint to get the fix in, despite the general performance problem with MJ and the masking. There should be another way to demonstrate the deeper problem anyway.

@JulienPalard
Contributor Author

JulienPalard commented Mar 3, 2021

I thought the same thing, revisiting the code. But it, unfortunately, makes no difference and I do not mind admitting that that baffles me. I must be missing something (?).

I tested your branch with a test project:

(cd /tmp/testproj/; pylint --min-similarity-lines=2000 testproj)

vs

(cd /tmp/testproj/; pylint --min-similarity-lines=2 testproj)

The first one does not report, while the second one does, so I think your fix is good but not your test, which I haven't read yet.

I also tried with -j 1, -j 2, and -j 16, and had no issue, so it looks fixed from the command line but not from a config file?

@doublethefish
Contributor

Ah, thank you. I misspoke when I said "makes no difference": it does improve things, fixing the similarities report as you say.

Do you get the same quality scores in both cases? I am seeing there are still differences between single and MJ.

The tests on that branch, which admittedly are a bit OTT, use a config file to set the various settings for the runs. Based on your manual runs, I have added some that use CLI args as well. Thankfully I see the same effect.

@JulienPalard
Contributor Author

Do you get the same quality scores in both cases? I am seeing there are still differences between single and MJ.

Damned, no. I'm getting 10.00 with -j 20 and 9.98 with -j 1. I have no idea how this is computed yet, and I need to go back to $DAYJOB for now.

@doublethefish
Contributor

Thanks for confirming.

@shvenkat
Contributor

shvenkat commented Mar 4, 2021

I just saw this PR. I opened #4178 a couple days ago for the same issue.

Do we need to instantiate a new checker? If we could use the existing one in the linter object, it's already configured properly.

@Pierre-Sassoulas
Member

Hello, reading the discussion around this MR, it seems the problem lies in the option parsing being internal to the checker. Refactoring that aspect so the options are parsed once and then injected into the checkers would make this issue easy to fix. Is that right?

@JulienPalard
Contributor Author

option parsing being internal to the checker and refactoring that aspect so the option are parsed once and then injected into the checkers

Option parsing does not look internal to the checker: the checker only exposes an options class attribute which is used for external parsing, which is good design IMHO.

When parsed, the options are injected into the checker via a set_option callback, which is not bad either.
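To illustrate the pattern just described (a simplified sketch with hypothetical names, not pylint's actual BaseChecker API): the checker declares an options attribute that an external parser reads, and the parser pushes parsed values back via a set_option callback. A freshly constructed instance only knows the declared defaults:

```python
class Checker:
    """Hypothetical sketch: declarative option spec plus a
    set_option callback used by an external option parser."""

    options = (("min-similarity-lines", {"type": int, "default": 4}),)

    def __init__(self):
        # Initialize attributes from the declared defaults.
        for optname, spec in self.options:
            setattr(self, optname.replace("-", "_"), spec["default"])

    def set_option(self, optname, value):
        # Called by the external parser, once, on one instance.
        setattr(self, optname.replace("-", "_"), value)


configured = Checker()
configured.set_option("min-similarity-lines", 2000)  # option-aware instance
naive = Checker()  # a fresh instance only knows the defaults
print(configured.min_similarity_lines, naive.min_similarity_lines)  # -> 2000 4
```

This is why constructing a second SimilarChecker inside reduce_map_data loses the user's configuration: nothing ever calls set_option on the new instance.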

The problem is that the set_option callback implies there's a single instance of the checker aware of the options, so in the current:

@classmethod
def reduce_map_data(cls, linter, data):
    """Reduces and recombines data into a format that we can report on

    The partner function of get_map_data()"""
    recombined = SimilarChecker(linter)
    ...

when we create another SimilarChecker, it's a "naive" one, unaware of the options.

Discussions above showed three ways to find the "option-aware" SimilarChecker:

  • By getting the first one out of [c for c in linter.get_checkers() if c.name == cls.name], hoping for the best.
  • By getting the first one out of linter._checkers["similarities"], hoping for the best.
  • By making reduce_map_data an instance method instead of a classmethod, so self is the aware one.

I'm not a fan of the first one, mine: it really looks tangled.

I'm not a fan of the 2nd one either: it does the same thing and is shorter, but it uses a private attribute.

I'm not a fan of turning reduce into a normal method: having reduce_map_data be a classmethod instead of a method looks safer, as it cleanly isolates the steps. That looks like a good idea, since the first step of the merge is to call open, which itself destroys self.lineset: data could easily be lost.

But we could still use a normal method and create a clean new instance of Similar from there; let me try that.

@JulienPalard
Contributor Author

Hmm, in the end it looks similar to what I did here. Diff from the current PR to the proposed variation:

-    @classmethod
-    def reduce_map_data(cls, linter, data):
+    def reduce_map_data(self, linter, data):
         """Reduces and recombines data into a format that we can report on

         The partner function of get_map_data()"""
         recombined = SimilarChecker(linter)
-        checker = [c for c in linter.get_checkers() if c.name == cls.name][0]
-        recombined.min_lines = checker.min_lines
+        recombined.min_lines = self.min_lines
         recombined.open()
         Similar.combine_mapreduce_data(recombined, linesets_collection=data)
         recombined.close()

In both cases we use a clean, new SimilarChecker, and in both cases we copy the config.

@JulienPalard
Contributor Author

I prefer the version without checker = [c for c in linter.get_checkers() if c.name == cls.name][0], which looks like a bug magnet (why can there be more than one? what happens if we pick the wrong one? ...).

@Pierre-Sassoulas Pierre-Sassoulas added this to the 2.9.0 milestone Jun 8, 2021
@doublethefish
Contributor

doublethefish commented Jun 8, 2021 via email

@JulienPalard
Contributor Author

JulienPalard commented Jun 8, 2021

True, here's how I can reproduce it:

mkdir mdk-test
cd mdk-test  # This is to avoid using the pylint from pylint
cp ../pylint/constants.py ./file1.py
cp ../pylint/constants.py ./file2.py
pylint file1.py file2.py  # Gives 8.16/10, similar lines found
pylint --min-similarity-lines=999 file1.py file2.py  # Gives 8.42/10, no similar lines found
pylint --jobs 4 file1.py file2.py  # 8.42/10 (bad), similar lines found (good)
pylint --jobs 4 --min-similarity-lines=999 file1.py file2.py  # 8.42/10 (good), no similar line (good)

In PyLinter.stats, in the case of parallel runs, we're missing some data; for example, self.stats['by_msg']['duplicate-code'] is missing with --jobs 2 but present with --jobs 1.

I'm slowly understanding it: in parallel.py, _merge_stats gets the stats from the subprocesses, which do not yet know about the conflicting lines; we also need to merge in similar's map-reduced stats.
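The merge being described can be sketched like this (a simplified stand-in for pylint's _merge_stats, with made-up stats values): per-worker counters are summed, and the stats produced during the reduce phase in the main process must be folded in too, or messages like duplicate-code go missing:

```python
from collections import Counter


def merge_stats(all_stats):
    """Merge a list of stats dicts: sum the numeric counters and
    the per-message counts stored under 'by_msg'."""
    merged = {"by_msg": Counter()}
    for stats in all_stats:
        for key, value in stats.items():
            if key == "by_msg":
                merged["by_msg"].update(value)
            else:
                merged[key] = merged.get(key, 0) + value
    return merged


# The worker stats lack 'duplicate-code' because that message is
# only raised during the reduce step, in the main process; its
# stats therefore have to be merged in as well.
worker_stats = [
    {"statement": 50, "by_msg": {"unused-import": 1}},
    {"statement": 40, "by_msg": {}},
]
reduce_stats = {"by_msg": {"duplicate-code": 1}}
merged = merge_stats(worker_stats + [reduce_stats])
print(merged["by_msg"]["duplicate-code"])  # -> 1
```

Forgetting the reduce_stats entry reproduces the symptom from the comment above: by_msg differs between --jobs 1 and --jobs 2, and the score differs with it.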

@JulienPalard
Contributor Author

OK, I was able to fix the stats issue: the stats generated during the reduce phase were left alone, not merged with all the other stats.

@JulienPalard
Contributor Author

I'm trying to write the tests; it looks like there's still an issue with _merge_stats.

@JulienPalard
Contributor Author

Finally understood the stats merge issue \o/

I'm happy with this PR.

Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment


This looks like a nice change (the ratio of test lines added to lines changed is through the roof 🎺 🎉). I have a small comment about the changelog and a hard time understanding the test, but the latter is probably on me 😄

Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment


👍

@JulienPalard
Contributor Author

a hard time to understand the test

I can understand :( The idea of the test is to check that running with --jobs 1 gives the same result as --jobs n on a checker that raises messages during the reduce step (in the main process) instead of raising them from the worker processes.

So it focuses on testing the score, not the min_lines parameter, and doesn't fail if we re-introduce the min_lines bug, which is not good... ☹ (It does fail if we break the "score propagation" fix, though.)

I can work on a test dedicated to the min_lines+jobs bug, but I need to sleep first ;)

(2, 10, 3),
],
)
def test_map_reduce(self, num_files, num_jobs, num_checkers):
Contributor


You could probably, and more neatly, either:

  1. merge this with my original (awful?) code (put linter.register_checker(ExtraParallelTestChecker(linter)) next to linter.register_checker(ExtraSequentialTestChecker(linter)))
  2. create a shared function that accepts the ExtraSequentialTestChecker/ExtraParallelTestChecker types as params?

Either way, I apologize for writing the code this test is based on; it was the only way I could think of to do it in the time I had.

Contributor Author


Merging both tests, which do almost the same thing, would take fewer lines, but it would be more complicated.

The first one can pass while the second can fail (before this PR); if we merge them, I fear it will add complexity when debugging the day it fails again.

Contributor Author


Would something like this be more readable:

file_infos = _gen_file_datas(num_files)

checkers = [
    ParallelTestChecker,
    ExtraParallelTestChecker,
    ThirdParallelTestChecker,
][:num_checkers]

# Establish the baseline:
linter = PyLinter(reporter=Reporter())
for checker in checkers:
    linter.register_checker(checker(linter))
assert linter.config.jobs == 1, "jobs>1 are ignored when calling _check_files"
linter._check_files(linter.get_ast, file_infos)
stats_single_proc = linter.stats

# Run the same in parallel:
linter = PyLinter(reporter=Reporter())
for checker in checkers:
    linter.register_checker(checker(linter))
check_parallel(linter, jobs=num_jobs, files=file_infos, arguments=None)
stats_check_parallel = linter.stats

assert (
    stats_single_proc["by_msg"] == stats_check_parallel["by_msg"]
), "Single-proc and check_parallel() should return the same thing"

? It's 9 lines shorter, mostly due to the reduced indentation, which allows using longer lines.

"""Reduces and recombines data into a format that we can report on

The partner function of get_map_data()"""
recombined = SimilarChecker(linter)
recombined.min_lines = self.min_lines # Copy down relevant options
Contributor


By the way, it is probably necessary to add the other options as well:

        recombined.ignore_comments = self.ignore_comments
        recombined.ignore_docstrings = self.ignore_docstrings
        recombined.ignore_imports = self.ignore_imports
        recombined.ignore_signatures = self.ignore_signatures

Contributor Author


Let's focus on one issue at a time.

(No, I'm joking! Thanks for the careful review, all; it's heartwarming.)

As far as I understand, it's not needed, because the ignore_* options are handled during file collection, which happens in the "worker processes", before this recombination step.

Then the filtered lines are given back to the "main process", which (in this "recombined" step) actually searches for duplicates (hence the need for min_lines here).
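A toy sketch of that division of labor (illustrative only, not pylint's actual Similar implementation): workers apply the ignore_* filters while collecting lines, so the duplicate search in the main process only needs min_lines:

```python
def map_collect(lines, ignore_comments=True):
    """Worker side: collect lines, applying ignore_* filtering here,
    before anything is sent back to the main process."""
    return [
        line for line in lines
        if not (ignore_comments and line.lstrip().startswith("#"))
    ]


def reduce_find_duplicates(linesets, min_lines):
    """Main-process side: search two collected linesets for runs of
    at least min_lines identical lines; ignore_* no longer matters."""
    a, b = linesets
    hits = []
    for i in range(len(a) - min_lines + 1):
        window = a[i:i + min_lines]
        for j in range(len(b) - min_lines + 1):
            if b[j:j + min_lines] == window:
                hits.append((i, j))
    return hits


f1 = ["# a comment, dropped by the worker", "x = 1", "y = 2", "z = 3"]
f2 = ["x = 1", "y = 2", "z = 3"]
collected = [map_collect(f1), map_collect(f2)]
print(reduce_find_duplicates(collected, min_lines=3))  # -> [(0, 0)]
```

With min_lines=999 the same call reports nothing, which is the behavior the repro commands above exercise.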

Contributor


@JulienPalard in MR #4565 I had to add those lines in order to retrieve the same options in each process. But maybe I'm mistaken...

Contributor Author


As far as I understood, reduce_map_data is called only once, from the "main" process, to aggregate the lines already collected by the sub-processes. The ignore_* options have already been taken care of by the subprocesses: they just didn't gather those lines.

Maybe in your new algorithm the sub-processes ignore the ignore_* options, and you handle them in the main process?

Member


Should we merge this MR and keep that in mind when we rebase on master for #4565? Or maybe add a test for ignore_comments, docstrings, imports, or signatures consistency with multiprocessing right now, to be sure?

Contributor


@Pierre-Sassoulas I won't be able to investigate this before this weekend at best, so I think we could merge this as is. If @JulienPalard wants to add tests for the ignore options, that could be interesting, but it is not mandatory for this PR, which is "min-lines" oriented.

Contributor Author


I don't know when I'll have free time in the near future.

Anyway, if #4565 passes the ignore_* parameters even though they may not be useful, it does not break anything, so I have nothing against passing them. It reduces surprise and breaks nothing: it's not bad (and could even become useful in the future, who knows).

@Pierre-Sassoulas
Member

We merged #4565 today, where the changes were already introduced in main. We're going to essentially check the result with the new tests here, which is nice.

@Pierre-Sassoulas Pierre-Sassoulas merged commit c00b07b into pylint-dev:main Jul 28, 2021
@JulienPalard JulienPalard deleted the mdk/similar-min-lines branch September 2, 2021 10:18
Labels
Bug 🪲 Checkers Related to a checker
Development

Successfully merging this pull request may close these issues.

Pylint 2.7.0 seems to ignore the min-similarity-lines setting