New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Similar check: Passe min_lines to recombined. #4175
Similar check: Passe min_lines to recombined. #4175
Conversation
This was solve the issue in hand. But its a bit specific in this case. A couple of points:
I’m not 100% sure, but it seems to me that the checker needs to be constructed in a way that lets the linter set the config on it. |
I don't think other parameters are used in the reduce step, but again, it's really specific to SimilarChecker.
Which I would not recommend, it's not that readable...
Or should |
I thought the same thing, revisiting the code. But it, unfortunately, makes no difference and I do not mind admitting that that baffles me. I must be missing something (?).
You are correct, the other settings are (currently) used when the lines are collected, not when analysed. 👍 This patch does fix this specific issue, but it masks the deeper issue. That said, I strongly suspect that it might be more useful to users of pylint to get the fix in, despite the general performance problem with MJ and the masking. There should be another way to demonstrate the deeper problem anyway. |
I tested your branch with a test project:
vs
the first one does not report, while the 2nd one do report, so I think your fix is good but not your test, which I did not read yet. I also tried by adding |
Ah thank you, I mispoke when I said ' makes no difference', it does improve things, fixing the similarities report as you say. Do you get the same quality scores in both cases? I am seeing there are still differences between single and MJ. The tests on that branch, which admitedly are a bit OTT, use a config-file to set the various settings for the runs. Based on your manual runs, I have added some which use CLI args as well. Thankfully I see the same effect. |
Damned, no. I'm getting 10.00 with |
Thanks for confirming. |
I just saw this PR. I opened #4178 a couple days ago for the same issue. Do we need to instantiate a new checker? If we could use the existing one in the linter object, it's already configured properly. |
Hello, reading the discussion around this MR it seems the problem lie in the option parsing being internal to the checker and refactoring that aspect so the option are parsed once and then injected into the checkers would help to fix this issue easily, is that right ? |
Option parsing does not look internal to the checker: the checker only expose an When parsed, the options are injected to the checker via a Problem is, the @classmethod
def reduce_map_data(cls, linter, data):
"""Reduces and recombines data into a format that we can report on
The partner function of get_map_data()"""
recombined = SimilarChecker(linter)
... when we create another Discussions above showed three ways to find the "option-aware"
I'm not fan of the first one, mine, it really looks tangled. I'm not fan of the 2nd one neither, it does the same thing, it's shorter, but using a private attribute. I'm not fan of turning reduce to a normal method: Having But we could still use a normal method and create a clean new instance of Similar from here, let me try that. |
Hum, it looks similar from what I did in here finally, diff from current PR and proposed variation: - @classmethod
- def reduce_map_data(cls, linter, data):
+ def reduce_map_data(self, linter, data):
"""Reduces and recombines data into a format that we can report on
The partner function of get_map_data()"""
recombined = SimilarChecker(linter)
- checker = [c for c in linter.get_checkers() if c.name == cls.name][0]
- recombined.min_lines = checker.min_lines
+ recombined.min_lines = self.min_lines
recombined.open()
Similar.combine_mapreduce_data(recombined, linesets_collection=data)
recombined.close() On both cases we use a clean, new |
I prefer the version without |
This MR fixes the checker's internal config. It doesn't fix the issue where
the code-quality numbers differ (in the final report).
So, the final reporting needs to gather (aka reduce) across each job's
linters (not just checkers) in mutli-proc mode, after the checker reduce
has happened. It's really opaque to me how to do that.
…On Tue, 8 Jun 2021 at 07:47, Pierre Sassoulas ***@***.***> wrote:
Hello, reading the discussion around this MR it seems the problem lie in
the option parsing being internal to the checker and refactoring that
aspect so the option are parsed once and then injected into the checkers
would help to fix this issue easily, is that right ?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#4175 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAQP37H7ROBBHKC5WWSPP7LTRW4J3ANCNFSM4YPFNOLQ>
.
|
True, here's how I can reproduce it: mkdir mdk-test
cd mdk-test # This is to avoid using the pylint from pylint
cp ../pylint/constants.py ./file1.py
cp ../pylint/constants.py ./file2.py
pylint file1.py file2.py # Gives 8.16/10, similar lines found
pylint --min-similarity-lines=999 file1.py file2.py # Gives 8.42/10, no similar lines found
pylint --jobs 4 file1.py file2.py # 8.42/10 (bad), similar lines found (good)
pylint --jobs 4 --min-similarity-lines=999 file1.py file2.py # 8.42/10 (good), no similar line (good) In I'm slowly understanding it, in |
OK I was able to fix the stats issue: the stats generated during the reduce phase was left alone, not merged with all other stats. |
I'm trying to write the tests, looks like there's still an issue with the _merge_stats. |
Finally understood the stats merge issue \o/ I'm happy with this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This look like a nice change (the test line added to line changed ratio is through the roof 🎺 🎉 ). I have a small comment about the changelog and a hard time to understand the test but for the latter that's probably on me 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I can understand :( The idea of the test is to check that running with So it focus on testing the score, not the I can work on a test dedicated to |
(2, 10, 3), | ||
], | ||
) | ||
def test_map_reduce(self, num_files, num_jobs, num_checkers): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably, and more neatly, either:
- merge this with my original (awful?) code (put
linter.register_checker(ExtraParallelTestChecker(linter))
next tolinter.register_checker(ExtraSequentialTestChecker(linter))
- create a shared function that accepts the
ExtraSequentialTestChecker
/ExtraParallelTestChecker
types as params?
Either way, I apologize for writing the code this test is based on, it was the only way I could think to do it in the time I had.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging both tests, that does almost the same thing, would take less lines, but more complicated.
The first one can pass while the second can fail (before this PR), if we merge them I fear it'll add complexity while debugging the day it'll fail again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would something like this be more readable:
file_infos = _gen_file_datas(num_files)
checkers = [
ParallelTestChecker,
ExtraParallelTestChecker,
ThirdParallelTestChecker,
][:num_checkers]
# Establish the baseline:
linter = PyLinter(reporter=Reporter())
for checker in checkers[:num_checkers]:
linter.register_checker(checker(linter))
assert linter.config.jobs == 1, "jobs>1 are ignored when calling _check_files"
linter._check_files(linter.get_ast, file_infos)
stats_single_proc = linter.stats
# Run the same in parallel:
linter = PyLinter(reporter=Reporter())
for checker in checkers[:num_checkers]:
linter.register_checker(checker(linter))
check_parallel(linter, jobs=num_jobs, files=file_infos, arguments=None)
stats_check_parallel = linter.stats
assert (
stats_single_proc["by_msg"] == stats_check_parallel["by_msg"]
), "Single-proc and check_parallel() should return the same thing"
? It's 9 lines shorter, mostly due to the reduced indentation that allow to use longer lines.
pylint/checkers/similar.py
Outdated
"""Reduces and recombines data into a format that we can report on | ||
|
||
The partner function of get_map_data()""" | ||
recombined = SimilarChecker(linter) | ||
recombined.min_lines = self.min_lines # Copy down relevant options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way it is probably needed to add other options:
recombined.ignore_comments = self.ignore_comments
recombined.ignore_docstrings = self.ignore_docstrings
recombined.ignore_imports = self.ignore_imports
recombined.ignore_signatures = self.ignore_signatures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's focus on one issue at a time.
(no I'm joking! Thanks for the carefull review, all, it's heart warming.)
As far as I understand it's not needed, because the ignore_* are handled during file collection, which happen on the "worker processes", before this recombination step.
Then filtered lines are given back to the "main process" which (in this "recombined" step) actually search for duplicates (hence the need for min_lines
here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JulienPalard in the MR #4565 i had to add those lines in order to retrieve the same options in each process. But maybe i'm mistaken...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understood, reduce_map_data is called only once from the "main" process to aggregate already collected lines from sub-processes. The ignore_* had already been taken care of by the subprocess: they just didn't gathered those lines.
Maybe in your new algorithms, the sub-process are ignoring ignore_*, and you handle them in the main process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we merge this MR and keep that in mind when we rebase on master for #4565 ? Or maybe add a test for ignore_comments, docstrings, imports or signatures consistency with multiprocessing right now to be sure ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Pierre-Sassoulas i won't be able to investigate this before this weekend at best. So I think we could merge this as is. If @JulienPalard wants to add tests for ignore options it can be interesting but it is not mandatory for this PR which is "min-lines" oriented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know when I'll have free time in the near future.
Anyway if #4565 pass the ignore_* parameter that may not be usefull it does not break anything, so I have nothing against passing them. It reduces surprise and break nothing: it's not bad (and could even become usefull in the future, who knows).
We merged #4565 today, where the changes were already introduced in main. We're going to essentially check the result with the new tests here, which is nice. |
Steps
doc/whatsnew/<current release.rst>
.Description
It closes #4118 by passing the option to the merger instance (the
recombined.min_lines = self.min_lines
part).While working on it @doublethefish spotted the score was wrong: errors raised during the merge were not accounted for, it is now (the
linter.stats = _merge_stats([linter.stats] + all_stats)
part).I tried to implement a test to avoid regression, I'm not fluent with pylint internals yet, reviews welcome ♥
Type of Changes
Related Issue
Closes #4118