
Similar check: Passe min_lines to recombined. #4175

Merged
merged 1 commit into from Jul 28, 2021

Conversation

JulienPalard
Contributor

@JulienPalard JulienPalard commented Mar 2, 2021

Steps

  • Add yourself to CONTRIBUTORS if you are a new contributor.
  • Add a ChangeLog entry describing what your PR does.
  • If it's a new feature or an important bug fix, add a What's New entry in doc/whatsnew/<current release.rst>.
  • Write a good description of what the PR does.

Description

It closes #4118 by passing the option to the merger instance (the recombined.min_lines = self.min_lines part).

While working on it, @doublethefish spotted that the score was wrong: errors raised during the merge were not accounted for; they now are (the linter.stats = _merge_stats([linter.stats] + all_stats) part).

I tried to implement a test to avoid regressions. I'm not fluent with pylint internals yet, so reviews are welcome ♥

Type of Changes

Type
🐛 Bug fix

Related Issue

Closes #4118

@coveralls

coveralls commented Mar 2, 2021

Coverage Status

Coverage remained the same at 92.037% when pulling bf7282a on JulienPalard:mdk/similar-min-lines into 655a9bf on PyCQA:master.

@doublethefish
Contributor

This does solve the issue at hand, but it's a bit specific to this case.

A couple of points:

  • a test would be nice
  • it only fixes min_similar_lines and not the other configs for the SimilarChecker
  • other map/reduce checkers would need to parse out their parameters in a similar way

I’m not 100% sure, but it seems to me that the checker needs to be constructed in a way that lets the linter set the config on it.

@JulienPalard
Contributor Author

* it only fixes min_similar_lines and not the other configs for the SimilarChecker

I don't think other parameters are used in the reduce step, but again, it's really specific to SimilarChecker.

* other map/reduce checkers would need to parse out their parameters in a similar way

Which I would not recommend, it's not that readable...

I’m not 100% sure, but it seems to me that the checker needs to be constructed in a way that lets the linter set the config on it.

Or should reduce_map_data be made an instance method instead of a classmethod, so it can directly access the config?

@doublethefish
Contributor

Or should reduce_map_data be made an instance method instead of a classmethod, so it can directly access the config?

I thought the same thing, revisiting the code. But it, unfortunately, makes no difference and I do not mind admitting that that baffles me. I must be missing something (?).

I don't think other parameters are used in the reduce step, but again, it's really specific to SimilarChecker.

You are correct, the other settings are (currently) used when the lines are collected, not when analysed. 👍

This patch does fix this specific issue, but it masks the deeper issue. That said, I strongly suspect that it might be more useful to users of pylint to get the fix in, despite the general performance problem with MJ and the masking. There should be another way to demonstrate the deeper problem anyway.

@JulienPalard
Contributor Author

JulienPalard commented Mar 3, 2021

I thought the same thing, revisiting the code. But it, unfortunately, makes no difference and I do not mind admitting that that baffles me. I must be missing something (?).

I tested your branch with a test project:

(cd /tmp/testproj/; pylint --min-similarity-lines=2000 testproj)

vs

(cd /tmp/testproj/; pylint --min-similarity-lines=2 testproj)

The first one does not report, while the second one does, so I think your fix is good but not your test, which I haven't read yet.

I also tried with -j 1, -j 2, and -j 16, and had no issue, so it looks fixed from the command line but not from a config file?

@doublethefish
Contributor

Ah, thank you. I misspoke when I said "makes no difference": it does improve things, fixing the similarities report as you say.

Do you get the same quality scores in both cases? I am seeing there are still differences between single and MJ.

The tests on that branch, which admittedly are a bit OTT, use a config file to set the various settings for the runs. Based on your manual runs, I have added some that use CLI args as well. Thankfully I see the same effect.

@JulienPalard
Contributor Author

Do you get the same quality scores in both cases? I am seeing there are still differences between single and MJ.

Damned, no. I'm getting 10.00 with -j 20 and 9.98 with -j 1. I have no idea how this is computed yet, and I need to go back to $DAYJOB for now.

@doublethefish
Contributor

Thanks for confirming.

@shvenkat
Contributor

shvenkat commented Mar 4, 2021

I just saw this PR. I opened #4178 a couple days ago for the same issue.

Do we need to instantiate a new checker? If we could use the existing one in the linter object, it's already configured properly.

@Pierre-Sassoulas
Member

Hello, reading the discussion around this MR, it seems the problem lies in the option parsing being internal to the checker. Refactoring that aspect so the options are parsed once and then injected into the checkers would make this issue easy to fix. Is that right?

@JulienPalard
Contributor Author

option parsing being internal to the checker and refactoring that aspect so the option are parsed once and then injected into the checkers

Option parsing does not look internal to the checker: the checker only exposes an options class attribute which is used for external parsing, which is good design IMHO.

When parsed, the options are injected into the checker via a set_option callback, which is not bad either.
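To illustrate the pattern just described (a simplified sketch with hypothetical names, not pylint's actual BaseChecker API): the checker declares an options attribute that an external parser reads, and the parser pushes parsed values back via a set_option callback. A freshly constructed instance only knows the declared defaults:

```python
class Checker:
    """Hypothetical sketch: declarative option spec plus a
    set_option callback used by an external option parser."""

    options = (("min-similarity-lines", {"type": int, "default": 4}),)

    def __init__(self):
        # Initialize attributes from the declared defaults.
        for optname, spec in self.options:
            setattr(self, optname.replace("-", "_"), spec["default"])

    def set_option(self, optname, value):
        # Called by the external parser, once, on one instance.
        setattr(self, optname.replace("-", "_"), value)


configured = Checker()
configured.set_option("min-similarity-lines", 2000)  # option-aware instance
naive = Checker()  # a fresh instance only knows the defaults
print(configured.min_similarity_lines, naive.min_similarity_lines)  # -> 2000 4
```

This is why constructing a second SimilarChecker inside reduce_map_data loses the user's configuration: nothing ever calls set_option on the new instance.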

The problem is that the set_option callback implies there's a single instance of the checker aware of the options, so in the current:

@classmethod
def reduce_map_data(cls, linter, data):
    """Reduces and recombines data into a format that we can report on

    The partner function of get_map_data()"""
    recombined = SimilarChecker(linter)
    ...

when we create another SimilarChecker, it's a "naive" one, unaware of the options.

Discussions above showed three ways to find the "option-aware" SimilarChecker:

  • By getting the first one out of [c for c in linter.get_checkers() if c.name == cls.name], hoping for the best.
  • By getting the first one out of linter._checkers["similarities"], hoping for the best.
  • By making reduce_map_data an instance method instead of a classmethod, so self is the aware one.

I'm not a fan of the first one, mine: it really looks tangled.

I'm not a fan of the 2nd one either: it does the same thing and is shorter, but it uses a private attribute.

I'm not a fan of turning reduce into a normal method: having reduce_map_data be a classmethod instead of a method looks safer, as it cleanly isolates the steps. That looks like a good idea, since the first step of the merge is to call open, which itself destroys self.lineset: data could easily be lost.

But we could still use a normal method and create a clean new instance of Similar from there; let me try that.

@JulienPalard
Contributor Author

Hmm, in the end it looks similar to what I did here. Diff from the current PR to the proposed variation:

-    @classmethod
-    def reduce_map_data(cls, linter, data):
+    def reduce_map_data(self, linter, data):
         """Reduces and recombines data into a format that we can report on

         The partner function of get_map_data()"""
         recombined = SimilarChecker(linter)
-        checker = [c for c in linter.get_checkers() if c.name == cls.name][0]
-        recombined.min_lines = checker.min_lines
+        recombined.min_lines = self.min_lines
         recombined.open()
         Similar.combine_mapreduce_data(recombined, linesets_collection=data)
         recombined.close()

In both cases we use a clean, new SimilarChecker, and in both cases we copy the config.

@JulienPalard
Contributor Author

I prefer the version without checker = [c for c in linter.get_checkers() if c.name == cls.name][0], which looks like a bug magnet (why can there be more than one? what happens if we pick the wrong one? ...).

@Pierre-Sassoulas Pierre-Sassoulas added this to the 2.9.0 milestone Jun 8, 2021
@doublethefish
Contributor

doublethefish commented Jun 8, 2021 via email

@JulienPalard
Contributor Author

JulienPalard commented Jun 8, 2021

True, here's how I can reproduce it:

mkdir mdk-test
cd mdk-test  # This is to avoid using the pylint from pylint
cp ../pylint/constants.py ./file1.py
cp ../pylint/constants.py ./file2.py
pylint file1.py file2.py  # Gives 8.16/10, similar lines found
pylint --min-similarity-lines=999 file1.py file2.py  # Gives 8.42/10, no similar lines found
pylint --jobs 4 file1.py file2.py  # 8.42/10 (bad), similar lines found (good)
pylint --jobs 4 --min-similarity-lines=999 file1.py file2.py  # 8.42/10 (good), no similar line (good)

In PyLinter.stats, in the case of parallel runs, we're missing some data; for example, self.stats['by_msg']['duplicate-code'] is missing with --jobs 2 but present with --jobs 1.

I'm slowly understanding it: in parallel.py, _merge_stats gets the stats from the subprocesses, which do not yet know about the conflicting lines; we also need to merge in similar's map-reduced stats.
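The merge being described can be sketched like this (a simplified stand-in for pylint's _merge_stats, with made-up stats values): per-worker counters are summed, and the stats produced during the reduce phase in the main process must be folded in too, or messages like duplicate-code go missing:

```python
from collections import Counter


def merge_stats(all_stats):
    """Merge a list of stats dicts: sum the numeric counters and
    the per-message counts stored under 'by_msg'."""
    merged = {"by_msg": Counter()}
    for stats in all_stats:
        for key, value in stats.items():
            if key == "by_msg":
                merged["by_msg"].update(value)
            else:
                merged[key] = merged.get(key, 0) + value
    return merged


# The worker stats lack 'duplicate-code' because that message is
# only raised during the reduce step, in the main process; its
# stats therefore have to be merged in as well.
worker_stats = [
    {"statement": 50, "by_msg": {"unused-import": 1}},
    {"statement": 40, "by_msg": {}},
]
reduce_stats = {"by_msg": {"duplicate-code": 1}}
merged = merge_stats(worker_stats + [reduce_stats])
print(merged["by_msg"]["duplicate-code"])  # -> 1
```

Forgetting the reduce_stats entry reproduces the symptom from the comment above: by_msg differs between --jobs 1 and --jobs 2, and the score differs with it.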

@JulienPalard
Contributor Author

OK, I was able to fix the stats issue: the stats generated during the reduce phase were left alone, not merged with all the other stats.

@JulienPalard
Contributor Author

I'm trying to write the tests; it looks like there's still an issue with _merge_stats.

@JulienPalard
Contributor Author

Finally understood the stats merge issue \o/

I'm happy with this PR.

Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment


This looks like a nice change (the ratio of test lines added to lines changed is through the roof 🎺 🎉). I have a small comment about the changelog and a hard time understanding the test, but the latter is probably on me 😄

Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment


👍

@JulienPalard
Contributor Author

a hard time to understand the test

I can understand :( The idea of the test is to check that running with --jobs 1 gives the same result as --jobs n on a checker that raises messages during the reduce step (in the main process) instead of raising them from the worker processes.

So it focuses on testing the score, not the min_lines parameter, and doesn't fail if we re-introduce the min_lines bug, which is not good... ☹ (It does fail if we break the "score propagation" fix, though.)

I can work on a test dedicated to the min_lines+jobs bug, but I need to sleep first ;)

(2, 10, 3),
],
)
def test_map_reduce(self, num_files, num_jobs, num_checkers):
Contributor


You could probably, and more neatly, either:

  1. merge this with my original (awful?) code (put linter.register_checker(ExtraParallelTestChecker(linter)) next to linter.register_checker(ExtraSequentialTestChecker(linter)))
  2. create a shared function that accepts the ExtraSequentialTestChecker/ExtraParallelTestChecker types as params?

Either way, I apologize for writing the code this test is based on; it was the only way I could think of to do it in the time I had.

Contributor Author


Merging both tests, which do almost the same thing, would take fewer lines, but it would be more complicated.

The first one can pass while the second can fail (before this PR); if we merge them, I fear it will add complexity when debugging the day it fails again.

Contributor Author


Would something like this be more readable:

file_infos = _gen_file_datas(num_files)

checkers = [
    ParallelTestChecker,
    ExtraParallelTestChecker,
    ThirdParallelTestChecker,
][:num_checkers]

# Establish the baseline:
linter = PyLinter(reporter=Reporter())
for checker in checkers:
    linter.register_checker(checker(linter))
assert linter.config.jobs == 1, "jobs>1 are ignored when calling _check_files"
linter._check_files(linter.get_ast, file_infos)
stats_single_proc = linter.stats

# Run the same in parallel:
linter = PyLinter(reporter=Reporter())
for checker in checkers:
    linter.register_checker(checker(linter))
check_parallel(linter, jobs=num_jobs, files=file_infos, arguments=None)
stats_check_parallel = linter.stats

assert (
    stats_single_proc["by_msg"] == stats_check_parallel["by_msg"]
), "Single-proc and check_parallel() should return the same thing"

? It's 9 lines shorter, mostly due to the reduced indentation, which allows using longer lines.

"""Reduces and recombines data into a format that we can report on

The partner function of get_map_data()"""
recombined = SimilarChecker(linter)
recombined.min_lines = self.min_lines # Copy down relevant options
Contributor


By the way, it is probably necessary to add the other options as well:

        recombined.ignore_comments = self.ignore_comments
        recombined.ignore_docstrings = self.ignore_docstrings
        recombined.ignore_imports = self.ignore_imports
        recombined.ignore_signatures = self.ignore_signatures

Contributor Author


Let's focus on one issue at a time.

(No, I'm joking! Thanks for the careful review, all; it's heartwarming.)

As far as I understand, it's not needed, because the ignore_* options are handled during file collection, which happens in the "worker processes", before this recombination step.

Then the filtered lines are given back to the "main process", which (in this "recombined" step) actually searches for duplicates (hence the need for min_lines here).
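A toy sketch of that division of labor (illustrative only, not pylint's actual Similar implementation): workers apply the ignore_* filters while collecting lines, so the duplicate search in the main process only needs min_lines:

```python
def map_collect(lines, ignore_comments=True):
    """Worker side: collect lines, applying ignore_* filtering here,
    before anything is sent back to the main process."""
    return [
        line for line in lines
        if not (ignore_comments and line.lstrip().startswith("#"))
    ]


def reduce_find_duplicates(linesets, min_lines):
    """Main-process side: search two collected linesets for runs of
    at least min_lines identical lines; ignore_* no longer matters."""
    a, b = linesets
    hits = []
    for i in range(len(a) - min_lines + 1):
        window = a[i:i + min_lines]
        for j in range(len(b) - min_lines + 1):
            if b[j:j + min_lines] == window:
                hits.append((i, j))
    return hits


f1 = ["# a comment, dropped by the worker", "x = 1", "y = 2", "z = 3"]
f2 = ["x = 1", "y = 2", "z = 3"]
collected = [map_collect(f1), map_collect(f2)]
print(reduce_find_duplicates(collected, min_lines=3))  # -> [(0, 0)]
```

With min_lines=999 the same call reports nothing, which is the behavior the repro commands above exercise.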

Contributor


@JulienPalard in MR #4565 I had to add those lines in order to retrieve the same options in each process. But maybe I'm mistaken...

Contributor Author


As far as I understood, reduce_map_data is called only once, from the "main" process, to aggregate the lines already collected by the sub-processes. The ignore_* options have already been taken care of by the subprocesses: they just didn't gather those lines.

Maybe in your new algorithm the sub-processes ignore the ignore_* options, and you handle them in the main process?

Member


Should we merge this MR and keep that in mind when we rebase on master for #4565? Or maybe add a test for ignore_comments, docstrings, imports, or signatures consistency with multiprocessing right now, to be sure?

Contributor


@Pierre-Sassoulas I won't be able to investigate this before this weekend at best, so I think we could merge this as is. If @JulienPalard wants to add tests for the ignore options, that could be interesting, but it is not mandatory for this PR, which is "min-lines" oriented.

Contributor Author


I don't know when I'll have free time in the near future.

Anyway, if #4565 passes the ignore_* parameters even though they may not be useful, it does not break anything, so I have nothing against passing them. It reduces surprise and breaks nothing: it's not bad (and could even become useful in the future, who knows).

@Pierre-Sassoulas
Member

We merged #4565 today, where the changes were already introduced in main. We're going to essentially check the result with the new tests here, which is nice.

@Pierre-Sassoulas Pierre-Sassoulas merged commit c00b07b into pylint-dev:main Jul 28, 2021
@JulienPalard JulienPalard deleted the mdk/similar-min-lines branch September 2, 2021 10:18
Labels
Bug 🪲 Checkers Related to a checker
Development

Successfully merging this pull request may close these issues.

Pylint 2.7.0 seems to ignore the min-similarity-lines setting