Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent multiple collecting of the same file due to recursive symlinks #7355

Closed

Conversation

aklajnert
Copy link
Contributor

@aklajnert aklajnert commented Jun 12, 2020

Fixes: #624.

The following directory structure will now be collected correctly.

.
├── link -> .
└── test_recursive.py

Copy link
Member

@bluetech bluetech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @aklajnert,

  1. pytest already has a mechanism to handle duplicate files, see FSCollector._collectfile. I think we should enhance it handle symlinks correctly, instead of handling that separately.

  2. A single readlink is actually not enough, as it is possible to have symlinks to symlinks, etc. So I recommend using pathlib.Path.resolve. There may be similar functions in py.path.local or os.path, I haven't checked.

@aklajnert
Copy link
Contributor Author

Hi @bluetech, thanks for the hints. Fair point about the symlink to symlink case, I'll work on it later.

@aklajnert
Copy link
Contributor Author

Hi @bluetech, I've looked at the FSCollector._collectfile, but it doesn't seem to be a good fit for this task, as it is responsible for collecting files, not directories. I can't see any other place suitable to filter out the seen directories.

The symlink to symlink case has been fixed and covered by tests.

@bluetech
Copy link
Member

Given this directory tree

rec/
├── link -> .
├── sub
│   └── test_foo.py
└── test_recursive.py

I tried 3 versions:


Current master:

rec/test_recursive.py .[  1%]
rec/link/test_recursive.py .[  2%]
rec/link/link/test_recursive.py .[  3%]
rec/link/link/link/test_recursive.py . [  4%]
rec/link/link/link/link/test_recursive.py . [  6%]
... [snipped] ...

i.e. the catastrophic behavior.


This PR (commit " Remove the hasattr() check "):

rec/test_recursive.py . [ 25%]
rec/link/test_recursive.py . [ 50%]
rec/link/sub/test_foo.py . [ 75%]
rec/sub/test_foo.py . [100%]

Looks like it still collects the files more than once (twice in this scenario).


This patch:

diff --git a/src/_pytest/config/__init__.py b/src/_pytest/config/__init__.py
index 4ff6ce707..62213f27a 100644
--- a/src/_pytest/config/__init__.py
+++ b/src/_pytest/config/__init__.py
@@ -322,7 +322,7 @@ class PytestPluginManager(PluginManager):
         self._conftestpath2mod = {}  # type: Dict[Any, object]
         self._confcutdir = None  # type: Optional[py.path.local]
         self._noconftest = False
-        self._duplicatepaths = set()  # type: Set[py.path.local]
+        self._duplicatepaths = set()  # type: Set[str]
 
         self.add_hookspecs(_pytest.hookspec)
         self.register(self)
diff --git a/src/_pytest/nodes.py b/src/_pytest/nodes.py
index 4c7aa1bcd..b085fca67 100644
--- a/src/_pytest/nodes.py
+++ b/src/_pytest/nodes.py
@@ -595,10 +595,11 @@ class FSCollector(Collector):
             keepduplicates = self.config.getoption("keepduplicates")
             if not keepduplicates:
                 duplicate_paths = self.config.pluginmanager._duplicatepaths
-                if path in duplicate_paths:
+                resolved_path = str(Path(str(path)).resolve())
+                if resolved_path in duplicate_paths:
                     return ()
                 else:
-                    duplicate_paths.add(path)
+                    duplicate_paths.add(resolved_path)
 
         return ihook.pytest_collect_file(path=path, parent=self)  # type: ignore[no-any-return] # noqa: F723
 

result:

rec/test_recursive.py . [ 50%]
rec/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/sub/test_foo.py . [100%]

This one only collects once, but has a recursion issue.


So it seems that the _collectfile does work, but needs some fix, if it's possible.

@aklajnert
Copy link
Contributor Author

Thanks @bluetech. I've noticed that the double collection was gone from my tests after I handled symlink to symlink case, but it seems that my test case isn't as good as I thought since the issue is still there.

@aklajnert aklajnert force-pushed the 624_prevent_recursive_symlinks branch from fc3c3b1 to c875abe Compare June 17, 2020 17:17
@aklajnert
Copy link
Contributor Author

Hi @bluetech, I've changed the _collectfile() method to remember the absolute path with resolved symlinks. That does the trick with collecting a test multiple times but doesn't solve the performance problem.

Here's the output with my changes in _recurse() method commented out:

collected 1 item

test_foo.py .                                                                                 [100%]

=================================== 1 passed in 92.61s (0:01:32) ====================================

With changes in _recurse() in place:

collected 1 item

test_foo.py .                                                                                 [100%]

========================================= 1 passed in 0.19s =========================================

As you see, the difference in execution time is huge. I've tried to profile it - it seems like too many collected directories are causing the slowdown (cProfile snapshots: snapshots.zip).

Basically, because 40 directories were collected, the isfile() method has been called 2024 times which took 67% of the execution time. The _getconftestmodules() method could use some optimization (maybe os.walk() could help here?), but I'm not sure if that's the scope of this PR.

@aklajnert aklajnert requested a review from bluetech July 3, 2020 06:00
@bluetech
Copy link
Member

bluetech commented Jul 7, 2020

Hi @aklajnert, thanks for the update, I think I see why you changed _recurse now. I'll clear some time to understand the collection code more so I can evaluate the PR better.

Is there a chance you'll be interested to do a little refactoring in this area as well? In particular what you said "maybe os.walk() could help here?" -- I think we should really try to move away from py.path.local's visit methods which are quite hard to follow with all the callbacks. And we can also speed it up by switching to os.scandir instead of the os.listdir which py is using internally.

I imagine we can't use os.walk itself but can do something like this:

  1. First commit: Import py's visit implementation as-is to _pytest.pathlib as a standalone function, switch all code to use that, make sure all tests pass.
  2. Second commit: simplify the implementation as much as possible -- dropping Python 2 support, dropping unused arguments, stuff like that.
  3. Third commit: switch it to use os.scandir and avoid extra syscalls as much as possible (use fields from the os.DirEntry instead).
  4. Profit!

@nicoddemus
Copy link
Member

While I definitely agree with @bluetech's suggestion of moving away from os.path.local.visit, I think this is better done in a separate PR, so it is easier to review the changes which are independent from the original PR.

@aklajnert
Copy link
Contributor Author

Hi @bluetech - I'll be happy to work on the refactor, but I agree with @nicoddemus that this should be done in a separate PR. I'll work on it when I find some time (probably within a week or two).

@bluetech
Copy link
Member

bluetech commented Jul 8, 2020

I still haven't dug into the collection code but it occurs to me that the strategy of completely resolving all symlinks and deduplicating them can be problematic. Consider for example the following test tree:

$ tree top/
top/
├── a
│   ├── conftest.py
│   ├── __init__.py
│   └── test_it.py
├── b
│   ├── conftest.py
│   ├── __init__.py
│   └── test_it.py -> ../a/test_it.py
└── __init__.py

With the following file contents:

# top/a/conftest.py
import pytest

@pytest.fixture
def conftest_fixture():
    return "a"
# top/b/conftest.py
import pytest

@pytest.fixture
def conftest_fixture():
    return "b"
# top/a/test_it.py
def test_it(conftest_fixture):
    print(conftest_fixture)

This symlinks the same file to multiple modules, but the outcome is different because module-level fixtures.

Before this PR the output is

$ pytest -s -q top/
a
.b
.
2 passed in 0.02s

After:

$ pytest -s -q top/
a
.
1 passed in 0.02s

This example is a little convoluted but I can imagine real use cases for this and it's seems legitimate.

@aklajnert
Copy link
Contributor Author

Hi @bluetech - fair point. The example indeed seems a bit convoluted, but I see it could be useful in some cases.
Maybe the change should just limit to directory symlinks? Or detect recursive symlink? Not sure how to do that without hurting the performance.

Btw, your comment seems a bit broken here, as if you have pasted something twice.

@bluetech
Copy link
Member

bluetech commented Jul 8, 2020

Oops, I fixed the comment now.

@aklajnert aklajnert force-pushed the 624_prevent_recursive_symlinks branch 2 times, most recently from 0702b25 to 479d048 Compare August 3, 2020 13:58
@aklajnert
Copy link
Contributor Author

Hi @bluetech - I'm sorry that I couldn't work on it earlier. I've tried to rebase and work on it today but I've noticed that you did the refactor mentioned here in #7541.
I also noticed that the recursive symlink problem is now gone in master (the latest release still has it). Not sure it was intentional or accidental, my guess is for the second option. As proof, I've removed all my changes but tests, and they are still passing. Interestingly - the performance problem mentioned in #624 is resolved in the released version but the recursive symlinks is fixed in master.

Now, I'm not sure if the PR should be removed completely, or should we address your observations from the comment above.

@bluetech
Copy link
Member

bluetech commented Aug 4, 2020

I'm sorry that I couldn't work on it earlier. I've tried to rebase and work on it today but I've noticed that you did the refactor mentioned here in #7541.

Oh right I should have mentioned it back here. I did it as part of the effort to remove py.path.local.

I also noticed that the recursive symlink problem is now gone in master (the latest release still has it). Not sure it was intentional or accidental, my guess is for the second option.

Yes I don't remember any change that intentionally changes things here, but presumably it is #7541. I will check this when I get the chance.

Interestingly - the performance problem mentioned in #624 is resolved in the released version but the recursive symlinks is fixed in master.

Maybe related to #5965 / #6523? I procrastinated on this PR over checking what changed in that PR.

As proof, I've removed all my changes but tests, and they are still passing.
...
Now, I'm not sure if the PR should be removed completely, or should we address your observations from the comment above.

In light of your comments I will also check what happens in the scenario I described in the comment in master. Depending on that we can merge your tests to solidify that they won't regress.

Base automatically changed from master to main March 9, 2021 20:40
@aklajnert
Copy link
Contributor Author

Hi @bluetech - I've just noticed that this is still open. Should I close it, or you'd like to merge these tests?

@bluetech
Copy link
Member

Hi @aklajnert, I obviously dropped the ball here and I don't even remember the exact details anymore, sorry about that.

Looking at the tests, they do look good to me (if they do indeed pass in main...). So if you could:

  • Rebase them on latest main
  • Convert testdir to pytester (testdir is now soft-deprecated)
  • See that PR is green

Then I think we can merge this.

@aklajnert aklajnert force-pushed the 624_prevent_recursive_symlinks branch 5 times, most recently from 8dabbae to 4e43db6 Compare June 15, 2021 05:52
@aklajnert
Copy link
Contributor Author

@bluetech - seems like the bug is back according to my tests, or I've messed something up. I don't have time to investigate it now, but I'll look into that later.

@aklajnert aklajnert force-pushed the 624_prevent_recursive_symlinks branch from 4e43db6 to 0b527fa Compare October 10, 2021 14:03
@aklajnert aklajnert closed this Dec 10, 2022
@aklajnert aklajnert deleted the 624_prevent_recursive_symlinks branch December 10, 2022 08:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

recursive symlink significantly slows collection
3 participants