Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monorepo support using groups and path dependencies #6850

Closed
adriangb opened this issue Oct 20, 2022 · 27 comments · Fixed by #6845
Closed

Monorepo support using groups and path dependencies #6850

adriangb opened this issue Oct 20, 2022 · 27 comments · Fixed by #6845
Labels
kind/feature Feature requests/implementations status/triage This issue needs to be triaged

Comments

@adriangb
Copy link
Contributor

adriangb commented Oct 20, 2022

Following up on https://discord.com/channels/487711540787675139/1032534656777715732/1032549486230253580

Summary: I think we can make some relatively simple and universally (as in not only beneficial to monorepos) changes to the existing path dependencies that will make them much easier to use for monorepos. There are also some more complex feature requests to handle but I think those can be tabled for future feature work.

I had to migrate a Python monorepo at work from an ad-hoc structure to something with a more explicit separation of applications and components. I explored several options including monorepo build tools (Bazel, Pants, etc.) and Python package managers (Hatch, PDM and Poetry). I won't get into the details but the TLDR is that the build tools are too complex for a simple Python-only monorepo and the other package managers have missing features. There are several ways to structure monorepos with Poetry, but after surveying a lot of blog posts, issues and plugins I came across this one: https://github.com/martinxsliu/poetry_workspace_plugin (@martinxsliu if you're out there thank you and feel free to chime in). While I don't plan on using that Plugin necessarily (I don't need most of the functionality it provides) it had some interesting ideas. The main thing I changed was using Poetry's existing group feature to organize the path dependencies to projects so that even without a plugin you can get basic functionality.

The good

So the final pattern I landed on is this: https://github.com/adriangb/python-monorepo/tree/main/poetry

TLDR is:

# /pyproject.toml
[tool.poetry.group.app.dependencies]
namespace-app = { path = "workspaces/app", develop = true}

[tool.poetry.group.lib.dependencies]
namespace-lib = { path = "workspaces/lib", develop = true}

[tool.poetry.group.lib-test.dependencies]
namespace-lib = { path = "workspaces/lib", develop = true, extras=["test"]}

# /{workspace,projects}/app/pyproject.toml
[tool.poetry.dependencies]
namespace-lib = { path = "../lib"}

So you can then do:

poetry install --only app

Which will install app, lib and any 3rd party dependencies!

The bad

poetry lock --no-update doesn't work

This is alluded to in the plugin linked to above: https://github.com/martinxsliu/poetry_workspace_plugin/blob/master/poetry_workspace/plugin.py#L83-L107

Even worse: if you delete any path dependencies (delete the folder, delete it from pyproject.toml) poetry lock --no-update straight up crashes because DirectoryDependency is used to load the lock file (which still has a reference to the deleted dependency) and unlike VCS and other dependencies it validates that the path exists in __init__. This means your options are to (a) run poetry lock and update all 3rd party dependencies (b) delete it from pyproject.toml first, run poetry --lock-update, delete the folder and run poetry lock --no-update again) or (c) edit poetry.lock before running poetry lock --no-update. Especially with --no-update becoming the default at some point, this is problematic.

I've submitted a couple of PRs to fix this:

No caching in Docker builds because source is required to install 3rd party dependencies

When you're building a Dockerfile that uses Poetry it's best to use --no-root to install 3rd party dependencies first (which only requires a pyproject.toml and a poetry.lock lockfile) so that the layer which installs 3rd party dependencies gets cached across builds (it's often both the slowest and least frequently changed step). This breaks down if you have any path dependencies because you'll need to copy them over in order to install 3rd party dependencies (even using groups there's no way to skip the path dependencies but install its transitive 3rd party dependencies).

I opened #6845 as a way to address this. I'm not sure if --no-path is the best way; I'm open to other options, but the basic idea is there.

No way to build sdists/wheels for projects with path dependencies

In my example repo you can't run poetry build on cli because it depends on lib. One might want to do this if e.g. you have a backend and a client you are publishing on PyPi that share a common component/library. There are several open issues around this, which I won't get into detail on. My thoughts is that it would be nice to have a feature like namespace-lib = { path = "../lib", wheel = { version = "~={}.{}.{}"} } where the {} in version means "insert the current version" or namespace-lib = { path = "../lib", wheel = { target = "path/inside/wheel" } }. I think this quite complex of a topic that needs more thought. Even Cargo hasn't fully figured this out since they don't support inheriting the version from workspace dependencies (their overrides feature may be of interest but is really beyond the scope of this issue). I think this can be solved but I would table it for now since it needs more thought and will likely be more invasive to the codebase.

#1168 also is related to this particular use case and might be an alternative path forward.

No support for dependency groups in path dependencies

It would be interesting to have some special handling for Poetry path dependencies so that we can "propagate" groups from path dependencies to workspace/root projects (for example, choosing to install the test group in a path dependency). I think this can be done but I don't think it should be a blocker for other stuff, also going to table this for now.

The ugly

Boilerplate

There's a lot boilerplate involved in the whole groups and path deps thing, especially if you get into listing extras. I think this is where a plugin might shine: it could completely replace these sections with a [tool.workspaces] section or just serve to keep them in sync with the filesystem (to avoid manually writing the boilerplate).

@adriangb adriangb added kind/feature Feature requests/implementations status/triage This issue needs to be triaged labels Oct 20, 2022
@adriangb
Copy link
Contributor Author

cc @finswimmer because you asked for an issue on Discord 😃

@1oglop1
Copy link

1oglop1 commented Oct 21, 2022

@adriangb Do you think your solution is somehow similar to this? https://github.com/dermidgen/python-monorepo

@adriangb
Copy link
Contributor Author

adriangb commented Oct 21, 2022

They are similar but not equivalent because:

  • That example has multiple lock files. Having a single lock file is a key part of a functional monorepo setup
  • It is somehow replacing local dependencies for published ones. I use the local dependencies in production. I think being able to swap them out is a valuable feature but as I called out in the OP I think it should be treated separately.

@neersighted
Copy link
Member

Copying comments from #6845 (comment) and #6845 (comment)


[...] I don't think path dependencies should be special at all; encouraging users to think of them as 'special' takes us down a path we have strictly wanted to avoid (e.g. leaking metadata beyond the Core Metadata spec between layers of the dependency tree because Poetry is used all the way down), and reduces the interoperability of Poetry with the rest of the ecosystem.

I think [...] a workspace-like approach like suggested in #2270 is what we should explore, instead of trying to tear down the Core Metadata/PEP 517 wall between Poetry projects.

[...]
While I brought up the overall direction for better supporting monorepos as kind of a extension of my thinking re: this PR, I did not realize that is at the core of what you want to do.

I think that makes my objection firmer -- special support for using path dependencies in this way (with a superproject that delegates to subprojects) is not desirable because it is ultimately incomplete unless we tear down the metadata barrier, and all the designs along that line have been rejected to date. It's possible we decide that the subproject approach is not tenable and go that way after all, but I think it would have to be comprehensive and not piecemeal. It would probably be better if it was a new dependency type if that were the case, instead of overloading path =, but I digress.


This is effectively in reply to

No support for dependency groups in path dependencies

It would be interesting to have some special handling for Poetry path dependencies so that we can "propagate" groups from path dependencies to workspace/root projects (for example, choosing to install the test group in a path dependency). I think this can be done but I don't think it should be a blocker for other stuff, also going to table this for now.

This special handling (as proposed in this issue and as implied by the related PRs) has been explicitly rejected as undesirable up to this point, and while a seemingly obvious direction for extension, directly restricts and contradicts efforts to keep Poetry as interoperable as possible. When introducing net-new functionality into the ecosystem, we've generally tried to be very methodical as we will have to maintain it for a long time (even if the design is later shown to be flawed), and it will create confusion for our users (e.g. see Poetry's advanced source handling, which is far beyond that of any other tool, and frequently a source of confusion).

As I said in my comments above, I don't think it's impossible we go that direction, but I do think we're going to draw the line at implementing anything intended to encourage that pattern until a comprehensive design is proposed and accepted.

@adriangb
Copy link
Contributor Author

First an update on the situation in the OP.

poetry lock --no-update doesn't work

Several PRs have since addressed this: #6843, #6844 and python-poetry/poetry-core#520. This is now completely solved and a non-issue.

No caching in Docker builds because source is required to install 3rd party dependencies

This still needs something like #6845 if we want to get good docker caching.

Boilerplate

I think this can easily be solved either with a plugin that reduces boilerplate by internally expanding each entry in some [tool.workspaces] into a dependency group and path dependency or with a CLI that keeps the boilerplate in sync.

No support for dependency groups in path dependencies

I don't think this is that big of a deal. I've had good luck using the tried and true pattern of making a test extra in each subproject and then having a {subproject} and {subproject}-test dependency groups where the latter uses the test extra from the subproject (example).

I do agree that if Poetry wanted to support this it should probably not bastardize the path dependency concept to achieve it. At that point it would make more sense to have a completely new type of "dependency" that is aware of the fact that the subproject is a Poetry project.

@adriangb
Copy link
Contributor Author

adriangb commented Jan 23, 2023

I don't think path dependencies should be special at all; encouraging users to think of them as 'special' takes us down a path we have strictly wanted to avoid (e.g. leaking metadata beyond the Core Metadata spec between layers of the dependency tree because Poetry is used all the way down), and reduces the interoperability of Poetry with the rest of the ecosystem.

The proposal in #6845 does not make any assumptions about the subprojects / path dependencies and they don't have to be Poetry projects for it to work. So I don't think there's any leaking of metadata because the subprojects are Poetry projects. All this proposal does is add a an argument to the poetry CLI; the CLI itself doesn't need to be interoperable with the rest of the ecosystem. So yes path dependencies are getting their own CLI argument (I guess one could call this "special" treatment) but I don't think this impacts Poetry's interoperability in any way.

It would be interesting to have some special handling for Poetry path dependencies so that we can "propagate" groups from path dependencies to workspace/root projects (for example, choosing to install the test group in a path dependency). I think this can be done but I don't think it should be a blocker for other stuff, also going to table this for now.

This special handling (as proposed in this issue and as implied by the related PRs) has been explicitly rejected as undesirable up to this point

I specifically said "let's table this for now" because I agree with you: this is more complicated and requires some design work. This issue and the associated PRs (of which only #6845 is left really) don't proposing doing anything like this.

When introducing net-new functionality into the ecosystem, we've generally tried to be very methodical as we will have to maintain it for a long time

As a maintainer I understand that every feature carries a burden of future maintenance. I specifically designed the implementation in #6845 to be as simple as possible: it is only a couple lines of relatively simple code (boilerplate for the CLI argument and a single logical statement). Most of the PR is documentation and tests. I don't think maintenance burden would be too bad even if Poetry decided to pursue some other implementation of "subprojects" down the road.

As I said in my comments above, I don't think it's impossible we go that direction, but I do think we're going to draw the line at implementing anything intended to encourage that pattern until a comprehensive design is proposed and accepted.

Part of the reason I'm pursuing this pattern is that it can be implemented mostly by fixing some existing bugs/weird behavior (most of which we have done) and implement some minor non-intrusive features (like the CLI argument in #6845). There have been several "comprehensive designs" proposed and they've all died at some point (#2270 has been open for nearly 3 years with no progress being made). I think there's a lot to be said for making some minor changes now, learning from real world usage and then using this feedback to inform any future design decision. As mentioned above #6845 is the main thing missing for this approach and if that needs to be maintained for a couple years and then deprecated once a new comprehensive design is proposed and accepted then it's a price worth paying to keep moving forward.

That said, just for clarify, what does it mean for a comprehensive deign to be proposed and approved? I've mostly seen PRs or issues being opened. Is there some sort of formal PEP-like process? Who oversees it or makes the decisions?

@adriangb
Copy link
Contributor Author

@radoering since you've worked on some of the related issues your input here would be very much appreciated

@mwgamble
Copy link

No caching in Docker builds because source is required to install 3rd party dependencies

I've been working around this by using touch to create a whole bunch of dummy files where the subprojects should be, just enough to make Poetry happy enough to get dependencies installed. It's verbose and ugly (each subproject must have multiple dummy files created for it), and I can't imagine my colleagues will be particularly happy with it :(

No way to build sdists/wheels for projects with path dependencies

I've been using this plugin to build wheels for projects with path dependencies:

https://github.com/artisanofcode/poetry-stickywheel-plugin/

It's been working great so far, but as the name suggests, it doesn't work at all for sdists:

artisanofcode/poetry-stickywheel-plugin#10

For my current use case that's fine, wheels are good enough.

@adriangb
Copy link
Contributor Author

No caching in Docker builds because source is required to install 3rd party dependencies

I've been working around this by using touch to create a whole bunch of dummy files where the subprojects should be, just enough to make Poetry happy enough to get dependencies installed. It's verbose and ugly (each subproject must have multiple dummy files created for it), and I can't imagine my colleagues will be particularly happy with it :(

That does sound ugly 🙃. Aside from #6845 the other workaround I’ve seen/used is to use the poetry export plugin to export a text file for each subproject/dependency group then manipulate that text file to remove path dependencies so you can pip install the third depa.

I've been using this plugin to build wheels for projects with path dependencies:

https://github.com/artisanofcode/poetry-stickywheel-plugin/

It's been working great so far, but as the name suggests, it doesn't work at all for sdists:

artisanofcode/poetry-stickywheel-plugin#10

For my current use case that's fine, wheels are good enough.

Thank you for linking that project, I’ll take a look! I suspect that wheels are more than good enough for 98% of use cases.

@mwgamble
Copy link

the other workaround I’ve seen/used is to use the poetry export plugin to export a text file for each subproject/dependency group then manipulate that text file to remove path dependencies so you can pip install the third depa.

I've also had to employ this trick for a specific use case :(

@adriangb
Copy link
Contributor Author

adriangb commented Mar 4, 2023

@mwgamble so just to confirm, would #6845 fit your use case for building images with caching?

@conor-james-sheehan
Copy link

Hey @adriangb, just wanted to add my support for this issue.

I also manage a poetry monorepo and my biggest gripe is the lack of Docker caching.

From what I can see, #6845 would fit our use cases for this

@radoering
Copy link
Member

I will try to describe my understanding. Please correct me if some of my assumptions are wrong.

If I understand correctly, the only use case to skip all path dependencies during installation is docker (or similar) caching. This is especially useful in a monorepo use case but is not restricted to monorepos.

Every time you run poetry install with this flag, you want to run poetry install without this flag later, don't you? Otherwise, you have an incomplete installation and stuff might fail.
☝️ We have to make this clear in the cli docs and warn that without a subsequent proper poetry install, you will end up with a incomplete/broken environment.

Dependency groups can't be used for this use case, because there might be transitive path dependencies and non-path sub-dependencies of path dependencies that shouldn't be ignored in order to cache as much as possible. (Maybe, we should write that somewhere in the docs in case someone is wondering.)

Actually, this is not about path dependencies, i.e. file dependencies and directory dependencies, but only about directory dependencies, isn't it? If that's the case we should change the wording. (I know that both are sometimes lumped together but there are already differences between them, because directory dependencies are more volatile than file dependencies, e.g. directory dependencies are always resolved when running poetry lock --no-update but file dependencies are not.)

@adriangb
Copy link
Contributor Author

adriangb commented Mar 6, 2023

If I understand correctly, the only use case to skip all path dependencies during installation is docker (or similar) caching. This is especially useful in a monorepo use case but is not restricted to monorepos.

Yep that’s right. I agree that Docker + monorepos is going to be the main use case but there could be others, and #6845 is not monorepo or Docker specific in any way.

Every time you run poetry install with this flag, you want to run poetry install without this flag later, don't you? Otherwise, you have an incomplete installation and stuff might fail. ☝️ We have to make this clear in the cli docs and warn that without a subsequent proper poetry install, you will end up with a incomplete/broken environment.

I wouldn’t call the environment “broken”, maybe “incomplete” but that depends on what your goal is. There’s already a precedent for this with —no-root which leaves your environment in a similar state. But more/better documentation is always good, so no reason not to expand on that!

Dependency groups can't be used for this use case, because there might be transitive path dependencies and non-path sub-dependencies of path dependencies that shouldn't be ignored in order to cache as much as possible. (Maybe, we should write that somewhere in the docs in case someone is wondering.)

Yup exactly. I tried to touch upon this in the docs included with #6845 but I’m happy to add more detail where needed or unclear.

Actually, this is not about path dependencies, i.e. file dependencies and directory dependencies, but only about directory dependencies, isn't it? If that's the case we should change the wording. (I know that both are sometimes lumped together but there are already differences between them, because directory dependencies are more volatile than file dependencies, e.g. directory dependencies are always resolved when running poetry lock --no-update but file dependencies are not.)

Yes this is only about path dependencies. I thought the name of the new option (—no-path) already makes that quite clear, but I’m happy to change things. Can you comment on #6845 wherever you think a wording change is necessary?

Regarding file dependencies, the reasons I excluded them from this work is:

  • I don’t know of any workflow that would benefit from skipping file dependencies
  • If such a workflow existed I’d argue it should be it’s own CLI flag and should be discussed separately; I don’t think it should block this option.
  • Like you say file dependencies are less volatile. If someone has a wheel file that always has the same name but the contents change that would be an anti pattern (eg PyPi won’t even let you upload duplicate wheels).

@radoering
Copy link
Member

Yes this is only about path dependencies
[...]

  • I don’t know of any workflow that would benefit from skipping file dependencies

My point is that a directory dependency is a path dependency and a file dependency is a path dependency, too.

<path> = <directory> | <file>

@adriangb
Copy link
Contributor Author

adriangb commented Mar 6, 2023

Good point. Any ideas for the right term then? --no-directory? --no-path-directory?

@radoering
Copy link
Member

--no-path-directory seems a bit redundant. I like --no-directory better. Maybe, just --no-dir. (There is a method is_dir() in pathlib, so this can be considered a common abbreviation.)

@adriangb
Copy link
Contributor Author

adriangb commented Mar 7, 2023

Let’s discuss more in #6845 since this is quite specific to that

@mwgamble
Copy link

@adriangb Yes, I believe that PR does solve my problems :) Thank you!

@jamesrobson-secondmind
Copy link

I was experimenting with this arrangement and there seems to be a potential problem with cli use of poetry. Using https://github.com/adriangb/python-monorepo, if you add pytest as a dev dependency in poetry/pyproject.toml you can run the following commands:

poetry run pytest workspaces/lib/tests/test_foo.py #will work
cd workspaces/lib/
poetry run pytest tests/test_foo.py #will fail

The failure will happen whether or not you have a poetry.toml with virtualenvs.create = false.

In a repo wil only 1 pyproject.toml file you can run commands in the virtual env from anywhere inside that repo, but with this arrangement you can only do so from the very top directory. People can probably get used to this, but it does seem like this would be a major source of complaints.

How hard would it be to add some mechanism to flag that a pyproject.toml file doesn’t have an associated virtual env and poetry needs to search higher up the directory tree?

@adriangb
Copy link
Contributor Author

pytest is particularly difficult to make work with monorepos, especially with a /test directory. See pytest-dev/pytest#9504. This is more of a pytest problem.

I’m not sure what error you get, is it because poetry creates a new virtual env in workspaces/lib/.venv and that one doesn’t have pytest?

We can make a plug-in that lets you do poetry workspaces —only lib run pytest or something like that and will be aware of the layout in the sense that it will use the top level virtual env but execute from the workspaces/lib folder. Lots of things we could do with a plug-in like that, e.g. an —isolated command that removes any deps not specific to that workspace. I thin

There is also something to be said for each workspace having all of the dependencies it needs, including pytest, so that you can always work on that single workspace as if it were it’s own non-workspace package.

@jamesrobson-secondmind
Copy link

I’m not sure what error you get, is it because poetry creates a new virtual env in workspaces/lib/.venv and that one doesn’t have pytest?

The specific error is

Traceback:
/usr/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_foo.py:2: in <module>
    from namespace.lib import Foo
E   ModuleNotFoundError: No module named 'namespace'

This does look to be because poetry creates a virtual env in namespace/lib which has nothing installed in it. Interestingly you get the same error with virtualenvs.create = false, I'm less sure what happens in that case.

We can make a plug-in that lets you do poetry workspaces —only lib run pytest or something like that

For myself, we already use taskipy which lets you define tasks in your pyproject toml. So you can run poetry run task test_lib and have that task be pytest workspace/lib/tests or anything else you want. If you need to use a plugin then using an existing one seems simpler.
I'm still exploring options for the best way to work with the repo we have as it grows. But my thinking was having our CI process be something like, for each workspace run:

poetry install --only dev --only $workspace
poetry run task test_$workspace

which would provide isolation to ensure the listed dependencies for each workspace is correct. The concern I have is more about developers working in one workspace wanting to run say an individual test and tripping over the fact you need to add extra options depending on where you are.

There is also something to be said for each workspace having all of the dependencies it needs, including pytest, so that you can always work on that single workspace as if it were it’s own non-workspace package.

Yes. If you have a plugin that would create a virtual environment for each workspace and keep them up to date when changing the lockfile I think that would solve this problem. It would need that plugin though, otherwise you'd have to update each workspace manually.

@adriangb
Copy link
Contributor Author

adriangb commented Mar 17, 2023

I pushed up a Makefile that is similar to what I used to manage a ~12 workspace project. You can choose to work from the root repo using the make test-lib commands, running pipx run poetry run pytest workspaces/lib yourself or you can cd into a particular directory and manually use the venv one folder up (../.venv/bin/activate && pytests . sort of thing).

Locally we'd usually just have a single venv with all of the project installed. So you weren't really testing each one in isolation but you also didn't have to deal with multiple envs or syncing deps. I was also never able to get VSCode's pytest integration to work from the root repo so if I had to run tests in a debugger I'd have to cd into the project folder and manually select the top level virtual env.

In CI we ran each project in it's own job so we could test them in isolation. We had a pre-commit hook to template the list of workspaces into a YAML file which we then fed into GitLab CI to run a parameterized job who's run step was just make test-${WORKSPACE_NAME}.

It all worked pretty well, but I’m not sure this is the ideal or only workflow. Experimenting in a plug-in is the right next step.

@glebrh
Copy link

glebrh commented Oct 11, 2023

Hi @adriangb

First of all thanks for providing setup example and several PRs to make it work seamlessly!

I have a question (which seems to be not addressed in none of the issues where this conf is discussed): is there a good way to make poetry add work for subpackages?

  • Running poetry add within a subpackage folder results in creation of poetry.lock file inside the subpackage folder. This is not the desired way, since poetry.lock is supposed to reside inside the root.
  • Running poetry add -G subpackage updates pyproject.toml within the root, but not pyproject.toml within the subpackage folder.

Could you pls comment on that matter?

@adriangb
Copy link
Contributor Author

I think it'd be doable by making a poetry plugin. But then you'd have to do poetry workspace add requests / poetry workspaces subpackage add requests for your use cases respectively or something similar, so it's not going to be particularly pretty unless it's built in. For now I've found that putting poetry.lock files in a .gitignore, manually editing pyproject.toml files and/or cd'ing to the appropriate directory works fine.

@sfriedowitz
Copy link

sfriedowitz commented Jan 12, 2024

@adriangb is there any update here on how to build wheels for subprojects that contain path dependencies?

A use case we have is to be able to publish a subproject that depends on another subproject (via a path dependency) maintained in the monorepo.

Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/feature Feature requests/implementations status/triage This issue needs to be triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants