Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix import mechanism for task modules. #373

Merged
merged 16 commits into from May 23, 2023

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented May 2, 2023

Closes #374.

Corresponding issue in pytest: https://github.com/pytest-dev/pytest/issues/7856

I get this obscure error from within dataclasses:

────────────────────────── Failures during collection ──────────────────────────
───────── Could not collect test_collect_module_name_0/task_module.py ──────────

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ in exec_module:883                                                           │
│ in _call_with_frames_removed:241                                             │
│                                                                              │
│ /private/var/folders/nz/wfwcn6md1796t48tx9dhf0v40000gn/T/pytest-of-nickcrews │
│ /pytest-41/test_collect_module_name_0/task_module.py:7 in <module>           │
│                                                                              │
│   4 import dataclasses                                                       │
│   5                                                                          │
│   6 @dataclasses.dataclass                                                   │
│ ❱ 7 class Data:                                                              │
│   8 │   x: int                                                               │
│   9                                                                          │
│                                                                              │
│ /Users/nickcrews/.pyenv/versions/3.10.4/lib/python3.10/dataclasses.py:1263   │
│ in dataclass                                                                 │
│                                                                              │
│   1260 │   │   return wrap                                                   │
│   1261 │                                                                     │
│   1262 │   # We're called as @dataclass without parens.                      │
│ ❱ 1263 │   return wrap(cls)                                                  │
│   1264                                                                       │
│   1265                                                                       │
│   1266 def fields(class_or_instance):                                        │
│                                                                              │
│ /Users/nickcrews/.pyenv/versions/3.10.4/lib/python3.10/dataclasses.py:1253   │
│ in wrap                                                                      │
│                                                                              │
│   1250 │   """                                                               │
│   1251 │                                                                     │
│   1252 │   def wrap(cls):                                                    │
│ ❱ 1253 │   │   return _process_class(                                        │
│   1254 │   │   │   cls, init, repr, eq, order, unsafe_hash, frozen, match_ar │
│   1255 │   │   )                                                             │
│   1256                                                                       │
│                                                                              │
│ /Users/nickcrews/.pyenv/versions/3.10.4/lib/python3.10/dataclasses.py:1003   │
│ in _process_class                                                            │
│                                                                              │
│   1000 │   │   # See if this is a marker to change the value of kw_only.     │
│   1001 │   │   if _is_kw_only(type, dataclasses) or (                        │
│   1002 │   │   │   isinstance(type, str)                                     │
│ ❱ 1003 │   │   │   and _is_type(type, cls, dataclasses, dataclasses.KW_ONLY, │
│   1004 │   │   ):                                                            │
│   1005 │   │   │   # Switch the default to kw_only=True, and ignore this     │
│   1006 │   │   │   # annotation: it's not a real field.                      │
│                                                                              │
│ /Users/nickcrews/.pyenv/versions/3.10.4/lib/python3.10/dataclasses.py:764 in │
│ _is_type                                                                     │
│                                                                              │
│    761 │   │   if not module_name:                                           │
│    762 │   │   │   # No module name, assume the class's module did           │
│    763 │   │   │   # "from dataclasses import InitVar".                      │
│ ❱  764 │   │   │   ns = sys.modules.get(cls.__module__).__dict__             │
│    765 │   │   else:                                                         │
│    766 │   │   │   # Look up module_name in the class's module.              │
│    767 │   │   │   module = sys.modules.get(cls.__module__)                  │
╰──────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute '__dict__'

Changes

Provide a description and/or bullet points to describe the changes in this PR.

Todo

  • Reference issues which can be closed due to this PR with "Closes #x".
  • Review whether the documentation needs to be updated.
  • Document PR in docs/changes.rst.

I get this obscure error from within dataclasses:

────────────────────────── Failures during collection ──────────────────────────
───────── Could not collect test_collect_module_name_0/task_module.py ──────────

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ in exec_module:883                                                           │
│ in _call_with_frames_removed:241                                             │
│                                                                              │
│ /private/var/folders/nz/wfwcn6md1796t48tx9dhf0v40000gn/T/pytest-of-nickcrews │
│ /pytest-41/test_collect_module_name_0/task_module.py:7 in <module>           │
│                                                                              │
│   4 import dataclasses                                                       │
│   5                                                                          │
│   6 @dataclasses.dataclass                                                   │
│ ❱ 7 class Data:                                                              │
│   8 │   x: int                                                               │
│   9                                                                          │
│                                                                              │
│ /Users/nickcrews/.pyenv/versions/3.10.4/lib/python3.10/dataclasses.py:1263   │
│ in dataclass                                                                 │
│                                                                              │
│   1260 │   │   return wrap                                                   │
│   1261 │                                                                     │
│   1262 │   # We're called as @DataClass without parens.                      │
│ ❱ 1263 │   return wrap(cls)                                                  │
│   1264                                                                       │
│   1265                                                                       │
│   1266 def fields(class_or_instance):                                        │
│                                                                              │
│ /Users/nickcrews/.pyenv/versions/3.10.4/lib/python3.10/dataclasses.py:1253   │
│ in wrap                                                                      │
│                                                                              │
│   1250 │   """                                                               │
│   1251 │                                                                     │
│   1252 │   def wrap(cls):                                                    │
│ ❱ 1253 │   │   return _process_class(                                        │
│   1254 │   │   │   cls, init, repr, eq, order, unsafe_hash, frozen, match_ar │
│   1255 │   │   )                                                             │
│   1256                                                                       │
│                                                                              │
│ /Users/nickcrews/.pyenv/versions/3.10.4/lib/python3.10/dataclasses.py:1003   │
│ in _process_class                                                            │
│                                                                              │
│   1000 │   │   # See if this is a marker to change the value of kw_only.     │
│   1001 │   │   if _is_kw_only(type, dataclasses) or (                        │
│   1002 │   │   │   isinstance(type, str)                                     │
│ ❱ 1003 │   │   │   and _is_type(type, cls, dataclasses, dataclasses.KW_ONLY, │
│   1004 │   │   ):                                                            │
│   1005 │   │   │   # Switch the default to kw_only=True, and ignore this     │
│   1006 │   │   │   # annotation: it's not a real field.                      │
│                                                                              │
│ /Users/nickcrews/.pyenv/versions/3.10.4/lib/python3.10/dataclasses.py:764 in │
│ _is_type                                                                     │
│                                                                              │
│    761 │   │   if not module_name:                                           │
│    762 │   │   │   # No module name, assume the class's module did           │
│    763 │   │   │   # "from dataclasses import InitVar".                      │
│ ❱  764 │   │   │   ns = sys.modules.get(cls.__module__).__dict__             │
│    765 │   │   else:                                                         │
│    766 │   │   │   # Look up module_name in the class's module.              │
│    767 │   │   │   module = sys.modules.get(cls.__module__)                  │
╰──────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute '__dict__'
@codecov
Copy link

codecov bot commented May 2, 2023

Codecov Report

Merging #373 (b25751e) into main (836518a) will increase coverage by 0.06%.
The diff coverage is 99.28%.

❗ Current head b25751e differs from pull request most recent head 9927ff3. Consider uploading reports for the commit 9927ff3 to get more accurate results

@@            Coverage Diff             @@
##             main     #373      +/-   ##
==========================================
+ Coverage   96.09%   96.16%   +0.06%     
==========================================
  Files          93       93              
  Lines        7664     7797     +133     
==========================================
+ Hits         7365     7498     +133     
  Misses        299      299              
Flag Coverage Δ
end_to_end 82.66% <47.85%> (-0.64%) ⬇️
integration 40.95% <19.28%> (-0.38%) ⬇️
unit 66.60% <88.57%> (+0.39%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/_pytask/path.py 95.52% <97.05%> (+1.58%) ⬆️
src/_pytask/collect.py 95.32% <100.00%> (+0.46%) ⬆️
tests/test_collect.py 94.92% <100.00%> (+0.71%) ⬆️
tests/test_debugging.py 89.92% <100.00%> (+0.03%) ⬆️
tests/test_path.py 95.12% <100.00%> (+10.91%) ⬆️

@tobiasraabe tobiasraabe changed the title test: add failing test for module name Fix import mechanism for task modules. May 6, 2023
@tobiasraabe
Copy link
Member

Hi @NickCrews, do the changes solve your issue?

@NickCrews
Copy link
Contributor Author

Thanks for the quick turnaround! This fixes that exact problem, but introduces a new one:

Say I have mypkg/io.py. Then this will set sys.modules["io"] = <my module> shadowing the builtin module io, and then any other code that does import io will import the wrong thing and break.

I'm not exactly sure, but should we be doing what pytest does, and search up the directory tree until we don't find an __init__.py, and add that root dir to sys.path, then mypkg/io.py would be imported as mypkg.io (which is what the rest of my application code does, which is correct).

@tobiasraabe
Copy link
Member

You are right. This was too quick. Thanks for the comment.

To comment on your example with the io module, it would only concern task_modules, and I would consider it best practices to not import from task modules because it can easily lead to circular import errors. (I should check again whether it is documented.)

The approach implemented in pytask is closest to the importlib import mode. This is also the import mode pytest recommends for new projects and drawbacks and advantages are also discussed here.

The major drawback that I see is that task modules need to have unique names like task_module or they need to have unique names based on the package they are embedded in, like project.data_management.task_module. I assume this is not a real problem for existing pytask projects since the recommendation is to create projects based on the src layout.

The benefit of this approach is that we do not need to change sys.path at all, which introduces a lot of complicated logic.

If we follow this approach, I think three things must be done.

  • Add the root_path argument to import_path: link
  • Create the module name from the path: link
  • Insert missing modules: link

@NickCrews, how does it sound to you? Since you mentioned the prepend and append modes, do you see a use case where the importlib mode does not achieve what you want and where we need more flexibility?

@NickCrews
Copy link
Contributor Author

NickCrews commented May 17, 2023

Thanks for the patience, and thanks for going ahead and implementing something concrete. I was having trouble understanding exactly what you were proposing until I read those commits.

I'll leave a few nits as review but in general looks good, thank you!

The major drawback that I see is that task modules need to have unique names like task_module

You're saying this is the drawback of prepend and append, right? Because this is NOT a requirement for importlib, what you have implemented here. I think we should add a test for this.

best practices to not import from task modules because it can easily lead to circular import errors

I respectfully disagree. I would like to have my tasks mixed with my "main logic" code. I have a layout where each .py files represents one discrete step in processing. I want each step to export 3 different interfaces:

  1. A public python API, eg def clean_names(names: pd.DataFrame) -> pd.DataFrame
  2. A task that wraps this python API eg def task_clean_names(....), so that this step can be orchestrated with all the other steps in a larger pipeline.
  3. A CLI to run this single step of processing, which loads/saves based on the paths specified in tast_clean_names

I want all three of these related functionalities to be in one file. This requires that one module containing tasks needs to import other modules containing tasks. Now, this actually works fine right now and I have no complaints! I just need to import as from mypkg.names.clean import clean_names and then make sure that mypkg is installed/discoverable, but this is standard stuff outside of pytask. But I just wanted to flag that I don't think we should assume/advocate for keeping tasks outside of app code. IDK, maybe I just haven't run into this circ dependency issue yet.

I think avoiding adjusting sys.path is a very good goal.

This implementation worked fine for my project. I could see this still not covering someone else's use case, but I think this is at least moving in the right direction, and I don't think should break anyone (unless they are relying on something weird/fragile with the value of __name__).

src/_pytask/path.py Outdated Show resolved Hide resolved
src/_pytask/path.py Outdated Show resolved Hide resolved
tests/test_path.py Outdated Show resolved Hide resolved
tests/test_path.py Outdated Show resolved Hide resolved
tests/test_path.py Outdated Show resolved Hide resolved
tests/test_path.py Outdated Show resolved Hide resolved
tests/test_collect.py Outdated Show resolved Hide resolved
@tobiasraabe
Copy link
Member

Hi @NickCrews, thank you for your review 🙏. Unfortunately, I won't be able to answer you soon, but I will return to you on Monday.

@tobiasraabe
Copy link
Member

Thanks for staying with me!

You're saying this is the drawback of prepend and append, right? Because this is NOT a requirement for importlib, what you have implemented here. I think we should add a test for this.

You are right. I misremembered when I wrote it.

I respectfully disagree. I would like to have my tasks mixed with my "main logic" code...

I should have been more precise since I do not disagree with your setup. Nor do I want to strictly separate app and task code. I was thinking more about situations like those described in this article of the documentation.

In this example, general functions that provide access to the inputs and outputs of a stage like data_preparation are stored in config.py files. Then, the following step, analysis can access the results of the previous stage by importing path_to_processed_data from my_project.data_preparation.config instead of my_project.data_preparation.task_prepare_data.

You don't need to do that, as you said. Though, in bigger projects, this recommendation has helped beginners to care more about code separation, avoiding long modules, and by that also avoiding circular imports.

It might be misguided to say it is best practice because it concerns this specific case and is only documented in one article.

I will integrate the proposed changes soon and, then, I will release a new version as well.

@tobiasraabe tobiasraabe merged commit 9e6be35 into pytask-dev:main May 23, 2023
17 checks passed
@tobiasraabe tobiasraabe added this to the v0.3.2 milestone May 23, 2023
@NickCrews
Copy link
Contributor Author

This looks great, thank you so much for your work here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: module.__name__ not set set correctly during exec in test collection
2 participants