Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a docs page about adding code beyond starter files #3852

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

yury-fedotov
Copy link

@yury-fedotov yury-fedotov commented May 6, 2024

Description

Closes #3418

Per this discussion in Slack with @noklam, I took a task of creating a guide on how to add code beyond starter files in Kedro.

P.S. It's my first contribution to the project, so I'd be happy to iterate as much as needed until it satisfies core team's needs. And appreciate any feedback.

Development notes

  • Built docs locally to test that hyperlinks and formatting work as expected.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
@yury-fedotov yury-fedotov marked this pull request as ready for review May 6, 2024 06:13
@yury-fedotov
Copy link
Author

@yury-fedotov
Copy link
Author

@noklam, @astrojuanlu, could you please have a look?

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this contribution @yury-fedotov! This is already looking great ⭐
I left some comments, mostly small nits.

`__main__.py` file of your project, but such advanced customizations are not in scope of this article.
```

This being the only constraint means that you can, for example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if the points below are linked to the section that describe them in more detail. A bit like how we setup the configuration page with "how to.." https://kedro--3852.org.readthedocs.build/en/3852/configuration/configuration_basics.html#how-to-use-kedro-configuration

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Merel, great suggestion, just implemented this. Rendered build here. Let me know if that's what you had in mind.

yury-fedotov and others added 9 commits May 16, 2024 19:31
Comment on customizations

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
For example in splitting

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Make provided examples title not bold

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
And as where appropriate

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Organise spelling

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Proper wording for pip install note

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Implies to consists replacement

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>

# Conflicts:
#	RELEASE.md
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
@yury-fedotov
Copy link
Author

Hey @merelcht - thanks for comments again, I just finished implementing them.

QQ also about the vale usage in CI. I noticed that it produces some warnings on this PR, but apparently on many other existing MD files. So it's like a soft CI check?

@merelcht
Copy link
Member

QQ also about the vale usage in CI. I noticed that it produces some warnings on this PR, but apparently on many other existing MD files. So it's like a soft CI check?

Yes, it's a soft check. I usually treat it as a guidance on grammar/word usage. It does also check for spelling mistakes, which is helpful. But you definitely don't need to address it all.

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you for the contribution!

It would be great if @stichbury can help review the style of the docs a little bit, otherwise have you seen https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide already?

I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu

Comment on lines +8 to +10
read and collaborate on as your codebase grows.
Those files also sometimes make new users think that Kedro requires code
to be located only in those starter files, which is not true.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the clarification, but I found this sounds like coming from an user rather than the official docs.

#2512 (comment), we have an answer for this and the issue is still opened. Would it be better to actually write the documentation and link here instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate more on what exact change are you proposing here?
To reference this GH issue right in the code_beyond_starter_files.md?
Or to make content of comments on that issue part of this new section of docs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... I am not the best person to ask for English but I'll try my best🤓

While those may be sufficient for a small project, they quickly become large, hard to read and collaborate on as your codebase grows. Those files also sometimes make new users think that Kedro requires code to be located only in those starter files, which is not true.

I'd rephrase it to something like "When project become large, it may be beneficial to adopt a different structure or give pipeline files a more specific names." It is by convention (and find_pipeline use that convention), but not mandatory to name files pipeline.py

```

This being the only constraint means that you can, for example:
* Add `utils.py` file to a pipeline folder and import utilities used by multiple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually avoid utils.py as much as possible as they are the bin for everything. It's ironic because kedro do have utils module that are left from years ago. It hasn't been growing though as we believe it's better to have explicit module.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noklam how would you then call a .py file that is not nodes.py or pipeline.py for the purpose of this example? I was thinking of dataframe_utils.py, but didn't like it because it adds to the impression that Kedro is only useful in data processing projects, which isn't true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is many thing that I don't like about utils (just me, not a team consensus). The purpose is ill-defined and often become the place that people dump code to without thinking.

Even if we go with utils. I will strip the _utils suffix, it feels redundant to have pandas_utils.py under utils.py. Then in the code I will probably do from <pacakge> import utils. When I need to use it, I will use utils.pandas.func to make it clear that this is a util namespace but not pandas.

visualitization_utils.py could just be visualisation module itself.
(all above are subjective)

I think it will be great to first introduce the principle of share module, what are the factors to consider. Then you can show this example. https://kedro-org.slack.com/archives/C03RKP2LW64/p1716912123397259
@datajoely do you have thought about this?

This being the only constraint means that you can, for example:
* Add `utils.py` file to a pipeline folder and import utilities used by multiple
functions in `nodes.py` from there.
* [Share modules between pipelines](#sharing-modules-between-pipelines).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be really beneficial if we have something to showcase, maybe there are some projects in awesome-kedro that we can link to?

Even just the tree structure of a project would be great.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that exactly what I'm adding there?

Screenshot 2024-05-25 at 12 47 31 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example is good, but I think we can link https://github.com/kedro-org/awesome-kedro/blob/master/README.md#example-projects to direct people for more examples. It's also helpful to see real code

Comment on lines 43 to 44
* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example,
into `historical_pipeline.py` and `inference_pipeline.py`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we usually go for the namespace/modular structure more:

  • pipelines
    • historical
    • inference
      • pipeline.py

@astrojuanlu thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that historical vs inference sound like candidates for leveraging same pipeline just different namespace. To avoid the confusion, I changed the example to this:

If you have multiple large `Pipeline` objects defined in a single `pipeline.py`,
split them into separate `.py` files. For example, in `data_processing` pipeline
you may want to have `cleaning_pipeline.py` and `merging_pipeline.py`.

Is that better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

into `historical_pipeline.py` and `inference_pipeline.py`.
* Instead of registering many pipelines in `register_pipelines()` function one by one,
create a few `tp.Dict[str, Pipeline]` objects in different places of the project
and then make `register_pipelines()` return a union of those.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about find_pipelines()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added:

While Kedro features a [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery),
for large projects you may want a finer control and register pipelines manually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My questions are:

  • is historical_pipeline.py and inference_pipelines.py better than pipelines/historical/pipeline.py (the modular pipeline structure) that Kedro usually promotes?

I think this is a viable alternative, but if this is in docs instead of a blog. I'll probably change the narrative to: There is an alternative to register pipeline manually, explaining the pipeline.py is just a convention and for find_pipeline works automatically. User still have the option to register manually if desired.

@astrojuanlu thought?

@stichbury
Copy link
Contributor

I am happy to review/edit this if you decide to retain as docs, but please let me know what you decide about this:

I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu

LMK if you want to have a blog post instead, which sounds like a great idea, and then you could extract from the post if you wanted a longer-term piece of content that you maintain officially as docs. I wouldn't be able to craft the post without some notice and a ticket, but would definitely support the idea and help with later stages to edit and post, if you need it.

Just ping me again when you decide how to go forward!

@merelcht
Copy link
Member

Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this contribution @yury-fedotov ! I left a couple of comments of outstanding things I saw.

I agree it would be nice to have this as a "How-To" guide (pretty much aligned with the https://diataxis.fr/ taxonomy and other parts of our docs). I would have it maybe under "Advanced usage", somewhere, although just for continuity, not really a fan of the "Advanced" name there.

@stichbury
Copy link
Contributor

Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏

For sure! I'm not sure if I'll be able to get to this in this sprint, but I'll add to my list; otherwise it'll be next week. To save holding you up, probably best to get it into the state where you want me to do a final review/update and let me know it is ready.

@yury-fedotov
Copy link
Author

Thanks for the comments team! I'll get to them next week and come back.

Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>

# Conflicts:
#	RELEASE.md
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
@stichbury
Copy link
Contributor

@yury-fedotov @astrojuanlu Can you ping me when this is ready for my review and I'll get you any final feedback on the candidate draft.

@astrojuanlu
Copy link
Member

There are a couple of outstanding comments waiting for response from @noklam and after that I think we should be good for a next round of reviews. Thanks for your patience @yury-fedotov 🙏🏼

@yury-fedotov
Copy link
Author

I'll ping you too @stichbury . It's not ready yet.

@noklam noklam requested a review from datajoely May 29, 2024 16:10
@datajoely
Copy link
Contributor

So I'm just seeing this for the fist time. Thank you for the push @yury-fedotov. I feel quite strongly that we're overcomplicating things:

Problem:

  • Business logic grows in complexity and should live and be tested outside of your core Kedro project.
  • You should also be aspiring to achieve the general principle of loose coupling, high cohesion
  • Python paths and packaging in general is a pain for many novice programmers.

What we're talking about here isn't Kedro specific at all, so there's an argument we shouldn't be writing extensive docs and perhaps would be better placed to link to relevant resources.

If we are to provide a solution, this would be my approach:

  • This is a solved problem at QuantumBlack with tooling to create and manage python packages within a monrepo. The monorepo section as it's written today touches on some good points, but in my opinion we should pivot this section into explaining to users how to achieve this best, the whole section about pipeline_registry.py and settings.py feels way off piste to me.
    • We should explain how to create a minimal python package, the Dagster folks have some great accessible resources here (1) (2)
    • We should explain the virtues of writing and testing business logic, uncoupled from pipeline flow logic.
    • This is the pattern we should be encouraging and introducing concepts like pip install path/to/package -e feels like something we should be explaining here.

@noklam noklam requested a review from astrojuanlu May 29, 2024 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Describe how to use custom code with Kedro
6 participants