Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missing detail in grid documentation #973

Open
edublancas opened this issue Aug 9, 2022 · 10 comments
Open

missing detail in grid documentation #973

edublancas opened this issue Aug 9, 2022 · 10 comments
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers

Comments

@edublancas
Copy link
Contributor

in the docs we have an example that looks like this:

tasks:
  - source: random-forest.py
    name: random-forest-
    product: 'n_estimators=[[n_estimators]]/criterion=[[criterion]].html'
    grid:
        n_estimators: [5, 10, 20]
        criterion: [gini, entropy]

link

the `[[placeholders]]`` are replaced at runtime by the parameter values, however, to ensure each file has a different name we also append a number at the end; but this isn't documented.

right below the snippet that I put above, we should add a little note saying that we number the filenames to prevent conflicting paths.

@edublancas edublancas added documentation Improvements or additions to documentation med priority labels Aug 9, 2022
@lukesmurray
Copy link

don't know if there is an issue for it, but would be great to remove the numbers altogether with a flag, especially when we want to send data to other processes and need to know the filenames.

@edublancas
Copy link
Contributor Author

Yeah, this makes sense. the filenames should be predictable, I'm deciding between two solutions:

  1. explicit config, something like:
tasks:
  - source: random-forest.py
    name: random-forest-
    product: 'n_estimators=[[n_estimators]]/criterion=[[criterion]].html'
    grid_number_suffix: false
    grid:
        n_estimators: [5, 10, 20]
        criterion: [gini, entropy]
  1. automatically turning it off if it detects that there are [[placeholders]] in product:
tasks:
  - source: random-forest.py
    name: random-forest-
    # no numbering here
    product: 'n_estimators=[[n_estimators]]/criterion=[[criterion]].html'
    grid_number_suffix: false
    grid:
        n_estimators: [5, 10, 20]
        criterion: [gini, entropy]
tasks:
  - source: random-forest.py
    name: random-forest-
    # numbering here
    product: 'report.html'
    grid_number_suffix: false
    grid:
        n_estimators: [5, 10, 20]
        criterion: [gini, entropy]

option 2 is nice since users won't need to read docs, but it'll often lead to issues if the tasks in the grid produce non-unique products (ploomber does not allow this since this will cause tasks to overwrite their results hence it raises an error)

thoughts?

@lukesmurray
Copy link

I think the second option with a reasonable error is better. Although I assume you meant to remove grid_number_suffix in the second option. You could actually have grid_number_suffix as an option and then have an error message like.

Unable to resolve pipeline. Two products contain identical names.
     task: first task
         products/foo/bar/baz
     task: second task
         products/foo/bar/baz
     products/foo/bar/baz

You can change the product names, or add additional [[placeholders]] to uniquely identify these products
or set `grid_number_suffix: true` to automatically add numbered suffixes to these products to make them
unique.

@edublancas
Copy link
Contributor Author

Ah, great catch! Yeah, for the second scenario I meant removing the grid_number_suffix, but I like your approach of having that option as well and showing it as part of the error message, so users don't have to dig into the docs.

I'll add this to the backlog.

thanks a lot for your feedback!

@Wxl19980214
Copy link
Contributor

Ok so what's left to do here? Are we settle on an option yet?

@edublancas
Copy link
Contributor Author

I think the second option with a reasonable error is better. Although I assume you meant to remove grid_number_suffix in the second option. You could actually have grid_number_suffix as an option and then have an error message like.

Unable to resolve pipeline. Two products contain identical names.
     task: first task
         products/foo/bar/baz
     task: second task
         products/foo/bar/baz
     products/foo/bar/baz

You can change the product names, or add additional [[placeholders]] to uniquely identify these products
or set `grid_number_suffix: true` to automatically add numbered suffixes to these products to make them
unique.

let's go with this @Wxl19980214

@Wxl19980214
Copy link
Contributor

So what we want to do is to take in another parameter grid_number_suffix right? Which is optional, if a user doesn't provide it and run with grid, we output above error? And if he turn this on, we automatically add suffix to it like we are doing now?

@Wxl19980214
Copy link
Contributor

Let me just clarify things a bit. So this functionality only applies to grid usage? And the premise is that user doesn't look at our docs so they don't know placeholders, all they do is something like

tasks:
   source: "script.py"
   product: "report.html"
   grid: {
     param : [1,2]
  }

So if they do this, we output error saying you can either give placeholders like report-param=[param].html, or add the grid_number_suffix = True so ploomber automatically add the suffix to make it report-1.html and report-2.html?

@Wxl19980214
Copy link
Contributor

What about this case:

tasks:
  - source: random-forest.py
    name: random-forest-
    # no numbering here
    product: 'n_estimators=[[n_estimators]]/criterion=[[criterion]].html'
    grid_number_suffix: false
    grid:
        n_estimators: [5, 10, 20]
        criterion: [gini, entropy]

So the products will have different names, but the shouldnt we add number suffix to names: like random-forest-0, etc?

@edublancas
Copy link
Contributor Author

edublancas commented Aug 19, 2022

Yeah, you got it right. Let's start with the main motivation for this: pipeline task's should produce unique output files, so if I do:

tasks:
  - source: random-forest.py
    name: random-forest-
    # no numbering here
    product: output.html
    grid:
        n_estimators: [5, 10, 20]
        criterion: [gini, entropy]

ploomber will complain since all tasks generated from the grid will create an "output.html". so the initial solution was to automatically number outputs: output-1.html, output-2.html...

Recently we added the [[placeholder]] options. They were initially intended to help users find which files were generated by which set of parameters but they can also be used to identify outputs uniquely:

tasks:
  - source: random-forest.py
    name: random-forest-
    product: 'n_estimators=[[n_estimators]]/criterion=[[criterion]].html'
    grid_number_suffix: false
    grid:
        n_estimators: [5, 10, 20]
        criterion: [gini, entropy]

but this wont work if there are missing parameters:

tasks:
  - source: random-forest.py
    name: random-forest-
    # n_estimators is missing!
    product: 'criterion=[[criterion]].html'
    grid_number_suffix: false
    grid:
        n_estimators: [5, 10, 20]
        criterion: [gini, entropy]

So the proposed solution is:

if we detect non-unique products, then we tell the user to either turn grid_number_suffix: true or to add all the [[placeholders]]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers
Projects
None yet
3 participants