
Workflow output definitions and schema #4670

Closed
bentsherman opened this issue Jan 17, 2024 · 45 comments · Fixed by #4784

Comments

@bentsherman (Member)

Spun off from #4042 (comment)

Currently, the publishDir directive is used to define "workflow outputs". It is a "push" model in which individual processes are responsible for deciding which of their outputs should be workflow outputs, and how to publish them. It was created in the days of DSL1 when there was no concept of workflows / subworkflows.

With DSL2 and the introduction of workflows, publishDir is an unwieldy way to define workflow outputs. For one thing, it is extremely difficult to get a full picture of a workflow's outputs because they are scattered across the process definitions. Additionally, a process could be invoked many different times by different workflows, and each workflow might have a different idea of its own outputs. All in all, it no longer makes sense to define workflow outputs in the process definition.

Instead, it should be possible to specify an output schema for a workflow, which defines the full set of published outputs, as well as metadata, in terms of the processes (and subworkflows) that the workflow invokes.

The two options discussed so far are:

  1. define the workflow outputs as part of the workflow definition (e.g. an output block) which can reference the outputs of subcomponents
  2. define the workflow outputs in a separate file similar to what is done for nf-core modules

See also #4661 for discussion of where to put additional publish options such as mode and saveAs. The publish mode could become a config option or process directive (e.g. publishOptions.mode) to allow for process-specific configuration, or these options could become part of the output schema.

@pditommaso changed the title from "Workflow outputs schema" to "Workflow output definitions and schema" on Feb 21, 2024
@pditommaso (Member) commented Feb 21, 2024

I see it like this:

  1. Adopt the meta.yml to define the type of each process input & output. Currently, basic types are used, but it could be extended to support structured user-defined types. Nextflow can read all of these, create a runtime view of the types, and validate them before running the workflow (see the sketch after this list).
  2. Add a (top-level) DSL to define the workflow outputs as a replacement for publishDir. This should allow the workflow output definition and configuration to be defined and kept in a single place (more below).
  3. Add an (*optional*) output schema having a similar structure to the current input schema. When it exists, the output definition (2) must match the schema definition.
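
For reference, a minimal sketch of what such a meta.yml could look like, loosely following the nf-core modules convention; the module name, fields, and patterns here are illustrative assumptions, not a settled spec:

name: salmon_quant
input:
  - meta:
      type: map
      description: Sample metadata, e.g. [id: 'sample1']
  - reads:
      type: file
      description: Input reads
      pattern: "*.fastq.gz"
output:
  - results:
      type: directory
      description: Salmon quantification results
      pattern: "${meta.id}"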

Modelling workflow outputs

Point 2 is the core. The goal is to provide a way to define, in a concise and expressive way, the structure of the output produced by a pipeline execution, and to keep this in a single place.

A core pattern that needs to be addressed is how to collect some or all workflow output files into a semantic directory structure, and how to (optionally) create one or more index files; the latter is another very common practice.

For the sake of this example, let's borrow the recently introduced topic channel. This allows collecting all output files created during the pipeline execution that share the same semantics (or simply the same topic name).

We could use this concept (syntax purely hypothetical) to write something like this:

workflow {
  rnaseq(params.something)
}

output {
  directory = params.outdir

  samples {
    topic 'someName'
    path 'subdir/where/store/the/files'
    pattern '*.assembly.fa.gz'
    index {
      type 'csv'
      path 'where/to/store/samples.csv'
    }
  }

  moreHere {
    ..
  }
}

The type information defined at point 1 (i.e. in meta.yml) and associated with the specified topic could be used to define and render the structure of the index file, and to carry out other type validation.

The overall output definition could be validated against the output JSON schema, if provided.

@bentsherman (Member, Author)

I like the idea of using the channel topics to hold workflow outputs.

One thing that's missing from your example is the metadata. Once we have record types this will be easier, but for now I think we'll need some way to translate the standard tuple-with-meta-map pattern into a record for the index file, so that the metadata can be saved as columns.

I think we can add some config to this output DSL to control this mapping, with some sensible default for the most common use case.
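
For instance, here is a minimal sketch of the kind of mapping meant here, with hypothetical sample data; each meta key would become a column of the index file, alongside the published file path:

workflow {
    // a typical process output shape: meta map first, then the file
    ch_bam = Channel.of(
        [ [id: 'sample1', strandedness: 'forward'], file('sample1.bam') ]
    )

    // default mapping: merge the meta keys with the file name into a record
    ch_bam
        .map { meta, bam -> meta + [file: bam.name] }
        .view()
    // prints: [id:sample1, strandedness:forward, file:sample1.bam]
}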

@pditommaso (Member)

> One thing that's missing from your example is the metadata

I believe this needs to be inferred from the meta.yml (when provided), but yes, it could be done in a separate effort.

@bentsherman (Member, Author)

I feel like the meta.yml inputs and outputs mostly duplicate the process definition, though. I would want the type information to be defined in the process.

@marcodelapierre (Member)

From the chat Paolo and I had on Monday, I would say one key aspect of point 1 on meta.yml is the module-level character of some key information: the list of inputs/outputs, their types, corresponding file patterns, and eventually other related metadata.

This work package to upgrade the definition of outputs at the workflow level should leave that info at the module level, and ensure it is defined in such a way that it can then be processed to match the workflow-level output definition (point 2).

I agree, Ben, that ideally the output types should be in the process def. For point 2 above, I wonder whether everything could then be sourced from the process def.

--

However, there is also point 3 on the output schema. Again, to preserve modularity, key process-level information should be defined at the module level, then gathered and processed together with knowledge from 2 to produce the workflow output schema. Where would this module-level info go? The process definition would probably become too cluttered, and that's where the meta.yml is handy.

And here comes a second key point that emerged during the Monday chat: is JSON a good format for these schemas? Do we have room to consider alternatives? (nf-core requirements?) YAML could be a more readable and approachable alternative for pipeline devs.

--

A question to close, going back to Ben's comments: shall we prioritise work to support types in the process definition as a prerequisite to this one? Or can we start working on the re-design of the workflow-level output definition (point 2), and defer type support to later?

@pditommaso (Member)

> I feel like the meta.yml inputs and outputs mostly duplicate the process definition though. I would want the type information to be defined in the process

I agree with that, but having a static type system in the language will require some time. My proposal is to rely on the types in the meta.yml (and other metadata) until we have a proper type system in the language, and to make sure it will be possible to transition from file-based metadata to a fully static language check in the future.

> However, there is also point 3 on the output schema. Again, to preserve modularity, key process-level information should be defined at the module level, gathered and processed together with knowledge from 2 to produce the workflow output schema. Where would this module-level info go?

I believe the proposed approach allows us to combine modularity with the overall output and schema definition.

> shall we prioritise work to support types in the process definition as a prerequisite to this one?

I think I have answered that above.

@bentsherman (Member, Author)

Honestly I would rather not deal with types at all. I thought we agreed it was a separate effort from workflow outputs.

Also, if we use channel topics to facilitate the workflow outputs then we don't really need the meta.yml, because the process definition will specify the target topic for each output.

The 99% use case for metadata is as a tuple with a meta map as the first element. When an output is structured in this way, it is trivial to transform it into a record for the index file.

@pditommaso (Member)

> Honestly I would rather not deal with types at all. I thought we agreed it was a separate effort from workflow outputs.

Agreed, I'll open a separate ticket to track it independently.

@bentsherman (Member, Author)

Okay I think I have enough clarity now to draft a prototype for the outputs DSL proposed by Paolo. I will aim to have it ready by the end of next week.

@marcodelapierre (Member)

A comment on Paolo's concept above.

Do we need to use the topic? Or could we instead name a list of the processes that are part of the workflow and that create the files matching the pattern for that specific output category?
This would make the overall syntax more concise, reusing process names instead of having to add topic syntax across the processes.

@bentsherman (Member, Author)

Well, I really like the topic approach because it's intuitive and should be easy for users to transition to it from publishDir. Of course, workflow outputs are still defined in part at the process level, but not nearly as much as before.

With this outputs DSL, I could search the topic name over the project directory and get a view of the processes that push to that topic, whereas currently I have no such top-level identifier.

One issue with listing the processes is that we need to target specific outputs, not just the process. We could technically do this with the emit names, but this is much more verbose and I'm willing to bet that many users would lean towards using the topics.

Let's try the topic approach (it should be easy to implement) and see how it looks with rnaseq.
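
To make this concrete, here is a minimal sketch of how the two ends could fit together; the process, file names, and the output block are illustrative assumptions, and the topic output option follows the topic-channel preview feature (exact syntax may differ):

process ASSEMBLE {
    input:
    tuple val(meta), path(reads)

    output:
    // route this output to a topic instead of publishing via publishDir
    tuple val(meta), path('*.assembly.fa.gz'), topic: assemblies

    script:
    """
    assemble.sh ${reads} | gzip > ${meta.id}.assembly.fa.gz
    """
}

// hypothetical output DSL, following Paolo's proposal above
output {
    samples {
        topic 'assemblies'
        path 'assembly'
        pattern '*.assembly.fa.gz'
    }
}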

@marcodelapierre (Member) commented Feb 26, 2024

Ben, thank you for articulating this further; it's helpful for the conversation. I agree with your points, let's start with the topic approach.

I will still share an extra thought that is secondary to the first draft, but I think it is useful in view of second-pass refinements and for discussing the broader context.

Default output topic: what about having a workflow always implicitly define a defaultOutput topic, where all outputs of all processes go? This would constitute a baseline default, useful for beginners/simple pipelines, to keep the minimal syntax simple.

Along this line, we could define a few internal output defaults in Nextflow:

  • output.<OutputCollection>.topic 'defaultTopic'
  • output.<OutputCollection>.pattern null # has to be specified, otherwise no file published
  • output.<OutputCollection>.path '.'
  • output.directory '.'

So, keeping in mind Paolo's original example from above (output index omitted):

output {
  directory = params.outdir

  samples {
    topic 'someName'
    path 'subdir/where/store/the/files'
    pattern '*.assembly.fa.gz'
  }

  moreHere {
    ..
  }
}

We could provide a minimal output specification that leverages defaults:

output {
  samples {
    pattern '*.assembly.fa.gz'
  }

  moreHere {
    ..
  }
}

In most production pipelines, people would also resort to multiple topics and paths to manage and structure the output, as described above. But this would build on top of defaults that enable an easy start for simple cases.

@pditommaso (Member)

> what about having a workflow always implicitly defining a defaultOutput topic

Essentially a kind of implicit topic. I thought the same; in any case, let's draft the prototype using explicit topic names, and we can decide later how to refine it.

@bentsherman (Member, Author)

I have hit an issue with the topic approach. If a process is imported multiple times, e.g. under different aliases or by different workflows, there is no way to distinguish the invocations by topic, because the topic mapping is defined in the process definition.

For example, the SALMON_QUANT process is used by two different workflows in rnaseq, FASTQ_SUBSAMPLE_FQ_SALMON and QUANTIFY_PSEUDO_ALIGNMENT. The outputs of this process are published in the second case but not the first. But there is no way to implement this selective behavior with a topic, because the topic is defined once for SALMON_QUANT; see the sketch below.

It seems like the topic approach won't work because it cannot be as precise as the process selectors that are used heavily with publishDir.
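
A sketch of the clash; the module path and input channel are hypothetical:

// both aliases inherit the single topic declared inside SALMON_QUANT,
// so their outputs land in the same topic and cannot be told apart
include { SALMON_QUANT as SUBSAMPLE_SALMON } from './modules/salmon_quant'
include { SALMON_QUANT as QUANTIFY_SALMON  } from './modules/salmon_quant'

workflow {
    reads = Channel.fromPath(params.reads)
    SUBSAMPLE_SALMON(reads)   // outputs should NOT be published
    QUANTIFY_SALMON(reads)    // outputs SHOULD be published
}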

@bentsherman (Member, Author)

@marcodelapierre you might have had the right idea. Adapting the DSL example to use process selectors:

// nextflow.config
process {
    withName: 'GUNZIP_.*|MAKE_TRANSCRIPTS_FASTA' {
        publishDir = [
            path: { params.save_reference ? "${params.outdir}/genome" : params.outdir },
            mode: params.publish_dir_mode,
            saveAs: { filename -> (filename != 'versions.yml' && params.save_reference) ? filename : null }
        ]
    }
}
// main.nf
output {
    genome {
        withName 'GUNZIP_.*'
        withName 'MAKE_TRANSCRIPTS_FASTA'
        path 'genome'
        // pattern '...'
        // index { ... }
    }
}

We can allow multiple withName entries to break up the use of | in config selectors, and we can use pattern to select specific outputs. So it's not as verbose as I thought.

@bentsherman (Member, Author)

Working through rnaseq with this pattern, I have essentially re-created the YAML output schema that I showed previously, but in the DSL:

output {
    aligner {
        path params.aligner
        withName '.*:QUANTIFY_STAR_SALMON:SALMON_QUANT'
        withName '.*:QUANTIFY_STAR_SALMON:CUSTOM_TX2GENE'
        withName '.*:QUANTIFY_STAR_SALMON:TXIMETA_TXIMPORT'
        withName '.*:QUANTIFY_STAR_SALMON:SE_.*'
        withName 'NFCORE_RNASEQ:RNASEQ:SAMTOOLS_SORT' {
            when params.save_align_intermeds || params.save_umi_intermeds
            pattern '*.bam'
        }
        withName 'NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON' {
            when params.save_align_intermeds || params.save_umi_intermeds
            pattern '*.bam'
        }
        withName 'NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON' {
            when params.save_align_intermeds || params.save_umi_intermeds
            subpath 'umitools/log'
            pattern '*.log'
        }
        // ...
    }
}

@pditommaso (Member)

Tricky point. I don't love the use of withName to solve this; it adds too much noise in the DSL. I'd suggest going ahead with the POC, and keeping this as a point to be addressed later.

@marcodelapierre (Member)

> you might have had the right idea

Admittedly I hadn't foreseen the topics vs aliases clash 😅 .. but glad to have given you some inspiration 😉

@marcodelapierre (Member)

@pditommaso 👍 let's go ahead with the POC and evaluate the withName issue later.

We do need a way to match output patterns to the appropriate processes in the workflows... do you reckon using withName is overloading its original purpose too much?

@pditommaso (Member)

Yeah, I think we need some kind of labeling/annotation system to handle this. It may overlap with the #4510 effort.

@bentsherman (Member, Author)

This is not something that can be pushed off for later. nf-core pipelines will not use it if they can't differentiate between process inclusions.

On top of that, specific processes have some specific config, like toggling based on a param, which cannot be expressed with topics.

The thing about noise is that rnaseq will be complicated no matter how you slice it. It's just a really big pipeline. The metric I would use instead is: what is the essential information that needs to be expressed, and what is the simplest way to express it?

So, we need to be able to capture specific outputs of specific process invocations. The pull approach would be to use process selectors and file patterns as shown above. The push approach would be... you would have to apply some annotations to the process invocation in the workflow. That strikes me as being worse than the process selectors, but I'm happy to be proven wrong.

To Paolo's point about module config, this could simplify things by breaking up the output definition into modules, just as I have done with the config files. Each subworkflow could define its own outputs (now feasible with process selectors). But I'll have to think about how to handle composition and overrides.

I initially used select instead of withName but ultimately switched to the latter to make it clear that it was a process selector, but I kind of like select better in this context.

@pditommaso (Member)

I agree that's an important point for real-world pipelines, but I don't agree that it's necessary for a POC to validate the concept of:

  1. having a DSL that describes the workflow outputs
  2. creating a semantic directory organisation
  3. creating the corresponding index files
  4. possibly validating against the output schema file

@bentsherman (Member, Author)

Well, if it makes no difference for the POC, and the two approaches are about the same difficulty to implement, then I will use the approach that is less likely to be immediately discarded.

@marcodelapierre (Member) commented Feb 28, 2024

The withName select approach has some gray areas to be resolved but is of general applicability, whereas the topic approach has clear limitations that we know will make it unsuitable for the final design.

So, I would suggest we go ahead with the POC using withName, as agreed, knowing that its use will need later review.

@stevekm (Contributor) commented Mar 7, 2024

I just submitted an issue for nf-prov, nextflow-io/nf-prov#23, which I think is actually related to the subject discussed here, in that I am looking for ways to preserve the workflow output schema metadata so that it can be accessed after the workflow is completed, such as by a chained follow-up pipeline.

So I am wondering if the approaches proposed here would be amenable to saving the final metadata about published files into something like the manifest JSON?

On a separate note, I was also planning to investigate moving all of the publishDir directives from my pipeline into a single config file, instead of having them scattered across all the individual module files, for the exact reasons described here as well:

> it is extremely difficult to get a full picture of a workflow's outputs because they are scattered across the process definitions.

So I am also wondering if the proposed methods here would accomplish this and make it easier to put all of these in a single config file?

I also noticed the discussion at #2723, which suggests that there could be major changes to the config file format, so I can't help but wonder how that might also play into this.

Due to the huge number of different files being output by pipelines, it seems like documentation always ends up needing to be written just to describe what a pipeline's output files are. Maybe there's some way that a dedicated schema definition used internally by Nextflow could be easily exported and turned into the starting point of such "pipeline output file docs"?

@pditommaso (Member)

> So I am wondering if the approaches proposed here would be amenable to saving the final metadata about published files into something like the manifest JSON?

Yes, this is something we want to achieve with this effort. I would be interested to learn more about what metadata would be useful for your use case. Do you have any examples to share?

> So I am also wondering if the proposed methods here would accomplish this and make it easier to put all of these in a single config file?

The main goal is to go beyond the concept of publishDir and provide a unified specification for the workflow output definition, either in the main workflow script or in the Nextflow config file.

@marcodelapierre (Member)

@pditommaso update after the team meeting.

@bentsherman consulted with SciDev, and as a result he drafted a second POC based on an alternate approach. He is also going to discuss the two POCs with the community during the current Hackathon (19-20 March).

So now we have:

  1. POC based on process selectors: the first go, simple and concise, but conceptually less elegant

  2. POC based on emitted channels: conceptually elegant, more verbose as it requires wiring channels to be published all the way to the final emit; crucially, the SciDev team gave positive feedback on this approach

Option 2 seems an excellent upgrade compared to the first iteration in terms of general design. Please, Paolo, everyone, feel free to have a look and share feedback on it!

@pditommaso (Member)

The second one is still far from what was proposed:

  1. use path for aggregation instead of using a grouping name
  2. use select and process names instead of topic names
  3. no index file creation
  4. inside the workflow block instead of a separate output scope

@bentsherman (Member, Author) commented Mar 19, 2024

> use select and process names instead of topic names

Topics do not work here, I have already explained this.

> no index file creation

This is more of an optional add-on; it can easily be added at any time if users would find it valuable.

> use path for aggregation instead of using a grouping name
> inside the workflow block instead of a separate output scope

These are minor differences that can be easily changed; I wouldn't focus on them.

@marcodelapierre (Member) commented Mar 20, 2024

Paolo, also note that in the second POC, select is used on channels, not on processes.

I agree that we should discuss point 1 a couple of comments above, i.e. the means of aggregating/specifying outputs.
For instance, on my end I am wondering if we could group these specifications in the emit block, i.e. by channel and not by path (e.g. add publishing attributes to those emitted channels that need them). In this way, we would have publishing specifications only under emit and optionally in the schema, with less repetition. However, this might worsen the readability/maintainability of the workflow. This is along similar lines to the comments by @adamrtalbot in the draft fetchngs PR, and by @pinin4fjords in the feature PR.
That being said, I also agree with Ben that this is a minor point of discussion, which can be postponed until after we have agreed on the general approach.

@marcodelapierre (Member)

Pinging also @ewels in case you have any comments to share :-)

@bentsherman (Member, Author)

It's actually straightforward to make channel topics flexible enough to work here. You just need an operator, e.g. topic or intoTopic, which allows you to send any channel into a topic:

FOO()
FOO.out.fastq | topic('fastq')

This way you can send the output channels of a specific process invocation to a topic, just like the topic option on a process output. Essentially what @marcodelapierre proposed here... you were right again, Marco!

Then we can support the topic approach in the output DSL:

output {
    path("${params.outdir}/fastq") {
         topic 'fastq'

         // just a shortcut for...
         select Channel.topic('fastq')
    }
}

The way I see it, channel selectors as shown in nf-core/fetchngs#302 are the core functionality, and things like topic selectors and index file creation are optional enhancements.

@pditommaso (Member)

Is there an example of solution 2 for rnaseq?

@pditommaso (Member)

> It's actually straightforward to make channel topics flexible enough to work here

Agreed, that could be a nice way to handle it.

@marcodelapierre (Member)

It makes a lot of sense; glad to see how some pieces are coming together here... very good team conversation! :-)

And by forwarding channels into topics beforehand, for certain pipelines we would also be able to reduce congestion in the emit: block.

@marcodelapierre (Member)

Following up on Paolo's feedback, @bentsherman I would suggest going ahead with the implementation and POC along these lines.

@adamrtalbot (Collaborator)

Topics add another layer for someone to learn and understand; output channels are direct and similar to patterns they will already be using. I'm not 100% against using them, but I want to see a clear benefit over just using a channel.

@adamrtalbot (Collaborator)

Or make it optional!

// my_outputs = RNASEQ()

output {
    path("${params.outdir}/fastq") {
         // These three are synonymous
         topic  'fastq'
         select RNASEQ.out.fastq
         select my_outputs.fastq
    }
}

@bentsherman (Member, Author)

@adamrtalbot agree 100%, I see channel topics as a convenience over the otherwise boring but reliable approach of channel selectors.

The nice thing about channel selectors is that we're already gonna expand the workflow emits in order to make it easier to import pipelines as subworkflows. Where I think topics could be useful is to publish outputs that you wouldn't typically emit to downstream workflows, i.e. intermediate outputs.
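
A rough sketch of that split; TRIM and ALIGN are hypothetical processes, and topic is the intoTopic-style operator proposed above rather than a finalized API:

workflow RNASEQ {
    take:
    reads

    main:
    TRIM(reads)
    ALIGN(TRIM.out.reads)

    // intermediate output: publish via a topic instead of emitting it
    TRIM.out.log | topic('trim_logs')

    emit:
    // final output: exposed to downstream workflows and channel selectors
    bam = ALIGN.out.bam
}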

@pditommaso not yet, but I think I will try it once I update the Nextflow implementation. It should be a good way to test the usefulness of topics

@pinin4fjords (Contributor) commented Mar 21, 2024

Agree with @adamrtalbot (I think I'm just his echo most of the time): I think topics are an unnecessary layer for the basic use case. In fact, I'm a little fuzzy on them myself.

@pditommaso (Member)

Team, we are still prototyping this feature, and I believe topic channels as suggested in the latest comments should be included. Please go ahead.

@marcodelapierre (Member)

@bentsherman @pditommaso a question: I was reviewing our current implementation of topic channels.
This is not a blocker at all for going ahead; more of a thought for the second iteration of refinements.

Right now (see e.g. docs, and also Ben's example above) the way to refer to topics is with the syntax:

channel.topic('my-topic').view()

As a topic is ultimately a type of channel, why not enable a more consistent and concise syntax:

my-topic.view()    

This would be quite beneficial for the user experience, also enabling people to think of a topic as just any other kind of channel.
E.g. I would imagine this addresses, at least partially, the concerns by @adamrtalbot and @pinin4fjords:

(quoting)

> Topics add another layer for someone to learn and understand; output channels are direct and similar to patterns they will already be using.

> I think topics are an unnecessary layer for the basic use case. In fact, I'm a little fuzzy on them myself.

In the end, as Ben said, using a topic channel can help publish intermediate outputs in a concise way. In doing so, I would try to keep the user experience/learning curve as smooth as reasonably possible.

@pditommaso (Member)

> why not enable a more consistent and concise syntax

Broadly speaking, I believe topic channels need to be called out explicitly due to their "global" nature. However, I would consider changing the topic syntax a non-goal of this feature. Let's focus on the core functionality.

@bentsherman (Member, Author)

nf-core/fetchngs#275 is working with channel selectors. Topics were not needed.

I took a stab at the rnaseq POC, but it is much less straightforward overall with channel selectors. I might just try to get a rough approximation to start; maybe do everything with topics first and then see where it can be simplified.

@marcodelapierre FWIW I like the idea of making the topic a global variable rather than a string literal, or at the very least not allowing topic names to be dynamic. That would allow the compiler to trace the use of a topic throughout the codebase, i.e. "go to definition" / "go to references". I'll keep that in mind for the language server.

@bentsherman (Member, Author)

Interestingly, the channel selector is just the publish operator that many have requested (#1540), with the topic selector being syntax sugar over that. The difference here is restricting the use of this "publish op" to a dedicated output block rather than allowing it to be used arbitrarily in the workflow logic.
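
For illustration, the contrast might look like this; publish is the hypothetical free-standing operator from #1540, and the output block follows the DSL drafted in this thread:

// unrestricted: a publish operator usable anywhere in the workflow logic
workflow {
    FOO()
    FOO.out.fastq | publish("${params.outdir}/fastq")
}

// restricted: the same selection, allowed only in a dedicated output block
output {
    path("${params.outdir}/fastq") {
        select FOO.out.fastq
    }
}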
