Publishing a symlink when using fusion copies link (mock symlink) rather than link target #4967

robsyme · 2024-05-02T18:58:42Z

Bug report

Given a simple workflow that creates a file

workflow {
    Channel.value("Seqera")
    | MakeJson
    | PublishIdCard
}

process MakeJson {
    input: val(name)
    output: tuple val(name), path('*.json')
    script: 
    """
    echo '{"name":"$name"}' > id.${name}.json
    """
}

process PublishIdCard {
    publishDir 'results'

    input: tuple val(name), path("jsons/*")
    output: path("symlink.json")
    script: "ln -s jsons/id.${name}.json symlink.json"
}

... and configuration that sets fusion:

workDir = 's3://my-bucket-name/work'
wave.enabled = true
fusion.enabled = true
docker.enabled = true
process.scratch = false
process.container = 'ubuntu:latest'

Expected behavior and actual behavior

I would expect the contents of the file published at ./results/symlink.json to include the json document

{"name":"Seqera"}

...but instead it is a text file containing:

jsons/id.Seqera.json

Notes

The .fusion.symlinks document in the task working directory of the PublishIdCard task contains the text symlink.json, but Nextflow doesn't seem to follow the symlink when publishing.

The text was updated successfully, but these errors were encountered:

bentsherman · 2024-05-02T20:40:13Z

Not sure this is worth supporting. Normally we detect fusion symlinks by checking if the output file is also an input, which is cheap because the list of inputs is in memory. But detecting this would require checking for .fusion.symlnks for every task, including subdirectories. This is because, though this example is simple, there is no way to know in general if a task will link inputs to outputs. So now you're doing a bunch of extra S3 calls for every task, even if you never use this pattern.

Instead, I would recommend that you either copy the file or declare the input file as an output directly (e.g. output: path("jsons/id.${name}.json")). I understand this might be more verbose compared to e.g. a single output directory containing a bunch of different files. It might be more readable when we have static types #4553 and can output a record type representing a directory of diverse output files.

pditommaso · 2024-05-03T09:43:11Z

However, this is introducing an increasing inconsistency between Fusion and non-fusion usage.

Normally this is not a problem because the OS knows how to handle symlinks, which obviously is not the case for Fusion.

I'm starting to think it should be the role of Fusion to concretize symlinks when referenced in the publishDir.

Now that were are introducing the task meta-comment, it should be an easier task. Adding @jordeu for visibility

bentsherman · 2024-05-03T13:12:29Z

should be the role of Fusion to concretize symlinks when referenced in the publishDir

I assume you mean the process outputs? Because we are decoupling the publishing from the task execution with the workflow publish

So if a process output turns out to be a symlink -- whether because it was also an input or it was linked by the task, or some other reason -- Fusion should upload a copy to S3. Then Nextflow doesn't need to check for forwarded inputs when publishing files

This is actually necessary with the workflow publish, because files published by the workflow don't know the task they came from, so they can't check if they are a forwarded input

pditommaso · 2024-05-03T13:20:50Z

No, I mean only publish, because non-publish output files will be used by Fusion in downstream task and, I guess, Fusion is able to handle properly

bentsherman · 2024-05-03T13:22:16Z

Well, if we use the workflow publish instead of publishDir then the task doesn't know which outputs will be published

jordeu · 2024-05-03T13:28:42Z

What is the "the workflow publish"?

bentsherman · 2024-05-03T14:04:47Z

Workflow publish (i.e. output) definition: #4784

bentsherman · 2024-05-06T16:24:37Z

I was going to summarize the hybrid solution we discussed on the call, but thinking through it now I realize that it won't work. The workflow output definition makes it difficult for a task to know whether an output file will be published.

I think the best thing we can do instead is (1) give Fusion the output file patterns and (2) Fusion should always upload a copy if a symlink is an output file. This will only happen when an output file is a symlink, which is relatively uncommon. We can also remove the Fusion symlink logic from Nextflow.

pditommaso · 2024-05-06T17:21:53Z

I think the best thing we can do instead is (1) give Fusion the output file patterns and (2) Fusion should always upload a copy if a symlink is an output file. This will only happen when an output file is a symlink, which is relatively uncommon. We can also remove the Fusion symlink logic from Nextflow.

it makes sense to me

robsyme · 2024-05-08T12:29:32Z

Giving Fusion the output patterns is a huge win because it provides the opportunity for Fusion to prioritize uploading particular files and potentially abandon the upload of non-output files when the task is complete.

jordeu · 2024-05-08T13:03:41Z

Distinguishing between input, intermediate, and output files in the working directory opens the door to many optimizations.

But still, I'm not convinced that it is a good idea to "materialize" all the outputs as objects when they are symbolic links. Imagine, for example, that the output is a symbolic link to a folder with thousands of small files. Materializing all these thousands of files in an object store is a costly operation.

bentsherman · 2024-05-08T13:07:15Z

Keep in mind that it only heeds to happen when the user creates such a link or forwards an input as an output. These cases are uncommon and can be easily avoided by writing the pipeline differently

bentsherman added the storage/fusion label May 2, 2024

This comment was marked as outdated.

Sign in to view

bentsherman mentioned this issue May 6, 2024

Fix Fusion symlink resolution with nested files #4726

Closed

bentsherman mentioned this issue May 16, 2024

Remove Fusion symlink resolution #5004

Merged

bentsherman linked a pull request May 16, 2024 that will close this issue

Remove Fusion symlink resolution #5004

Merged

pditommaso closed this as completed in #5004 May 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Publishing a symlink when using fusion copies link (mock symlink) rather than link target #4967

Publishing a symlink when using fusion copies link (mock symlink) rather than link target #4967

robsyme commented May 2, 2024 •

edited by pditommaso

bentsherman commented May 2, 2024

This comment was marked as outdated.

pditommaso commented May 3, 2024

bentsherman commented May 3, 2024

pditommaso commented May 3, 2024

bentsherman commented May 3, 2024

jordeu commented May 3, 2024

bentsherman commented May 3, 2024

bentsherman commented May 6, 2024

pditommaso commented May 6, 2024

robsyme commented May 8, 2024

jordeu commented May 8, 2024

bentsherman commented May 8, 2024

Publishing a symlink when using fusion copies link (mock symlink) rather than link target #4967

Publishing a symlink when using fusion copies link (mock symlink) rather than link target #4967

Comments

robsyme commented May 2, 2024 • edited by pditommaso

Bug report

Expected behavior and actual behavior

Notes

bentsherman commented May 2, 2024

This comment was marked as outdated.

pditommaso commented May 3, 2024

bentsherman commented May 3, 2024

pditommaso commented May 3, 2024

bentsherman commented May 3, 2024

jordeu commented May 3, 2024

bentsherman commented May 3, 2024

bentsherman commented May 6, 2024

pditommaso commented May 6, 2024

robsyme commented May 8, 2024

jordeu commented May 8, 2024

bentsherman commented May 8, 2024

robsyme commented May 2, 2024 •

edited by pditommaso