Validation of published outputs #3372
Comments
This is a great point. When the etag for both source and target files is the same, the copy can be skipped. See nextflow/plugins/nf-amazon/src/main/com/upplication/s3fs/S3FileSystemProvider.java, lines 610 to 627 in 34f133e.
I was looking into this, and the main problem is that there's no etag for directory paths, which are a very common case for NF.
Ah, I see. Nextflow treats something like

output:
  path("my_out_dir")

as a single path, rather than as the collection of files inside it.

As I see it, to make this possible, we'd need to change the publication mechanism to walk the tree in cases where the output is a directory. This sounds like a pain, but if (big if) it is possible to walk that tree, it might give us a secondary win. At the moment, if you declare an output as shown above, where we have a report inside that directory, and we have

reports:
  "*.html":
    display: "Example html report"

the html report is not uploaded to Tower because it is not explicitly an output (only its parent directory is). Perhaps if we end up walking through the output directory, we can compare the etags on each file and pass each file through TowerClient.
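As a rough illustration of the "secondary win" described above, the sketch below walks an output directory and collects the files whose names match a report glob such as `*.html`. The class and method names (`ReportFinder`, `findReports`) are hypothetical and are not part of Nextflow or Tower; how Tower actually matches report patterns is not shown in this thread, so matching on the file name alone is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only (not Nextflow or Tower code).
final class ReportFinder {

    // Walk outputDir recursively and return the regular files whose
    // file name matches the given glob (for example "*.html").
    static List<Path> findReports(Path outputDir, String glob) throws IOException {
        PathMatcher matcher = outputDir.getFileSystem().getPathMatcher("glob:" + glob);
        try (Stream<Path> walk = Files.walk(outputDir)) {
            return walk.filter(Files::isRegularFile)
                       // Assumption: the pattern applies to the file name only.
                       .filter(p -> matcher.matches(p.getFileName()))
                       .sorted()
                       .collect(Collectors.toList());
        }
    }
}
```

Each path returned this way could then be treated as an individual output, which is what a per-file etag comparison would also need.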
Well, we could try to traverse the directory and check the etag file by file. We are doing something similar with file sizes here: nextflow/modules/nextflow/src/main/groovy/nextflow/file/FilePorter.groovy, lines 338 to 343 in ef7b73a.
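A minimal sketch of that file-by-file traversal might look like the following. It is not the FilePorter code referenced above; `DirCompare` and `staleFiles` are hypothetical names, and file size is used as a stand-in for the etag because a local file system has no etag. On S3, the size comparison would be replaced by a comparison of each object's etag.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only (not Nextflow code).
final class DirCompare {

    // Return the paths (relative to source) of regular files that are
    // missing from target or whose size differs from the source file.
    static List<Path> staleFiles(Path source, Path target) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(source)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<Path> stale = new ArrayList<>();
        for (Path file : files) {
            Path rel = source.relativize(file);
            // Resolve via String so source and target may use different providers.
            Path dest = target.resolve(rel.toString());
            // Assumption: size stands in for the etag comparison.
            if (!Files.isRegularFile(dest) || Files.size(dest) != Files.size(file)) {
                stale.add(rel);
            }
        }
        return stale;
    }
}
```

Only the files returned by `staleFiles` would need to be copied again, so an unchanged directory would cost one metadata lookup per file rather than a full re-upload.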
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Published Output Validation
At the moment, Nextflow is conservative about when it writes published outputs. If a file already exists in the publishDir location, Nextflow does not re-publish it. If the destination file is truncated, or has been changed by some external process, Nextflow doesn't check whether the source (in the task work directory) differs from the destination (in the publishDir location).
It would be good if Nextflow made stronger checks than simple filename matching. In cases where the source and destination are both on S3, we can provide strong guarantees by comparing the file content hashes exposed in the etags provided by S3.
Usage scenario
-resume
In this case, we would expect that Nextflow should notice that the published file is "stale", and that it needs to be re-published.
Suggest implementation
In cases where both files are on S3, we can do cheap file integrity checking by comparing the hash values in the S3 etags.
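The decision itself could be sketched as below, assuming the two etags have already been fetched from S3 (the fetching is omitted so the example needs no AWS SDK). `EtagCheck`, `normalize` and `shouldRepublish` are hypothetical names, not Nextflow APIs. The sketch strips the surrounding double quotes that the HTTP `ETag` header carries, and treats a missing etag as a reason to re-publish.

```java
// Hypothetical helper, for illustration only (not Nextflow code).
final class EtagCheck {

    // Strip whitespace and the surrounding double quotes used by the
    // HTTP ETag header; return null for a missing or empty value.
    static String normalize(String etag) {
        if (etag == null) {
            return null;
        }
        String t = etag.trim();
        if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
            t = t.substring(1, t.length() - 1);
        }
        return t.isEmpty() ? null : t;
    }

    // Re-publish when either etag is unknown or when the two differ.
    static boolean shouldRepublish(String sourceEtag, String targetEtag) {
        String s = normalize(sourceEtag);
        String t = normalize(targetEtag);
        if (s == null || t == null) {
            return true; // cannot verify, so copy to be safe
        }
        return !s.equals(t);
    }

    public static void main(String[] args) {
        System.out.println(shouldRepublish("\"abc123\"", "abc123"));
        System.out.println(shouldRepublish("abc123", null));
    }
}
```

One caveat: an S3 etag is not always a plain MD5 of the content. Objects uploaded in multiple parts get an etag with a `-N` suffix that depends on the part layout, so identical content can carry different etags. With the rule above, that case leads to an unnecessary re-publish rather than a missed stale file.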