
Validation of published outputs #3372

Open
robsyme opened this issue Nov 10, 2022 · 5 comments · May be fixed by #3933 or #4729

robsyme commented Nov 10, 2022

Published Output Validation

At the moment, Nextflow is conservative about when it writes published outputs. If a file already exists at the publishDir location, Nextflow does not re-publish it. If the destination file has been truncated, or has been changed by some external process, Nextflow doesn't check whether the source (in the task work directory) differs from the destination (in the publishDir location).

It would be good if Nextflow performed stronger checks than simple filename matching. In cases where the source and destination are both on S3, we can provide strong guarantees by comparing the content hashes in the ETags provided by S3.

Usage scenario

  1. Nextflow runs to completion
  2. A user or external process makes changes to an output file (but does not change the file name)
  3. Nextflow runs again with -resume

In this case, we would expect that Nextflow should notice that the published file is "stale", and that it needs to be re-published.
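The scenario above amounts to a checksum comparison at resume time. A minimal, hypothetical sketch (this is not Nextflow's actual cache logic; the recorded checksum is an assumption about what would be stored at publish time):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class StaleCheck {
    // Hypothetical helper: MD5 hex digest of a file's contents
    static String md5(Path p) throws Exception {
        return java.util.HexFormat.of().formatHex(
            MessageDigest.getInstance("MD5").digest(Files.readAllBytes(p)));
    }

    public static void main(String[] args) throws Exception {
        Path published = Files.createTempFile("out", ".txt");
        Files.writeString(published, "original result");
        String recordedAtPublish = md5(published);      // step 1: run completes

        Files.writeString(published, "edited by user"); // step 2: external change

        // step 3: on -resume, a checksum mismatch marks the file as stale
        boolean stale = !md5(published).equals(recordedAtPublish);
        System.out.println(stale ? "re-publish" : "skip");
    }
}
```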

Suggested implementation

In cases where both files are on S3, we can do cheap file-integrity checking by comparing the hash values in the S3 ETags.
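For context on that check: an S3 ETag is a plain MD5 hex digest for single-part uploads, and for multipart uploads it is the MD5 of the concatenated part digests with a `-N` part-count suffix. It can therefore be recomputed locally and compared against the value the destination object reports. A hypothetical sketch (not Nextflow code; the helper name and part size are assumptions):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class S3EtagCheck {
    // Compute the ETag S3 would report for `data` uploaded with the given
    // part size: plain MD5 hex for single-part, md5(part MD5s) + "-N" for multipart.
    static String s3Etag(byte[] data, int partSize) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        if (data.length <= partSize) {
            return hex(md5.digest(data));
        }
        ByteArrayOutputStream partDigests = new ByteArrayOutputStream();
        int parts = 0;
        for (int off = 0; off < data.length; off += partSize, parts++) {
            int len = Math.min(partSize, data.length - off);
            md5.reset();
            md5.update(data, off, len);
            partDigests.write(md5.digest());
        }
        md5.reset();
        return hex(md5.digest(partDigests.toByteArray())) + "-" + parts;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "hello world".getBytes(StandardCharsets.UTF_8);
        // A re-publish is needed only when the locally computed ETag
        // differs from the one reported by the destination object.
        String local = s3Etag(content, 5 * 1024 * 1024);
        String remote = "5eb63bbbe01eeed093cb22bb8f5acdc3"; // md5("hello world")
        System.out.println(local.equals(remote) ? "in-sync" : "stale");
    }
}
```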

@pditommaso
Member

This is a great point. When the ETag for the source and target files is the same, the copy can be skipped:

```java
AmazonS3Client client = s3Source.getFileSystem().getClient();
final ObjectMetadata sourceObjMetadata = client.getObjectMetadata(s3Source.getBucket(), s3Source.getKey());
final S3MultipartOptions opts = props != null ? new S3MultipartOptions(props) : new S3MultipartOptions();
final long maxSize = opts.getMaxCopySize();
final long length = sourceObjMetadata.getContentLength();
final List<Tag> tags = ((S3Path) target).getTagsList();
final String contentType = ((S3Path) target).getContentType();

if( length <= maxSize ) {
    CopyObjectRequest copyObjRequest = new CopyObjectRequest(s3Source.getBucket(), s3Source.getKey(), s3Target.getBucket(), s3Target.getKey());
    log.trace("Copy file via copy object - source: source={}, target={}, tags={}", s3Source, s3Target, tags);
    client.copyObject(copyObjRequest, tags, contentType);
}
else {
    log.trace("Copy file via multipart upload - source: source={}, target={}, tags={}", s3Source, s3Target, tags);
    client.multipartCopyObject(s3Source, s3Target, length, opts, tags, contentType);
}
```


pditommaso commented Nov 23, 2022

I was looking into this, and the main problem is that there's no ETag for directory paths, which is a very common case for Nextflow's publishDir.


robsyme commented Nov 24, 2022

Ah, I see. Nextflow treats something like

```groovy
output:
path("my_out_dir")
```

... as a single path, rather than the collection of files inside my_out_dir. Without knowing the contents of that directory, Nextflow can't do the ETag comparison on each file.

As I see it, to make this possible, we'd need to change the publication mechanism to walk the tree in cases where the output is a directory. This sounds like a pain... but if (big if) it is possible to walk that tree, it might give us a secondary win:

At the moment, if you declare an output as shown above, where we have a report inside that directory

```
my_out_dir
├── data.bam
└── report.html
```

and we have a tower.yml

```yaml
reports:
    "*.html":
        display: "Example html report"
```

... the HTML report is not uploaded to Tower because it is not explicitly an output (only its parent directory is). Perhaps if we end up walking through the output directory, we can compare the ETags on each file and pass each one through TowerClient's onFilePublish() to check whether it should be uploaded to Tower as a report.

https://github.com/nextflow-io/nextflow/blob/master/plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerClient.groovy#L476-L480
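The walk-and-match idea could be sketched like this (a hypothetical sketch, not Nextflow code; `TowerClient#onFilePublish` is only referenced in a comment, and the glob handling is simplified to the file name):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class ReportMatcher {
    public static void main(String[] args) throws IOException {
        // Stand-in for a published output directory
        Path outDir = Files.createTempDirectory("my_out_dir");
        Files.writeString(outDir.resolve("data.bam"), "");
        Files.writeString(outDir.resolve("report.html"), "<html/>");

        // Same glob syntax tower.yml uses for its report patterns
        PathMatcher html = FileSystems.getDefault().getPathMatcher("glob:*.html");

        try (Stream<Path> files = Files.walk(outDir)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> html.matches(p.getFileName()))
                 // here the real code would call TowerClient#onFilePublish(p)
                 .forEach(p -> System.out.println("report: " + p.getFileName()));
        }
    }
}
```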

@pditommaso
Member

Well, we could try to traverse the directory to check the ETag file by file. We are doing something similar here with file sizes:

```groovy
// the file must have the same size. this is needed
// to prevent re-using broken files left by a previous interrupted download
final attrs = Files.readAttributes(source, BasicFileAttributes)
final same = attrs.isDirectory()
    ? checkDirIntegrity0(source, target)
    : attrs.size() == Files.size(target)
```
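A directory-integrity walk along those lines might look like the following sketch (file size stands in for a per-file ETag comparison; this is not Nextflow's actual checkDirIntegrity0, and the names are hypothetical):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class DirCompare {
    // Returns true when every regular file under `source` exists under
    // `target` with the same size (stand-in for a per-file ETag check).
    static boolean isInSync(Path source, Path target) throws IOException {
        try (Stream<Path> files = Files.walk(source)) {
            return files
                .filter(Files::isRegularFile)
                .allMatch(src -> {
                    Path dst = target.resolve(source.relativize(src).toString());
                    try {
                        return Files.isRegularFile(dst)
                            && Files.size(src) == Files.size(dst);
                    } catch (IOException e) {
                        return false; // unreadable destination counts as stale
                    }
                });
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("src");
        Path dst = Files.createTempDirectory("dst");
        Files.writeString(src.resolve("report.html"), "<html/>");
        Files.writeString(dst.resolve("report.html"), "<html/>");
        System.out.println(isInSync(src, dst)); // same sizes -> true
        Files.writeString(dst.resolve("report.html"), "tru");
        System.out.println(isInSync(src, dst)); // size mismatch -> false
    }
}
```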


stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 17, 2024
@bentsherman bentsherman removed the stale label Mar 18, 2024