
Validation of published outputs #3372

Open
robsyme opened this issue Nov 10, 2022 · 5 comments · May be fixed by #3933 or #4729

robsyme commented Nov 10, 2022

Published Output Validation

At the moment, Nextflow is conservative about when it writes published outputs. If a file already exists at the publishDir location, Nextflow does not re-publish it. If the destination file has been truncated, or has been changed by some external process, Nextflow doesn't check whether the source (in the task work directory) differs from the destination (in the publishDir location).

It would be good if Nextflow performed stronger checks than simple filename matching. In cases where the source and destination are both on S3, we can provide strong guarantees by comparing the content hashes in the ETags provided by S3.

Usage scenario

  1. Nextflow runs to completion
  2. A user or external process makes changes to an output file (but does not change the file name)
  3. Nextflow runs again with -resume

In this case, we would expect that Nextflow should notice that the published file is "stale", and that it needs to be re-published.
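The scenario above amounts to a checksum comparison at resume time. A minimal, hypothetical sketch (this is not Nextflow's actual cache logic; the recorded checksum is an assumption about what would be stored at publish time):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class StaleCheck {
    // Hypothetical helper: MD5 hex digest of a file's contents
    static String md5(Path p) throws Exception {
        return java.util.HexFormat.of().formatHex(
            MessageDigest.getInstance("MD5").digest(Files.readAllBytes(p)));
    }

    public static void main(String[] args) throws Exception {
        Path published = Files.createTempFile("out", ".txt");
        Files.writeString(published, "original result");
        String recordedAtPublish = md5(published);      // step 1: run completes

        Files.writeString(published, "edited by user"); // step 2: external change

        // step 3: on -resume, a checksum mismatch marks the file as stale
        boolean stale = !md5(published).equals(recordedAtPublish);
        System.out.println(stale ? "re-publish" : "skip");
    }
}
```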

Suggested implementation

In cases where both files are on S3, we can do cheap file-integrity checking by comparing the hash values in the S3 ETags.
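For context on that check: an S3 ETag is a plain MD5 hex digest for single-part uploads, and for multipart uploads it is the MD5 of the concatenated part digests with a `-N` part-count suffix. It can therefore be recomputed locally and compared against the value the destination object reports. A hypothetical sketch (not Nextflow code; the helper name and part size are assumptions):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class S3EtagCheck {
    // Compute the ETag S3 would report for `data` uploaded with the given
    // part size: plain MD5 hex for single-part, md5(part MD5s) + "-N" for multipart.
    static String s3Etag(byte[] data, int partSize) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        if (data.length <= partSize) {
            return hex(md5.digest(data));
        }
        ByteArrayOutputStream partDigests = new ByteArrayOutputStream();
        int parts = 0;
        for (int off = 0; off < data.length; off += partSize, parts++) {
            int len = Math.min(partSize, data.length - off);
            md5.reset();
            md5.update(data, off, len);
            partDigests.write(md5.digest());
        }
        md5.reset();
        return hex(md5.digest(partDigests.toByteArray())) + "-" + parts;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "hello world".getBytes(StandardCharsets.UTF_8);
        // A re-publish is needed only when the locally computed ETag
        // differs from the one reported by the destination object.
        String local = s3Etag(content, 5 * 1024 * 1024);
        String remote = "5eb63bbbe01eeed093cb22bb8f5acdc3"; // md5("hello world")
        System.out.println(local.equals(remote) ? "in-sync" : "stale");
    }
}
```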

@pditommaso
Member

This is a great point. When the ETag for the source and target files is the same, the copy can be skipped:

```java
AmazonS3Client client = s3Source.getFileSystem().getClient();
final ObjectMetadata sourceObjMetadata = client.getObjectMetadata(s3Source.getBucket(), s3Source.getKey());
final S3MultipartOptions opts = props != null ? new S3MultipartOptions(props) : new S3MultipartOptions();
final long maxSize = opts.getMaxCopySize();
final long length = sourceObjMetadata.getContentLength();
final List<Tag> tags = ((S3Path) target).getTagsList();
final String contentType = ((S3Path) target).getContentType();

if( length <= maxSize ) {
    CopyObjectRequest copyObjRequest = new CopyObjectRequest(s3Source.getBucket(), s3Source.getKey(), s3Target.getBucket(), s3Target.getKey());
    log.trace("Copy file via copy object - source: source={}, target={}, tags={}", s3Source, s3Target, tags);
    client.copyObject(copyObjRequest, tags, contentType);
}
else {
    log.trace("Copy file via multipart upload - source: source={}, target={}, tags={}", s3Source, s3Target, tags);
    client.multipartCopyObject(s3Source, s3Target, length, opts, tags, contentType);
}
```


pditommaso commented Nov 23, 2022

I was looking into this, and the main problem is that there's no ETag for directory paths, which is a very common case for Nextflow's publishDir.


robsyme commented Nov 24, 2022

Ah, I see. Nextflow treats something like

```groovy
output:
path("my_out_dir")
```

... as a single path, rather than the collection of files inside my_out_dir. Without knowing the contents of that directory, Nextflow can't do the ETag comparison on each file.

As I see it, to make this possible, we'd need to change the publication mechanism to walk the tree in cases where the output is a directory. This sounds like a pain... but if (big if) it is possible to walk that tree, it might give us a secondary win:

At the moment, if you declare an output as shown above, where we have a report inside that directory

```
my_out_dir
├── data.bam
└── report.html
```

and we have a tower.yml

```yaml
reports:
    "*.html":
        display: "Example html report"
```

... the HTML report is not uploaded to Tower because it is not explicitly an output (only its parent directory is). Perhaps if we end up walking through the output directory, we can compare the ETags on each file and pass each one through TowerClient's onFilePublish() to check whether it should be uploaded to Tower as a report.

https://github.com/nextflow-io/nextflow/blob/master/plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerClient.groovy#L476-L480
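The walk-and-match idea could be sketched like this (a hypothetical sketch, not Nextflow code; `TowerClient#onFilePublish` is only referenced in a comment, and the glob handling is simplified to the file name):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class ReportMatcher {
    public static void main(String[] args) throws IOException {
        // Stand-in for a published output directory
        Path outDir = Files.createTempDirectory("my_out_dir");
        Files.writeString(outDir.resolve("data.bam"), "");
        Files.writeString(outDir.resolve("report.html"), "<html/>");

        // Same glob syntax tower.yml uses for its report patterns
        PathMatcher html = FileSystems.getDefault().getPathMatcher("glob:*.html");

        try (Stream<Path> files = Files.walk(outDir)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> html.matches(p.getFileName()))
                 // here the real code would call TowerClient#onFilePublish(p)
                 .forEach(p -> System.out.println("report: " + p.getFileName()));
        }
    }
}
```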

@pditommaso
Member

Well, we could try to traverse the directory to check the ETag file by file. We are doing something similar here with file sizes:

```groovy
// the file must have the same size. this is needed
// to prevent re-using broken files left by a previous interrupted download
final attrs = Files.readAttributes(source, BasicFileAttributes)
final same = attrs.isDirectory()
    ? checkDirIntegrity0(source, target)
    : attrs.size() == Files.size(target)
```
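A directory-integrity walk along those lines might look like the following sketch (file size stands in for a per-file ETag comparison; this is not Nextflow's actual checkDirIntegrity0, and the names are hypothetical):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class DirCompare {
    // Returns true when every regular file under `source` exists under
    // `target` with the same size (stand-in for a per-file ETag check).
    static boolean isInSync(Path source, Path target) throws IOException {
        try (Stream<Path> files = Files.walk(source)) {
            return files
                .filter(Files::isRegularFile)
                .allMatch(src -> {
                    Path dst = target.resolve(source.relativize(src).toString());
                    try {
                        return Files.isRegularFile(dst)
                            && Files.size(src) == Files.size(dst);
                    } catch (IOException e) {
                        return false; // unreadable destination counts as stale
                    }
                });
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("src");
        Path dst = Files.createTempDirectory("dst");
        Files.writeString(src.resolve("report.html"), "<html/>");
        Files.writeString(dst.resolve("report.html"), "<html/>");
        System.out.println(isInSync(src, dst)); // same sizes -> true
        Files.writeString(dst.resolve("report.html"), "tru");
        System.out.println(isInSync(src, dst)); // size mismatch -> false
    }
}
```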


stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 17, 2024
@bentsherman bentsherman removed the stale label Mar 18, 2024