Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overwrite published outputs only if they are stale #4729

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

bentsherman
Copy link
Member

Close #3372

This PR builds on #3933 , which ensures that files in a directory are explicitly published. Given this behavior, we can then compare the source and target checksums to decide whether to publish a file.

  • If the checksums match then the copy can be skipped
  • If the checksums don't match, then the target should always be overwritten, even if the task was cached, because if the previous run was terminated and publishing cancelled, the published file might be incomplete.

The checksum code is ripped from #3802 , as it is also needed for task provenance and the resumability for automatic cleanup (#3849). It tries to use a pre-computed checksum (such as ETag for cloud storage paths) where possible to avoid scanning the file contents.

The only qualm I have right now is that the checksum code doesn't use the ETag for Google cloud storage paths. This is because we use the built-in file system provider from the Google SDK rather than our own. Maybe we can just extend this FS provider and register it over the existing one, but I'm not sure yet how to do it.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Copy link

netlify bot commented Feb 9, 2024

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit 9edc796
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/664bbce9dfbe6900083d22f7
😎 Deploy Preview https://deploy-preview-4729--nextflow-docs-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Comment on lines 147 to 158
@Override
String getETag() {
if( metadata == null ) {
final googleOpts = GoogleOpts.create(Global.session as Session)
final client = StorageOptions.newBuilder()
.setProjectId(googleOpts.projectId)
.build()
.getService()
metadata = client.get(target.bucket(), target.toString())
}
return metadata?.getMd5ToHexString()
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a custom NIO filesystem for Google cloud storage which just delegates to the existing one, but allows us to add custom behavior like this ETag support.

It turns out that the ETag returned by the Java SDK is a CRC32 instead of MD5. The MD5 is also available through this storage API, but I haven't gotten it to work yet. The metadata is always null, even though I provided the projectId and I have the credentials file. Need to investigate further when I have more time

@bentsherman
Copy link
Member Author

Some notes from discussion today (related to workflow output DSL):

  • checksum verification should be optional as it can be very expensive for non-object storage
  • should also have a lightweight alternative that simply checks the last modified timestamp and file size (similar to cache directive)

@bentsherman
Copy link
Member Author

bentsherman commented May 17, 2024

This PR is mostly covered by the workflow output definition #4784 . The only remaining piece is the ETagAware interface which uses object metadata instead of computing the checksum, which makes the deep overwrite mode more feasible.

I propose that we merge that piece and leave out the Google NIO filesystem for now since I never quite got it to work. I can finish it in a separate PR. Then we can move this PR forward for S3 and Azure.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validation of published outputs
1 participant