
[Azure Batch] Cannot use a different distro as the host because of the forced mount of /etc/ssl/certs and /etc/pki #4828

Open
luanjot opened this issue Mar 19, 2024 · 11 comments · May be fixed by #4888

Comments

@luanjot

luanjot commented Mar 19, 2024

Bug report

Expected behavior and actual behavior

When using a CentOS/RedHat container on an Ubuntu host, the CentOS container should run normally.

Steps to reproduce the problem

Start an Ubuntu Azure Batch instance and run:

sudo docker run -ti -v /etc/ssl/certs:/etc/ssl/certs:ro -v /etc/pki:/etc/pki:ro

Program output

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:75: mounting "/etc/ssl/certs" to rootfs at "/etc/ssl/certs" caused: mkdir /mnt/resource/docker/overlay2/1639d1e2a072f2b3679abf18c69b1789b4339cd7deedd187786191ed0a52a6a4/merged/etc/pki/tls: read-only file system: unknown.
ERRO[0000] error waiting for container: context canceled

Environment

  • Nextflow version: 23.10.1
  • Operating system: Ubuntu Azure default VM

Additional context

This seems to be caused by a mount that is hardcoded and not configurable here:
https://github.com/nextflow-io/nextflow/blob/master/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy#L398

Would it be possible to make it configurable?
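
For illustration, a minimal sketch of what an opt-out could look like in nextflow.config. The mountHostCerts option name is hypothetical and does not exist in Nextflow; it only shows the kind of switch being asked for:

azure {
    batch {
        // hypothetical flag (does not exist today): skip the forced host
        // certificate mounts of /etc/ssl/certs and /etc/pki
        mountHostCerts = false
    }
}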

@adamrtalbot
Collaborator

Steps to reproduce:

main.nf:

process HELLO {
    container "docker.io/redhat/ubi8:8.9"

    output:
        stdout
    
    script:
    """
    echo "Hello, World!"
    """
}

workflow {
    HELLO()
}

nextflow.config:

process {
    executor = 'azurebatch'
}

azure {
    storage {
        accountName = "$AZURE_STORAGE_ACCOUNT_NAME"
        accountKey = "$AZURE_STORAGE_ACCOUNT_KEY"
    }
    batch {
        location = "$AZURE_LOCATION"
        accountName = "$AZURE_BATCH_ACCOUNT_NAME"
        accountKey = "$AZURE_BATCH_ACCOUNT_KEY"
        autoPoolMode = true
        deletePoolsOnCompletion = true
    }
}

Command:

nextflow run .

Output:

> nextflow run .
N E X T F L O W  ~  version 23.10.1
Launching `./main.nf` [backstabbing_visvesvaraya] DSL2 - revision: 11eb8eab2e
[66/716836] Submitted process > HELLO
ERROR ~ Error executing process > 'HELLO'

Caused by:
  The task exited with an exit code representing a failure

Command executed:

  echo "Hello, World!"

Command exit status:
  -

Command output:
  (empty)

Work dir:
  az://redacted/66/71683610e687c177c4fc78523019be

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

@pditommaso
Member

I wonder why it's needed to mount the host certs. There is no such need with other cloud providers.

@luanjot
Author

luanjot commented Apr 5, 2024

The code comment says it is because of azcopy, but in my experience this is not really true:

        // mount host certificates otherwise `azcopy` fails
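
For context, a rough Groovy sketch of how such a hardcoded mount could instead be guarded by a flag; the method and parameter names below are assumptions for illustration, not the actual AzBatchService.groovy implementation:

// Sketch only: names are hypothetical, not the real AzBatchService code.
// Builds the extra docker run options; the certificate mounts would be added
// only when explicitly requested instead of unconditionally.
String hostCertMountOptions(boolean mountHostCerts) {
    if( mountHostCerts )
        // mount host certificates otherwise `azcopy` fails
        return '-v /etc/ssl/certs:/etc/ssl/certs:ro -v /etc/pki:/etc/pki:ro'
    return ''
}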

@pditommaso
Member

It's worth trying to see whether it can be removed.

@adamrtalbot
Collaborator

Works fine without it. Opening a PR now.

@adamrtalbot adamrtalbot linked a pull request Apr 5, 2024 that will close this issue
@pditommaso
Member

@adamrtalbot my understanding is that it's not working without mounting the host cert, right?

@adamrtalbot
Collaborator

It seems to be working fine, except for Fusion:

~ Test 'fusion-symlink.nf' run failed
   + '[' -z *** ']'
   + echo initial run
   initial run
   + /home/runner/work/nextflow/nextflow/nextflow -q run ../../fusion-symlink.nf -c .config
   + /home/runner/work/nextflow/nextflow/nextflow fs cp s3://nextflow-ci/work/ci-test/fusion-symlink/data.txt data.txt
   + cmp data.txt .expected
   + echo resumed run
   resumed run
   + /home/runner/work/nextflow/nextflow/nextflow -q run ../../fusion-symlink.nf -c .config -resume
   + /home/runner/work/nextflow/nextflow/nextflow fs cp s3://nextflow-ci/work/ci-test/fusion-symlink/data.txt data.txt
   ERROR ~ Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 58TZGJ578BCPVKN9; S3 Extended Request ID: 9DdNtqSv/OriaOZxmyDLO0VZesrxwlCEtM+1vW60TfIW/fNtaLdNMukn9Ghs2Wq87rOj3EvLyaMMMAK1YWa67Q==; Proxy: null)

From: https://github.com/nextflow-io/nextflow/actions/runs/8568846304/job/23483847575?pr=4888

@pditommaso
Member

Even more weird. Fusion has its own certificates and should not depend on the distro certs. Any clue, @jordeu?

@adamrtalbot
Collaborator

Re-ran and that error disappeared. Just the missing file now.

@jordeu
Collaborator

jordeu commented Apr 12, 2024

@adamrtalbot can I help? What is this missing file problem?

@adamrtalbot
Collaborator

On PR #4888 I removed the SSL certs being passed to the Docker container, because it didn't work between Ubuntu hosts and CentOS images. However, it can't find the output file after completion. I thought it was Fusion, but that error appears to be unrelated and has since started working fine, so it is not Fusion.

https://github.com/nextflow-io/nextflow/actions/runs/8615767323/job/23612362775?pr=4888

5 participants