Azure Batch: Add disk size to slots calculation #4920

Open
adamrtalbot opened this issue Apr 16, 2024 · 7 comments
Comments

@adamrtalbot
Collaborator

New feature

When using Azure Batch, Nextflow will reject a process if it has too many CPUs for the worker machine.

Caused by:
  Process requirement exceeds available CPUs -- req: 32; avail: 10

However, Azure Batch VMs come with a fixed-size disk, and it's common for a Nextflow process to run out of storage. There are many, many issues about this on the Nextflow Slack! The typical workaround is to increase the number of CPUs an individual process requires. It would be better to support the disk directive so we can directly check that the VM has a large enough disk.

Although we can't enforce it properly (i.e. make sure tasks are only assigned to a VM with enough space), being able to prevent users trying to run a task on a machine which is too small would catch some of the issues.

Usage scenario

When running on Azure Batch, raise an error if a task is assigned to a queue whose VMs do not have sufficient storage.

Suggest implementation

process HELLO {
    disk 12.TB

    """
    echo Hello
    """
}

workflow {
    HELLO()
}

Caused by:
  Process requirement exceeds available storage -- req: 12 TB; avail: 1 TB

@bentsherman
Member

how would you determine the available disk storage for a given VM / queue?

@adamrtalbot
Collaborator Author

I'm thinking way simpler than that.

If process.disk is '1024GB' and the VM disk size is 512GB (we already know the VM disk size), then prevent the job from being submitted to that queue. It's not foolproof, but it would catch errors early.
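
A minimal sketch of that early check, to make the idea concrete. The class `DiskCheck` and its `validate` method are hypothetical and not part of nf-azure, and sizes are passed as plain byte counts so the snippet runs without Nextflow's own classes:

```groovy
// Hypothetical helper, for illustration only (not existing nf-azure code).
class DiskCheck {
    static final long GB = 1024L * 1024L * 1024L

    // Throws if the task asks for more disk than the pool's VM type provides.
    // A null request means the process has no disk directive, so nothing is checked.
    static void validate(Long reqBytes, long vmDiskBytes) {
        if( reqBytes != null && reqBytes > vmDiskBytes ) {
            String msg = "Process requirement exceeds available storage -- " +
                "req: ${reqBytes.intdiv(GB)} GB; avail: ${vmDiskBytes.intdiv(GB)} GB"
            throw new IllegalArgumentException(msg)
        }
    }
}

// process.disk = '1024GB' against a VM with a 512 GB disk is rejected up front
try {
    DiskCheck.validate(1024 * DiskCheck.GB, 512 * DiskCheck.GB)
}
catch( IllegalArgumentException e ) {
    println e.message
}
```

The check only compares the request with the VM's total disk. It does not account for space already used by other tasks on the same node, which is why it can catch obvious mismatches but cannot guarantee a task will fit.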

@bentsherman
Member

I see, so if the user specifies a machine type or a queue with an implied machine type then we can use disk to validate the requirement, but if the user specifies cpus and memory with auto-pools enabled then we could actually use disk to narrow the query of valid machine types from this list. Does that sound right?

@adamrtalbot
Collaborator Author

adamrtalbot commented Apr 22, 2024

> I see, so if the user specifies a machine type or a queue with an implied machine type then we can use disk to validate the requirement, but if the user specifies cpus and memory with auto-pools enabled then we could actually use disk to narrow the query of valid machine types from this list. Does that sound right?

Correct. It's not perfect, but it might help catch a few mistakes.

We would need to add the disk directive to these three places:

final vmType = guessBestVm(loc, cpus, mem, type)

AzVmType guessBestVm(String location, int cpus, MemoryUnit mem, String family) {
    log.debug "[AZURE BATCH] guessing best VM given location=$location; cpus=$cpus; mem=$mem; family=$family"
    if( !family.contains('*') && !family.contains('?') )
        return findBestVm(location, cpus, mem, family)
    // well this is a quite heuristic tentative to find a bigger instance to accommodate more tasks
    AzVmType result=null
    if( cpus<=4 ) {
        result = findBestVm(location, cpus*4, mem!=null ? mem*4 : null, family)
        if( !result )
            result = findBestVm(location, cpus*2, mem!=null ? mem*2 : null, family)
    }
    else if( cpus <=8 ) {
        result = findBestVm(location, cpus*2, mem!=null ? mem*2 : null, family)
    }
    if( !result )
        result = findBestVm(location, cpus, mem, family)
    return result
}
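
For the auto-pool case, narrowing the candidate machine types by disk could look roughly like the sketch below. It is an illustration under assumptions, not the existing `findBestVm` logic: the class `VmPicker`, the map-based catalogue, and the VM names and sizes are all made up.

```groovy
// Hypothetical selection helper; the real findBestVm works on AzVmType objects.
class VmPicker {
    // Keep only VM types that satisfy cpus, memory and disk, then pick the
    // smallest one (ordered by cpus, then memory, then disk). Returns null if none fit.
    static Map findBestVm(List<Map> types, int cpus, int memGb, int diskGb) {
        final candidates = types.findAll {
            it.cpus >= cpus && it.memGb >= memGb && it.diskGb >= diskGb
        }
        return candidates.min { a, b ->
            a.cpus <=> b.cpus ?: a.memGb <=> b.memGb ?: a.diskGb <=> b.diskGb
        }
    }
}

// Made-up catalogue, for illustration only
def vmTypes = [
    [name: 'vm-a', cpus: 4,  memGb: 16, diskGb: 128],
    [name: 'vm-b', cpus: 4,  memGb: 16, diskGb: 512],
    [name: 'vm-c', cpus: 16, memGb: 64, diskGb: 1024],
]

// Same cpus and memory, but the disk requirement rules out 'vm-a'
println VmPicker.findBestVm(vmTypes, 4, 16, 256)?.name
```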

@adamrtalbot
Collaborator Author

Better idea: we just treat the disk size as another input to the compute slots calculation. E.g., a job that requires 1 CPU, 1 GB of memory and 128 GB of storage on a machine with 16 cores, 64 GB of memory and 256 GB of storage would currently occupy 1 slot. With this change it would occupy 8 of the 16 slots, because it requests half of the disk. See relevant code here:

protected int computeSlots(int cpus, MemoryUnit mem, int vmCpus, MemoryUnit vmMem) {
    // cpus requested should not exceed max cpus avail
    final cpuSlots = Math.min(cpus, vmCpus) as int
    if( !mem || !vmMem )
        return cpuSlots
    // mem requested should not exceed max mem avail
    final vmMemGb = vmMem.mega / _1GB as float
    final memGb = mem.mega / _1GB as float
    final mem0 = Math.min(memGb, vmMemGb)
    return Math.max(cpuSlots, memSlots(mem0, vmMemGb, vmCpus))
}

protected int computeSlots(TaskRun task, AzVmPoolSpec pool) {
    computeSlots(
        task.config.getCpus(),
        task.config.getMemory(),
        pool.vmType.numberOfCores,
        pool.vmType.memory )
}

protected int memSlots(float memGb, float vmMemGb, int vmCpus) {
    BigDecimal result = memGb / (vmMemGb / vmCpus)
    result.setScale(0, RoundingMode.UP).intValue()
}
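
A rough sketch of how disk could enter the same calculation, reproducing the arithmetic of the example above. This is not the actual nf-azure change: the class `SlotsSketch`, the `resourceSlots` helper and the plain-GB parameters are assumptions made so the snippet runs on its own.

```groovy
// Hypothetical standalone version of the slots calculation, sizes in GB.
class SlotsSketch {
    // Slots needed for one resource: the requested share of the VM, scaled to
    // the VM's slot count (one slot per core) and rounded up.
    static int resourceSlots(double requested, double vmTotal, int vmCpus) {
        final double capped = Math.min(requested, vmTotal)
        return (int) Math.ceil(capped * vmCpus / vmTotal)
    }

    // A task occupies as many slots as its most demanding resource requires.
    // A null memGb or diskGb means that directive was not set and is ignored.
    static int computeSlots(int cpus, Double memGb, Double diskGb,
                            int vmCpus, double vmMemGb, double vmDiskGb) {
        int slots = Math.min(cpus, vmCpus)
        if( memGb != null )
            slots = Math.max(slots, resourceSlots(memGb, vmMemGb, vmCpus))
        if( diskGb != null )
            slots = Math.max(slots, resourceSlots(diskGb, vmDiskGb, vmCpus))
        return slots
    }
}

// 1 cpu, 1 GB memory, 128 GB disk on a 16-core VM with 64 GB memory and 256 GB disk
println SlotsSketch.computeSlots(1, 1d, 128d, 16, 64d, 256d)   // 8
// Same task without a disk directive
println SlotsSketch.computeSlots(1, 1d, null, 16, 64d, 256d)   // 1
```

Taking the maximum across resources means at most two such tasks would share the example VM, so together they could not request more than its 256 GB disk.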

@bentsherman changed the title from "Azure Batch should support disk directive" to "Azure Batch: Add disk size to slots calculation" on May 16, 2024
@bentsherman
Member

Sounds good to me. Care to give it a go? 😄 You have everything you need from the TaskRun and AzVmPoolSpec which are provided to the slots function

@adamrtalbot
Collaborator Author

Post summit. Maybe on the plane 😆
