
So how many GPUs are on a Perlmutter GPU node again? #851

Open
Baharis opened this issue Mar 3, 2023 · 9 comments

@Baharis
Contributor

Baharis commented Mar 3, 2023

A Perlmutter GPU node features 1x AMD EPYC 7763 CPU and 4x NVIDIA A100 GPUs (link). Therefore, it would be reasonable to assume that, when running scripts which utilize CUDA or KOKKOS, the environment variable CCTBX_GPUS_PER_NODE should be set to 4. To my surprise, I discovered today that setting it to anything but 1 causes a CUDA assertion error:

`GPUassert: invalid device ordinal /global/cfs/cdirs/m3562/users/dtchon/p20231/alcc-recipes2/cctbx/modules/cctbx_project/simtbx/diffBragg/src/diffBraggCUDA.cu 70`

This might be intended behavior, but I found it confusing. Following @JBlaschke's suggestion, I opened this issue to discuss it.

The issue can be reproduced by running the following file: /global/cfs/cdirs/m3562/users/dtchon/p20231/common/ensemble1/SPREAD/8mosaic/debug/mosaic_lastfiles.sh on a Perlmutter GPU interactive node: salloc -N 1 -J mosaic_int -A m3562_g -C gpu --qos interactive -t 15. If you don't have access or a cctbx installation including the psii_spread workers, the relevant portion of the file is essentially:

export DIFFBRAGG_USE_CUDA=1
export CUDA_LAUNCH_BLOCKING=1
export NUMEXPR_MAX_THREADS=128
export SLURM_CPU_BIND=cores
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export CCTBX_GPUS_PER_NODE=2
srun -n 32 -c 4 --ntasks-per-gpu=8 cctbx.xfel.merge trial8.phil
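
For what it's worth, "invalid device ordinal" is the CUDA error raised when a process asks for a device index beyond the devices it can see, so a quick way to check what each task actually sees under these flags (a diagnostic sketch, not part of the original script) would be:

# print the GPUs visible to every task under the same srun flags as above;
# if --ntasks-per-gpu=8 binds each task to a single device, any ordinal > 0 will fail
srun -n 32 -c 4 --ntasks-per-gpu=8 \
    bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"' | sort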
@dermen
Contributor

dermen commented Mar 3, 2023

Is this running on a single node? I see this line

# use with: salloc -N 1 -J mosaic_int -A m3562_g -C gpu --qos interactive -t 15 --ntasks-per-gpu=1

What if you change it to

salloc -N1 --cpus-per-gpu=8 --tasks-per-node=32 --gpus-per-node=4 -A m3562_g -t 15 -q interactive -C gpu

and then when you run your job, use

srun -N1 --tasks-per-node=32 --cpus-per-gpu=8 --gpus-per-node=4 cctbx.xfel.merge trial8.phil

I find these arguments generally work well for me.

Beyond this, I'm not readily familiar with how the flag CCTBX_GPUS_PER_NODE is used in the merging program. In a diffBragg instance, the device_Id property determines the GPU used at runtime; in this case it should be a number 0, 1, 2, or 3.

Also, can you verify that there are indeed 4 GPUs when you run nvidia-smi in your interactive session?

@Baharis
Contributor Author

Baharis commented Mar 3, 2023

I confirm nvidia-smi called in an interactive session reports 4 GPUs.

This particular cctbx.xfel.merge run uses a custom annulus worker, which calls diffBragg. It looks like setting CCTBX_GPUS_PER_NODE to more than 1 causes individual workers to look for more than one GPU, while each worker has access to only one.

srun -N1 --tasks-per-node=32 --cpus-per-gpu=8 --gpus-per-node=4 does not fail when CCTBX_GPUS_PER_NODE=2 is set. I ran some tests and it looks like either --gpus-per-node=4 or --cpus-per-gpu=8 is necessary here; if both are absent, CCTBX_GPUS_PER_NODE cannot be > 1. This makes sense, since either of these is needed to deduce that 4 GPUs exist. However, I successfully used the original script to engage all 128 hyperthreads on a GPU node with CCTBX_GPUS_PER_NODE=1, which makes me wonder even more about the role of this individual parameter. Maybe it is simply not necessary if every instance of diffBragg utilizes only 1 GPU, as is the case here?

On a side note, according to top, srun -N1 --tasks-per-node=32 --cpus-per-gpu=8 --gpus-per-node=4 uses only 50% of the CPUs (#0–31 and #64–95). After adding -c 4, all 128 hyperthreads (64 physical cores) seem to be used. It baffles me, though, how a merge worker can utilize two different physical cores at the same time; I thought workers were single-threaded.

@dermen
Contributor

dermen commented Mar 3, 2023

If

srun -N1 --tasks-per-node=32 --cpus-per-gpu=8 --gpus-per-node=4

then I would expect you to set CCTBX_GPUS_PER_NODE=4. Does that work? You can ssh into your interactive node from a different terminal and call nvidia-smi to see how many GPUs are in use:

dermen@perlmutter:login27:~> salloc -N1 --cpus-per-gpu=8 --tasks-per-node=32 --gpus-per-node=4 -A m4326_g -t 15 -q interactive -C gpu
salloc: Pending job allocation 5886829
salloc: job 5886829 queued and waiting for resources
salloc: job 5886829 has been allocated resources
salloc: Granted job allocation 5886829
salloc: Waiting for resource configuration
salloc: Nodes nid001016 are ready for job
# launch your job utilizing all 4 GPUs
dermen@nid001016:~>  srun  ... .. etc

# and then I think the following should work:
# Open a new Perlmutter terminal and do
ssh dermen@nid001016
nvidia-smi
# and observe how many GPUs have processes running... should be 4!

Also, as far as using the virtual threads goes, some libraries will use them if available; it might not be the merge code specifically, but some lower-level code.

Where is the code for the annulus worker calling diffBragg? Is this in LS49?

@Baharis
Contributor Author

Baharis commented Mar 3, 2023

I had nothing particular in mind when setting CCTBX_GPUS_PER_NODE=2 instead of =4 – I was just testing anything > 1.

That's a fair question about GPU use: I looked at CPUs only and somehow neglected to check the GPUs themselves. I have just tested both approaches. My version, srun -n 32 -c 4 --ntasks-per-gpu=8 with CCTBX_GPUS_PER_NODE=1, uses all 128 logical cores and all 4 GPUs, according to the output of top 1 and watch nvidia-smi. Your suggestion, srun -n 32 --cpus-per-gpu=8 --gpus-per-node=4 with CCTBX_GPUS_PER_NODE=4, also utilizes all 4 GPUs. However, the same command with CCTBX_GPUS_PER_NODE=2 or =1 uses only half or a quarter of the GPUs, as could be expected (but still only half of the CPUs!). So while both setting ntasks-per-gpu and setting cpus-per-gpu/gpus-per-node can give the same outcome, some underlying logic in the GPU communication is clearly different.
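
If it helps pin down that difference, the visibility sets can be printed directly (a sketch under the same allocation; I have not verified the exact output):

# if tasks are bound to individual GPUs, each line should list a single device
srun -n 32 -c 4 --ntasks-per-gpu=8 bash -c 'echo $CUDA_VISIBLE_DEVICES' | sort | uniq -c
# with a node-level GPU allocation, each line should list all four devices
srun -n 32 --cpus-per-gpu=8 --gpus-per-node=4 bash -c 'echo $CUDA_VISIBLE_DEVICES' | sort | uniq -c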

The code for the annulus worker is in the psii_spread repository. A workflow that utilizes it is in the LS49 SPREAD README, steps 8–11.

@dermen
Contributor

dermen commented Mar 3, 2023

I think the default -c is 2 (just guessing); could you then add -c 4 to my command to use 128 threads?

@dermen
Contributor

dermen commented Mar 3, 2023

Oh, and if your code is using numexpr, then I believe that would be what's utilizing the hyperthreading.
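
A crude way to check that would be to count threads per process on the compute node while the job is running (hypothetical commands, not taken from the actual run):

# nlwp is the number of threads per process; counts well above 1 per merge rank
# would point at a threaded library such as numexpr or OpenMP rather than the
# merge worker itself
ssh <compute-node>   # e.g. the node name reported by salloc
ps -u $USER -o pid,nlwp,args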

@Baharis
Contributor Author

Baharis commented Mar 3, 2023

@dermen I think the default is -c 1, in which case physical processor #n, which hosts virtual CPUs (hyperthreads) #n and #n+64, uses all its capacity to perform its task, thus consuming the resources formally assigned to virtual CPUs #n and #n+64. Yesterday I compared 64 and 128 tasks per GPU node; there was virtually no improvement when going from -c 1 to -c 2.

Adding -c 4 to your command causes it to use 128 threads as expected.
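
For completeness, Slurm itself can report the resulting binding (a sketch, assuming the same interactive allocation):

# --cpu-bind=verbose makes every task print its CPU mask at launch; with -c 4 the
# 32 tasks should together cover all 128 logical CPUs (64 physical cores)
srun -n 32 -c 4 --ntasks-per-gpu=8 --cpu-bind=verbose true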

@Baharis
Contributor Author

Baharis commented Mar 13, 2023

Closing, because it seems the meaning of CCTBX_GPUS_PER_NODE depends heavily on context and there is no interest in further discussion.

@Baharis Baharis closed this as completed Mar 13, 2023
@JBlaschke
Contributor

Can you reopen, @Baharis? I want to pick this up again, maybe next week. Last week's PM outage and this week's travel upset my schedule a bit.
