
So how many GPUs are on a Perlmutter GPU node again? #851

Open
Baharis opened this issue Mar 3, 2023 · 9 comments

@Baharis
Contributor

Baharis commented Mar 3, 2023

A Perlmutter GPU node features 1x AMD EPYC 7763 CPU and 4x NVIDIA A100 GPUs (link). Therefore, it would be reasonable to assume that, when running scripts which utilize CUDA or KOKKOS, the environment variable CCTBX_GPUS_PER_NODE should be set to 4. To my surprise, I discovered today that setting it to anything but 1 causes a CUDA assertion error:

`GPUassert: invalid device ordinal /global/cfs/cdirs/m3562/users/dtchon/p20231/alcc-recipes2/cctbx/modules/cctbx_project/simtbx/diffBragg/src/diffBraggCUDA.cu 70`

This might be intended behavior, but I found it confusing. Following @JBlaschke's suggestion, I opened this issue to discuss it.

The issue can be reproduced by running the following file: /global/cfs/cdirs/m3562/users/dtchon/p20231/common/ensemble1/SPREAD/8mosaic/debug/mosaic_lastfiles.sh on a Perlmutter GPU interactive node: salloc -N 1 -J mosaic_int -A m3562_g -C gpu --qos interactive -t 15. If you don't have access or a cctbx installation including the psii_spread workers, the relevant portion of the file is essentially:

export DIFFBRAGG_USE_CUDA=1
export CUDA_LAUNCH_BLOCKING=1
export NUMEXPR_MAX_THREADS=128
export SLURM_CPU_BIND=cores
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export CCTBX_GPUS_PER_NODE=2
srun -n 32 -c 4 --ntasks-per-gpu=8 cctbx.xfel.merge trial8.phil
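
For what it's worth, "invalid device ordinal" is the CUDA error raised when a process asks for a device index beyond the devices it can see, so a quick way to check what each task actually sees under these flags (a diagnostic sketch, not part of the original script) would be:

# print the GPUs visible to every task under the same srun flags as above;
# if --ntasks-per-gpu=8 binds each task to a single device, any ordinal > 0 will fail
srun -n 32 -c 4 --ntasks-per-gpu=8 \
    bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"' | sort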
@dermen
Contributor

dermen commented Mar 3, 2023

Is this running on a single node? I see this line

# use with: salloc -N 1 -J mosaic_int -A m3562_g -C gpu --qos interactive -t 15 --ntasks-per-gpu=1

What if you change it to

salloc -N1 --cpus-per-gpu=8 --tasks-per-node=32 --gpus-per-node=4 -A m3562_g -t 15 -q interactive -C gpu

and then when you run your job, use

srun -N1 --tasks-per-node=32 --cpus-per-gpu=8 --gpus-per-node=4 cctbx.xfel.merge trial8.phil

I find these arguments generally work well for me.

Beyond this, I'm not readily familiar with how the flag CCTBX_GPUS_PER_NODE is used in the merging program. In a diffBragg instance, the device_Id property determines the GPU used at runtime; in this case it should be a number 0, 1, 2, or 3.

Also, can you verify that there are indeed 4 GPUs when you run nvidia-smi in your interactive session?

@Baharis
Contributor Author

Baharis commented Mar 3, 2023

I confirm nvidia-smi called in an interactive session reports 4 GPUs.

This particular cctbx.xfel.merge run uses a custom annulus worker, which calls diffBragg. It looks like setting CCTBX_GPUS_PER_NODE to more than 1 causes individual workers to look for more than one GPU, while each worker has access to only one.

srun -N1 --tasks-per-node=32 --cpus-per-gpu=8 --gpus-per-node=4 does not fail when CCTBX_GPUS_PER_NODE=2 is set. I ran some tests and it looks like either --gpus-per-node=4 or --cpus-per-gpu=8 is necessary here; if both are absent, CCTBX_GPUS_PER_NODE cannot be > 1. This makes sense, since either of these is needed to deduce that 4 GPUs exist. However, I successfully used the original script to engage all 128 hyperthreads on a GPU node with CCTBX_GPUS_PER_NODE=1, which makes me wonder even more about the role of this individual parameter. Maybe it is simply not necessary if every instance of diffBragg utilizes only 1 GPU, as is the case here?

On a side note, according to top, srun -N1 --tasks-per-node=32 --cpus-per-gpu=8 --gpus-per-node=4 uses only 50% of the CPUs (#0–31 and #64–95). After adding -c 4, all 128 hyperthreads (64 physical cores) seem to be used. It baffles me, though, how a merge worker can utilize two different physical cores at the same time; I thought workers were single-threaded.

@dermen
Contributor

dermen commented Mar 3, 2023

If

srun -N1 --tasks-per-node=32 --cpus-per-gpu=8 --gpus-per-node=4

then I would expect you to set CCTBX_GPUS_PER_NODE=4. Does that work? You can ssh into your interactive node from a different terminal and call nvidia-smi to see how many GPUs are in use:

dermen@perlmutter:login27:~> salloc -N1 --cpus-per-gpu=8 --tasks-per-node=32 --gpus-per-node=4 -A m4326_g -t 15 -q interactive -C gpu
salloc: Pending job allocation 5886829
salloc: job 5886829 queued and waiting for resources
salloc: job 5886829 has been allocated resources
salloc: Granted job allocation 5886829
salloc: Waiting for resource configuration
salloc: Nodes nid001016 are ready for job
# launch your job utilizing all 4 GPUs
dermen@nid001016:~>  srun  ... .. etc

# and then I think the following should work:
# Open a new Perlmutter terminal and do
ssh dermen@nid001016
nvidia-smi
# and observe how many GPUs have processes running... should be 4!

Also, as far as using the virtual threads goes, some libraries will use them if available; it might not be the merge code specifically, but some lower-level code.

Where is the code for the annulus worker calling diffBragg? Is this in LS49?

@Baharis
Contributor Author

Baharis commented Mar 3, 2023

I had nothing particular in mind when setting CCTBX_GPUS_PER_NODE=2 instead of =4 – I was just testing anything > 1.

That's a fair question about GPU use: I looked at CPUs only and somehow neglected to check the GPUs themselves. I have just tested both approaches. My version, srun -n 32 -c 4 --ntasks-per-gpu=8 with CCTBX_GPUS_PER_NODE=1, uses all 128 logical cores and all 4 GPUs, according to the output of top 1 and watch nvidia-smi. Your suggestion, srun -n 32 --cpus-per-gpu=8 --gpus-per-node=4 with CCTBX_GPUS_PER_NODE=4, also utilizes all 4 GPUs. However, the same command with CCTBX_GPUS_PER_NODE=2 or =1 uses only half or a quarter of the GPUs, as could be expected (but still only half of the CPUs!). So while both setting ntasks-per-gpu and setting cpus-per-gpu/gpus-per-node can give the same outcome, some underlying logic in the GPU communication is clearly different.
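
If it helps pin down that difference, the visibility sets can be printed directly (a sketch under the same allocation; I have not verified the exact output):

# if tasks are bound to individual GPUs, each line should list a single device
srun -n 32 -c 4 --ntasks-per-gpu=8 bash -c 'echo $CUDA_VISIBLE_DEVICES' | sort | uniq -c
# with a node-level GPU allocation, each line should list all four devices
srun -n 32 --cpus-per-gpu=8 --gpus-per-node=4 bash -c 'echo $CUDA_VISIBLE_DEVICES' | sort | uniq -c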

The code for the annulus worker is in the psii_spread repository. A workflow that utilizes it is in the LS49 SPREAD README, steps 8–11.

@dermen
Contributor

dermen commented Mar 3, 2023

I think the default -c is 2 (just guessing); could you then add -c 4 to my command to use 128 threads?

@dermen
Contributor

dermen commented Mar 3, 2023

Oh, and if your code is using numexpr, then I believe that would be what's utilizing the hyperthreading.
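
A crude way to check that would be to count threads per process on the compute node while the job is running (hypothetical commands, not taken from the actual run):

# nlwp is the number of threads per process; counts well above 1 per merge rank
# would point at a threaded library such as numexpr or OpenMP rather than the
# merge worker itself
ssh <compute-node>   # e.g. the node name reported by salloc
ps -u $USER -o pid,nlwp,args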

@Baharis
Contributor Author

Baharis commented Mar 3, 2023

@dermen I think the default is -c 1, in which case physical processor #n, which hosts virtual CPUs (hyperthreads) #n and #n+64, uses all its capacity to perform its task, thus consuming the resources formally assigned to virtual CPUs #n and #n+64. Yesterday I compared 64 and 128 tasks per GPU node; there was virtually no improvement when going from -c 1 to -c 2.

Adding -c 4 to your command causes it to use 128 threads as expected.
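
For completeness, Slurm itself can report the resulting binding (a sketch, assuming the same interactive allocation):

# --cpu-bind=verbose makes every task print its CPU mask at launch; with -c 4 the
# 32 tasks should together cover all 128 logical CPUs (64 physical cores)
srun -n 32 -c 4 --ntasks-per-gpu=8 --cpu-bind=verbose true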

@Baharis
Contributor Author

Baharis commented Mar 13, 2023

Closing, because it seems the meaning of CCTBX_GPUS_PER_NODE depends heavily on context and there is no interest in further discussion.

@Baharis Baharis closed this as completed Mar 13, 2023
@JBlaschke
Contributor

Can you reopen, @Baharis? I want to pick this up again, maybe next week. Last week's PM outage and this week's travel upset my schedule a bit.
