gcm_setup and gcm_run.j need an "architecture" independent option for o-server #479

Open
bena-nasa opened this issue Jun 20, 2023 · 0 comments
Labels
enhancement New feature or request


bena-nasa commented Jun 20, 2023

The issue:
Right now, when you run gcm_setup at NCCS, it explicitly asks which architecture you wish to run on. This is because we have a predefined number of o-server nodes we would like to use, so the total number of tasks needed is a function of the architecture. The bottom line is that your gcm_run.j SLURM script will specify an architecture (--constraint), a number of nodes (--nodes), and the cores per node (--ntasks-per-node). This limits you to running on the architecture you asked for, even if resources are available on a different architecture, which of course is not optimal.

As a concrete example, consider c720 running on a layout that requires 3456 cores for the model, with 9 o-server nodes requested.

I will use the following abbreviations:
N_M = number of model nodes
N_O = number of o-server nodes
Note that when 3456 does not divide evenly by the cores per node, I use the ceiling for N_M
On Cascade Lake (45 cores per node)
77 N_M + 9 N_O = 3870 cores on 86 nodes
On Skylake (40 cores per node)
87 N_M + 9 N_O = 3840 cores on 96 nodes
On Haswell (28 cores per node)
124 N_M + 9 N_O = 3724 cores on 133 nodes
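The per-architecture totals above can be reproduced with a short sketch (the `ARCHS` table just restates the cores-per-node figures quoted in this issue):

```python
import math

MODEL_CORES = 3456   # c720 layout from the example above
OSERVER_NODES = 9    # fixed o-server node request

# cores per node for each NCCS architecture, as quoted in this issue
ARCHS = {"Cascade Lake": 45, "Skylake": 40, "Haswell": 28}

totals = {}
for name, cores_per_node in ARCHS.items():
    n_m = math.ceil(MODEL_CORES / cores_per_node)  # model nodes, rounded up
    nodes = n_m + OSERVER_NODES                    # total nodes to request
    totals[name] = (n_m, nodes, nodes * cores_per_node)
    print(f"{name}: {n_m} N_M + {OSERVER_NODES} N_O = "
          f"{nodes * cores_per_node} cores on {nodes} nodes")
```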

Users have requested an architecture-independent configuration of the gcm_run.j script, i.e. one with no --constraint option.

After much discussion, one idea we came up with:

One possibility would be to simply request the total number of cores we want and assume that any cores left over after the model are assigned to the IO-server. The script would use a heuristic, say that the IO-server should get ~10% of the model cores, so the user job would just specify a core count.
In the above example, 3456 * 0.1 ≈ 346

So assuming 10%, you would want 3456 + 346 = 3802 cores, and that is what the script would request.
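The heuristic request itself is a one-liner (the 10% fraction is the example value from the discussion above, not a settled default):

```python
MODEL_CORES = 3456
OSERVER_FRACTION = 0.10  # heuristic: IO-server gets ~10% of model cores

oserver_cores = round(MODEL_CORES * OSERVER_FRACTION)  # 345.6 -> 346
total_request = MODEL_CORES + oserver_cores            # the core count the
print(total_request)                                   # job would request
```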

Now you get the following (rounding the node count up):
On Cascade Lake:
85 nodes = 77 N_M + 8 N_O
On Skylake:
96 nodes = 87 N_M + 9 N_O
On Haswell:
136 nodes = 124 N_M + 12 N_O
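The constraint-free split can be checked the same way: take the ceiling of the 3802-core request on each architecture, subtract the model's node need, and whatever remains goes to the o-server.

```python
import math

TOTAL_CORES = 3802   # constraint-free request (3456 model + ~10% IO-server)
MODEL_CORES = 3456

# cores per node per architecture, as quoted in this issue
alloc = {}
for name, cores_per_node in {"Cascade Lake": 45, "Skylake": 40, "Haswell": 28}.items():
    nodes = math.ceil(TOTAL_CORES / cores_per_node)   # nodes SLURM would grant
    n_m = math.ceil(MODEL_CORES / cores_per_node)     # nodes the model needs
    n_o = nodes - n_m                                 # leftover -> o-server
    alloc[name] = (nodes, n_m, n_o)
    print(f"{name}: {nodes} nodes = {n_m} N_M + {n_o} N_O")
```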

So the gcm_run.j script would detect the total number of cores and the number of nodes, compute the number of nodes needed by the model based on NX and NY, and assign whatever nodes are left over to the o-server.
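That runtime calculation can be sketched as a small helper (`split_nodes` is a hypothetical name; this assumes the model task count is NX * NY, as in GEOS layouts, and that the script can learn the granted node geometry from SLURM, e.g. via SLURM_JOB_NUM_NODES and SLURM_CPUS_ON_NODE):

```python
import math

def split_nodes(total_nodes, cores_per_node, nx, ny):
    """Split the nodes SLURM actually granted between model and o-server.

    Hypothetical helper: nx * ny is the model task count (GEOS layout
    convention); every node not needed by the model goes to the o-server.
    """
    model_nodes = math.ceil(nx * ny / cores_per_node)
    oserver_nodes = total_nodes - model_nodes
    return model_nodes, oserver_nodes

# A constraint-free 3802-core request landing on Skylake (40 cores/node,
# 96 nodes granted), with an illustrative NX=24, NY=144 layout (3456 tasks):
print(split_nodes(96, 40, 24, 144))  # -> (87, 9)
```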

The result is that the actual number of IO-server nodes varies by architecture, but the total task count remains fixed and will work without a constraint. This should provide a broadly applicable, "run anywhere at NCCS" solution for users running a standard History configuration.

bena-nasa added the enhancement (New feature or request) label Jun 20, 2023