Alltoallv transfer matrices #973

x41lakazam · 2024-05-02T16:37:11Z

Currently, in the alltoallv perftest, every rank sends the same amount of data to every other rank. This PR makes it possible to define the amount of data rank i sends to rank j using a matrix called the "transfer matrix", where cell (i, j) holds that amount. If the amount is not a multiple of the data type size, it is rounded down.
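To make the rounding rule concrete, here is a minimal sketch of how a byte amount from the matrix could be turned into an element count. The helper names (bytes_to_count, matrix_to_counts) are hypothetical and not taken from the PR; only the round-down behaviour comes from the description above.

```python
def bytes_to_count(n_bytes: int, dt_size: int) -> int:
    """Whole elements of size dt_size that fit in n_bytes (rounds down)."""
    if dt_size <= 0:
        raise ValueError("dt_size must be positive")
    return n_bytes // dt_size


def matrix_to_counts(matrix, dt_size):
    """Convert a transfer matrix in bytes into a matrix of element counts."""
    return [[bytes_to_count(cell, dt_size) for cell in row] for row in matrix]


# Example with a 4-byte data type: 10 bytes -> 2 elements, 7 bytes -> 1 element.
matrix = [[0, 10], [7, 0]]
print(matrix_to_counts(matrix, 4))  # [[0, 2], [1, 0]]
```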

Below are the details of the implementation, the output and the parameters. We call "matrix execution" (or "alltoallv execution") the execution of one alltoallv using one transfer matrix.

Implementation

Currently, ucc perftest runs N iterations for every message size. We decided to include all the alltoallv executions in each iteration. The run_single_coll_test method is responsible for running the N iterations; from now on, we call them the main iterations.
To do so, we added an inner loop to run_single_coll_test that executes alltoallv with every transfer matrix provided. Therefore, all the matrices are executed in each main iteration.
At the beginning of the test, every matrix file is read and the matrices are stored in memory. A method called coll->pre_run has been added to update the collective arguments according to the current matrix; it is run in every inner-loop iteration, before the collective is executed.
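The loop structure described here can be sketched as follows. This is a Python illustration of the control flow only, not the actual C++ code of the PR: the function name and the pre_run, run_coll and barrier callables are stand-ins, warm-up iterations are not modelled, and the barrier placement (after each execution, outside the timed region) follows the "Output interpretation" section.

```python
import time


def run_single_coll_test_sketch(matrices, n_iter, pre_run, run_coll, barrier):
    """Return one list of per-matrix times for each main iteration."""
    times = []
    for _ in range(n_iter):                 # main iterations
        row = []
        for matrix in matrices:             # inner loop: one pass per matrix
            pre_run(matrix)                 # set collective args for this matrix
            start = time.perf_counter()
            run_coll()                      # the timed alltoallv execution
            row.append(time.perf_counter() - start)
            barrier()                       # resynchronize ranks, not timed
        times.append(row)
    return times


# Usage with stub callables that only record the call order.
calls = []
times = run_single_coll_test_sketch(
    matrices=["m0", "m1"],
    n_iter=2,
    pre_run=lambda m: calls.append(("pre_run", m)),
    run_coll=lambda: calls.append(("run",)),
    barrier=lambda: calls.append(("barrier",)),
)
print(calls[:3])  # [('pre_run', 'm0'), ('run',), ('barrier',)]
```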
The send and receive buffers are allocated once at the beginning of the test; their allocation size is the largest size required (calculated across the matrices). Note that a real workload might spend additional time on the allocation and registration of these buffers (highly optimized in PyTorch); this is not the case here.
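One plausible way to compute the "largest size across the matrices" is sketched below. It assumes, without confirmation from the PR, that the send buffer of rank r must hold the sum of row r and its receive buffer the sum of column r, and that the maximum is taken over all matrices. The function name is hypothetical.

```python
def max_buffer_sizes(matrices, rank):
    """Return (send_bytes, recv_bytes) large enough for every matrix.

    Assumption: rank sends row `rank` and receives column `rank`.
    """
    send = max(sum(m[rank]) for m in matrices)
    recv = max(sum(row[rank] for row in m) for m in matrices)
    return send, recv


m0 = [[0, 4], [8, 0]]
m1 = [[2, 16], [1, 3]]
print(max_buffer_sizes([m0, m1], 0))  # (18, 8)
print(max_buffer_sizes([m0, m1], 1))  # (8, 19)
```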

Output interpretation

The main output (on stdout) prints the average, min and max time the ranks took to execute all the matrices, averaged over the main iterations (controlled by -n and -w). The only relevant figure here is the max time, as it represents the time it took, on average, to execute all the collectives (i.e. every matrix).
Note that a barrier is executed after each alltoallv execution, so the ranks start at the same time and the maximum time represents the time from when the first rank started the collective until the last rank finished. The time measurement does not include the barrier.

Additionally, the average execution time of each matrix is reported in the inner loop log (see Parameters). In every main iteration, one line is written to this log; the elements of the line are separated by a space, and the element at position k is the execution time of the matrix with index k.
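As an illustration of how this log could be post-processed, the sketch below averages each column (one column per matrix) over all lines (one line per main iteration). The function name is hypothetical, the sample values are made up, and the time unit is whatever the log uses, which is not stated here.

```python
def per_matrix_mean(log_text):
    """Average each column of the inner loop log over all main iterations."""
    rows = [
        [float(tok) for tok in line.split()]
        for line in log_text.splitlines()
        if line.strip()
    ]
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("every log line must have one value per matrix")
    return [sum(col) / len(rows) for col in zip(*rows)]


# Two main iterations, two matrices (made-up numbers).
sample = "1.0 4.0\n3.0 6.0\n"
print(per_matrix_mean(sample))  # [2.0, 5.0]
```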

Parameters

Each transfer matrix should be written in a file where each row contains its elements separated by spaces; each element is a number of bytes. Megabyte and gigabyte units are supported for convenience, in the format 5M or 1G.
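A minimal sketch of a parser for this file format is shown below. It is not the parser from the PR. In particular, it assumes that M and G are binary multiples (2^20 and 2^30 bytes), which the description does not specify, and that the matrix is square with one row and one column per rank.

```python
# Assumption: binary units. The PR text only says "megabytes" and "gigabytes".
UNITS = {"M": 1 << 20, "G": 1 << 30}


def parse_cell(token):
    """Parse one element such as '512', '5M' or '1G' into a byte count."""
    token = token.strip()
    if token and token[-1] in UNITS:
        return int(token[:-1]) * UNITS[token[-1]]
    return int(token)


def parse_matrix(text):
    """Parse a whole matrix file: one row per line, space-separated cells."""
    rows = [
        [parse_cell(tok) for tok in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]
    if any(len(r) != len(rows) for r in rows):
        raise ValueError("transfer matrix must be square (ranks x ranks)")
    return rows


print(parse_matrix("0 1M\n512 0\n"))  # [[0, 1048576], [512, 0]]
```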
Multiple transfer matrices are supported; they are executed one after the other. All the transfer matrices should be in one directory, and the path to this directory must be passed in the environment variable UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_DIR. The directory should contain only the matrix files, and the number of matrices should be passed in the environment variable UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT.
Each transfer matrix should be in a file named after the index of the matrix, so the file name should be a number between 0 and UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT-1 (the directory should contain files named 0, 1, ...).
Note that if UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT doesn't match the number of files in the directory, an error is thrown.
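The directory layout above can be prepared and checked with a small helper script such as the following sketch. Both function names are hypothetical, and the check only mirrors the naming rule described here (files 0 to COUNT-1 and nothing else); the actual validation done by perftest may differ.

```python
import os
import tempfile


def write_matrices(directory, matrices):
    """Write each matrix to a file named after its index (0, 1, ...)."""
    os.makedirs(directory, exist_ok=True)
    for idx, matrix in enumerate(matrices):
        with open(os.path.join(directory, str(idx)), "w") as f:
            for row in matrix:
                f.write(" ".join(str(cell) for cell in row) + "\n")


def check_directory(directory, count):
    """Raise if the directory does not contain exactly files 0..count-1."""
    found = set(os.listdir(directory))
    expected = {str(i) for i in range(count)}
    if found != expected:
        raise ValueError(f"expected files {sorted(expected)}, found {sorted(found)}")


with tempfile.TemporaryDirectory() as tmp:
    write_matrices(tmp, [[[0, "1M"], [512, 0]], [[0, 64], [64, 0]]])
    check_directory(tmp, 2)            # passes: files "0" and "1" exist
    print(sorted(os.listdir(tmp)))     # ['0', '1']
```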

The -b (min_count) and -e (max_count) arguments of ucc perftest are not relevant here and should be set to 0 (-b 0 -e 0), because they control the message size, which is controlled by the matrices here.

The -j (n_inner_iter) argument should be the same as UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT.

The absolute path of the inner loop log should be provided in the environment variable UCC_PT_COLL_INNER_LOG_FILE.
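Putting the parameters together, the sketch below builds the environment and argument list for one run without launching anything. The environment variables and the -b, -e, -j, -n and -w flags are the ones described above; the binary name ucc_perftest and the "-c alltoallv" selector are assumptions, and the MPI launcher is left out. The function name and the example paths are hypothetical.

```python
import os


def build_invocation(matrices_dir, count, inner_log, n_iter, n_warmup):
    """Return (env, argv) for one run; nothing is executed here."""
    env = {
        "UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_DIR": matrices_dir,
        "UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT": str(count),
        # The description asks for an absolute path.
        "UCC_PT_COLL_INNER_LOG_FILE": os.path.abspath(inner_log),
    }
    argv = [
        "ucc_perftest", "-c", "alltoallv",  # assumed binary name and selector
        "-b", "0", "-e", "0",               # message sizes come from the matrices
        "-j", str(count),                   # must equal the number of matrices
        "-n", str(n_iter),
        "-w", str(n_warmup),
    ]
    return env, argv


env, argv = build_invocation("/tmp/matrices", 3, "inner.log", 10, 2)
print(" ".join(argv))
```

The returned env would be merged into the process environment (for example with dict(os.environ, **env)) before the command is handed to the launcher.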

Notes

For now, an error is thrown if the transfer matrices are not used (normal alltoallv is disabled).
We know this PR is a bit hacky.

This PR extends #967.

swx-jenkins3 · 2024-05-02T16:38:36Z

Can one of the admins verify this patch?

manjugv · 2024-05-29T13:58:20Z

@lappazos can we close this PR?

x41lakazam and others added 18 commits April 16, 2024 18:01

add transfer matrix parsing, not tested

b9796c9

.

514c825

Add transfer matrix feature

6766ff0

add run tests

23aec09

fix bug overflow

9301399

run

f9efd76

Fix argument 'f' not in optstring

8954f19

Add iter as an argument

ace951f

Add support for multiple transfer matrices

e96d26b

Add small doc

b2aef51

Run edit

e636321

Fix indentation

6543585

Fix indentation

8b909dd

Add support for multiple transfer matrices

c9943c2

Fix review

50bde28

rm useless directory

62c780a

Remove debug lines

047468d

typo

d570a36

Remove throw

e8a9c3b

janjust added the WIP - Don't Merge label May 10, 2024

x41lakazam closed this May 29, 2024
