Alltoallv transfer matrices #973

Closed

@x41lakazam x41lakazam commented May 2, 2024

Currently, in the alltoallv perftest, every rank sends the same amount of data to every other rank. This PR makes it possible to define the amount of data that rank i sends to rank j using a "transfer matrix", where cell (i, j) holds that amount. If the amount is not a multiple of the data type size, it is rounded down.

Below are the details of the implementation, the output, and the parameters. We use "matrix execution" or "alltoallv execution" to mean the execution of one alltoallv using one transfer matrix.
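To make the matrix convention concrete, here is a minimal sketch (not the PR's code) of how one rank could derive its per-peer element counts from a transfer matrix. It assumes that row i holds what rank i sends and column i holds what rank i receives; the function name `counts_for_rank` and the sample matrix are hypothetical.

```python
def counts_for_rank(matrix, rank, dtype_size):
    """Return (send_counts, recv_counts) in elements for one rank.

    matrix[i][j] is the number of bytes rank i sends to rank j.
    Amounts that are not a multiple of dtype_size are rounded down.
    """
    n = len(matrix)
    # Row `rank`: bytes this rank sends to each peer, as whole elements.
    send_counts = [matrix[rank][j] // dtype_size for j in range(n)]
    # Column `rank`: bytes each peer sends to this rank, as whole elements.
    recv_counts = [matrix[i][rank] // dtype_size for i in range(n)]
    return send_counts, recv_counts


# Hypothetical 3-rank matrix, in bytes.
matrix = [
    [0, 10, 4],
    [8, 0, 7],
    [16, 3, 0],
]
send, recv = counts_for_rank(matrix, rank=1, dtype_size=4)
print(send)  # [2, 0, 1]
print(recv)  # [2, 0, 0]
```

With a 4-byte data type, the 7 bytes that rank 1 sends to rank 2 become one element, and the 3 bytes that rank 2 sends to rank 1 become zero elements.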

Implementation

Currently, ucc perftest runs N iterations for every message size. We decided to include all the alltoallv executions in each iteration. The run_single_coll_test method is responsible for running the N iterations; from now on we call them the main iterations.
To do so, we added an inner loop to run_single_coll_test that executes alltoallv with every transfer matrix provided. Therefore, every main iteration executes all the matrices.
At the beginning of the test, every matrix file is read and the matrices are stored in memory. A new method, coll->pre_run, modifies the collective arguments according to the current matrix; it is run in each inner-loop iteration before the collective is executed.
The send and receive buffers are allocated once at the beginning of the test, and their allocation size is the largest size required (calculated across the matrices). Note that a real workload might spend additional time on allocation and registration of these buffers (highly optimized in PyTorch); that is not the case here.
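The loop structure and buffer sizing described above can be sketched roughly as follows. This is an illustration in Python, not the C++ code of the PR: `run_single_coll_test` and `pre_run` only borrow the names from the text, `run_collective` and `max_buffer_bytes` are hypothetical, and the sizing rule (largest row sum and largest column sum for the rank across all matrices) is an assumption about what "calculated across the matrices" means.

```python
def max_buffer_bytes(matrices, rank):
    """Assumed sizing rule: the largest total this rank sends or receives."""
    send = max(sum(m[rank]) for m in matrices)                  # row sums
    recv = max(sum(row[rank] for row in m) for m in matrices)   # column sums
    return send, recv


def run_single_coll_test(matrices, n_main_iters, pre_run, run_collective):
    """Each main iteration executes every matrix once, in index order."""
    for _ in range(n_main_iters):
        for idx, matrix in enumerate(matrices):
            pre_run(idx, matrix)  # point the collective arguments at this matrix
            run_collective()      # one alltoallv execution


# Two hypothetical 2-rank matrices, in bytes.
matrices = [
    [[0, 4], [8, 0]],
    [[0, 16], [2, 0]],
]
print(max_buffer_bytes(matrices, rank=0))  # (16, 8)

order = []
run_single_coll_test(
    matrices,
    n_main_iters=2,
    pre_run=lambda idx, matrix: order.append(idx),
    run_collective=lambda: None,
)
print(order)  # [0, 1, 0, 1]
```

Allocating once for the worst case keeps allocation out of the timed region, which is why the measured times exclude the allocation and registration cost mentioned above.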

Output interpretation

The main output (on stdout) prints the average, min, and max time the ranks took to execute all the matrices, averaged over the main iterations (controlled by -n and -w). The only relevant value here is the max time, as it represents the time it took, on average, to execute all the collectives (i.e. every matrix).
Note that a barrier is executed after each alltoallv execution, so the ranks start at the same time, and the maximum time represents the time from when the first rank started the collective until the last rank finished. The time measurement does not include the barrier.

Additionally, the average execution time of each matrix is reported in the inner loop log (see Parameters). Every main iteration writes one line to this log; the elements of the line are separated by spaces, and the element at position k is the execution time of the matrix with index k.
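As a rough illustration of how these numbers relate, the sketch below aggregates hypothetical per-rank timings into the avg/min/max summary and formats one inner-log line. The data layout, the function names, and the six-decimal formatting are assumptions; the text above only fixes that the summary is averaged over main iterations and that log elements are space-separated.

```python
def summarize(times):
    """times[rank][iteration] = time to run all matrices in that main iteration.

    Returns (avg, min, max) across ranks of the per-rank mean over iterations.
    """
    per_rank = [sum(t) / len(t) for t in times]
    return sum(per_rank) / len(per_rank), min(per_rank), max(per_rank)


def inner_log_line(matrix_times):
    """One log line: element k is the execution time of matrix k."""
    return " ".join(f"{t:.6f}" for t in matrix_times)


# Hypothetical timings for 2 ranks and 2 main iterations.
times = [[1.0, 3.0], [2.0, 4.0]]
print(summarize(times))             # (2.5, 2.0, 3.0)
print(inner_log_line([0.5, 0.25]))  # 0.500000 0.250000
```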

Parameters

Each transfer matrix should be written in a file in which each row contains its elements separated by spaces, and each element is a number of bytes. Megabyte and gigabyte units are supported for convenience, in the format 1G or 5M.
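A minimal parsing sketch for this file format is shown below. It is not the parser from the PR: the names `parse_size` and `parse_matrix` are hypothetical, the M and G suffixes are assumed to be binary multiples (2^20 and 2^30 bytes), and the matrix is assumed to be square with one row and one column per rank.

```python
# Assumption: binary multiples. The description does not say whether
# 1M means 10**6 or 2**20 bytes.
UNITS = {"M": 1 << 20, "G": 1 << 30}


def parse_size(token):
    """Convert '5M', '1G', or a plain byte count to an integer number of bytes."""
    if token[-1] in UNITS:
        return int(token[:-1]) * UNITS[token[-1]]
    return int(token)


def parse_matrix(text):
    """Parse space-separated rows into a square list-of-lists of byte counts."""
    rows = [
        [parse_size(tok) for tok in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]
    if any(len(row) != len(rows) for row in rows):
        raise ValueError("transfer matrix must have one row and one column per rank")
    return rows


print(parse_matrix("0 1M\n5M 0\n"))  # [[0, 1048576], [5242880, 0]]
```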
Multiple transfer matrices are supported and are executed one after the other. All the transfer matrices should be placed in one directory, whose path is passed in the environment variable UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_DIR. The directory should contain only the matrix files, and the number of matrices should be passed in the environment variable UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT.
Each transfer matrix should be in a file named after the index of the matrix, so every file name is a number between 0 and UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT-1 (the directory should contain files named 0, 1, ...).
Note that if UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT doesn't match the number of files in the directory, an error will be thrown.
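The directory rules above can be checked with a small sketch like the following, which may be handy for validating a matrices directory before launching the test. The helper `load_matrix_files` is hypothetical and only mirrors the described behaviour (files named 0 to COUNT-1, nothing else in the directory, error on mismatch); it is not the PR's loader.

```python
import os
import tempfile


def load_matrix_files(directory, count):
    """Return the raw text of files 0..count-1, or raise on any mismatch."""
    names = sorted(os.listdir(directory))
    expected = sorted(str(i) for i in range(count))
    if names != expected:
        raise ValueError(f"expected files {expected}, found {names}")
    texts = []
    for i in range(count):
        with open(os.path.join(directory, str(i))) as f:
            texts.append(f.read())
    return texts


# Build a throwaway directory with two hypothetical matrices.
with tempfile.TemporaryDirectory() as d:
    for name, text in (("0", "0 4\n8 0\n"), ("1", "0 1M\n2M 0\n")):
        with open(os.path.join(d, name), "w") as f:
            f.write(text)
    texts = load_matrix_files(d, 2)
    try:
        load_matrix_files(d, 3)  # count does not match the directory
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(len(texts), mismatch_detected)  # 2 True
```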

The -b (min_count) and -e (max_count) arguments of ucc perftest are not relevant here and should be set to 0 (-b 0 -e 0), because they control the message size, which is controlled here by the matrices.

The -j (n_inner_iter) argument should be the same as UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT.

The inner loop log absolute path should be provided in the environment variable UCC_PT_COLL_INNER_LOG_FILE.
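Putting the parameters together, a launch could be prepared along these lines. This is only a sketch: the binary name `ucc_perftest` and the `-c alltoallv` collective selector are assumptions not stated in this description, the paths are hypothetical, and the MPI launcher is left out entirely. The environment variables and the -b, -e, -j, -n, and -w flags are the ones named above.

```python
import os


def build_invocation(matrices_dir, count, log_path, n_iters, n_warmup):
    """Return (env, argv) for one perftest run driven by transfer matrices."""
    if not os.path.isabs(log_path):
        raise ValueError("the inner loop log path must be absolute")
    env = {
        "UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_DIR": matrices_dir,
        "UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT": str(count),
        "UCC_PT_COLL_INNER_LOG_FILE": log_path,
    }
    argv = [
        "ucc_perftest", "-c", "alltoallv",  # assumed binary name and selector
        "-b", "0", "-e", "0",               # message size comes from the matrices
        "-j", str(count),                   # must equal the matrices count
        "-n", str(n_iters), "-w", str(n_warmup),
    ]
    return env, argv


# Hypothetical paths and iteration counts.
env, argv = build_invocation("/tmp/matrices", 4, "/tmp/inner.log", 10, 2)
print(" ".join(argv))
```

The returned `env` can be merged into `os.environ` and passed, together with `argv`, to whatever launcher the cluster uses.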

Notes

  • For now, an error is thrown if no transfer matrices are provided (normal alltoallv is disabled).
  • We know this PR is a bit hacky.

This PR extends #967.

@swx-jenkins3

Can one of the admins verify this patch?

@manjugv
Contributor

manjugv commented May 29, 2024

@lappazos can we close this PR?

@x41lakazam x41lakazam closed this May 29, 2024

4 participants