Alltoallv transfer matrices #973
Closed
x41lakazam wants to merge 19 commits into openucx:master from x41lakazam:alltoallv_transfer_matrix_multi
Currently, in the alltoallv perftest, every rank sends the same amount of data to every other rank. This PR makes it possible to define the amount of data rank i sends to rank j through a matrix called the "transfer matrix", where cell (i, j) holds that amount. If the amount is not a multiple of the data type size, it is rounded down.
Below are the details of the implementation, the output and the parameters. We call the execution of one alltoallv with one transfer matrix a "matrix execution" or "alltoallv execution".
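To make the matrix convention concrete, here is a minimal Python sketch of how one rank's alltoallv counts could be derived from a transfer matrix. The function name `alltoallv_args`, the 4-byte data type and the contiguous packing used for the displacements are assumptions made for illustration; they are not taken from the PR code.

```python
from itertools import accumulate


def alltoallv_args(matrix, rank, dt_size):
    """Derive counts and displacements (in elements) for one rank.

    matrix[i][j] is the number of bytes rank i sends to rank j.
    Byte amounts that are not a multiple of dt_size are rounded down.
    Displacements assume contiguous packing (an assumption of this sketch).
    """
    n = len(matrix)
    # Row `rank` holds what this rank sends to every peer.
    send_counts = [matrix[rank][j] // dt_size for j in range(n)]
    # Column `rank` holds what every peer sends to this rank.
    recv_counts = [matrix[i][rank] // dt_size for i in range(n)]
    # Exclusive prefix sums give the offset of each peer's block.
    send_displs = [0] + list(accumulate(send_counts))[:-1]
    recv_displs = [0] + list(accumulate(recv_counts))[:-1]
    return send_counts, send_displs, recv_counts, recv_displs


# Made-up 3-rank matrix in bytes; 10, 7, 9 and 3 are not multiples of 4.
matrix = [[0, 10, 7], [4, 0, 16], [9, 3, 0]]
args_rank1 = alltoallv_args(matrix, rank=1, dt_size=4)
```

With this made-up matrix, rank 1 receives 10 bytes from rank 0, which rounds down to 2 elements of 4 bytes, and the 3 bytes from rank 2 round down to 0 elements.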
Implementation
Currently, ucc perftest runs N iterations for every message size; we decided to include all the alltoallv executions in each iteration. The `run_single_coll_test` method is responsible for running the N iterations; from now on we call these the main iterations. To do so, we added an inner loop to `run_single_coll_test` that executes alltoallv with every transfer matrix provided. Therefore, in each main iteration, all the matrices are executed.
At the beginning of the test, every matrix file is read and the matrices are stored in memory. A method called `coll->pre_run` has been added to adjust the collective arguments to the current matrix; it runs in each inner-loop iteration, before the collective execution.
The send and receive buffers are allocated once at the beginning of the test, with the largest size required across all the matrices. Note that with a real workload there might be additional time for the allocation and registration of these buffers (highly optimized in PyTorch); this is not the case here.
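The one-time buffer sizing can be illustrated with a short Python sketch. It assumes each rank sizes its send buffer from its matrix row and its receive buffer from its matrix column, taking the maximum over all matrices; the PR text only says the size is the largest across the matrices, so the per-rank granularity and the name `buffer_sizes` are assumptions.

```python
def buffer_sizes(matrices, rank):
    """Return (send_bytes, recv_bytes) large enough for every matrix.

    matrices is a list of square byte matrices, matrix[i][j] being the
    amount rank i sends to rank j.
    """
    # Largest total this rank ever sends: sum of its row in each matrix.
    send = max(sum(m[rank]) for m in matrices)
    # Largest total this rank ever receives: sum of its column in each matrix.
    recv = max(sum(row[rank] for row in m) for m in matrices)
    return send, recv


# Two made-up 3-rank matrices, in bytes.
matrices = [
    [[0, 8, 8], [4, 0, 4], [16, 16, 0]],
    [[0, 32, 0], [8, 0, 8], [4, 4, 0]],
]
sizes_rank0 = buffer_sizes(matrices, 0)
```

Here rank 0 needs 32 bytes for sending (set by the second matrix) and 20 bytes for receiving (set by the first), so a buffer allocated once with these sizes fits every matrix execution.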
Output interpretation
The main output (on stdout) prints the average, min and max time it took the ranks to execute all the matrices, averaged over the main iterations (controlled by `-n` and `-w`). The only relevant value here is the max time, as it represents the time it took, on average, to execute all the collectives (i.e. every matrix).
Note that a barrier is executed after each alltoallv execution, so the ranks start at the same time and the maximum time represents the time from when the first rank started the collective until the last rank finished. The time measurement does not include the barrier.
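As a rough illustration of how the three reported numbers relate, the Python sketch below averages each rank's time over the main iterations and then takes the average, min and max across ranks. The order of aggregation, the function name `summarize` and the sample timings are assumptions of this sketch, not a description of the perftest internals.

```python
def summarize(rank_times):
    """rank_times[r][it] is the time rank r spent executing all matrices
    in main iteration it (barriers excluded).

    Returns (avg, min, max) across ranks of the per-rank mean time.
    """
    per_rank = [sum(times) / len(times) for times in rank_times]
    avg = sum(per_rank) / len(per_rank)
    return avg, min(per_rank), max(per_rank)


# Made-up timings for 3 ranks over 2 main iterations.
rank_times = [[10.0, 12.0], [14.0, 18.0], [9.0, 11.0]]
avg_time, min_time, max_time = summarize(rank_times)
```

In this made-up case the slowest rank averages 16.0 per main iteration, which is the value to read as the cost of executing all the matrices.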
Additionally, the average execution time of each matrix is reported in the inner loop log (see Parameters). In every main iteration, one line is written to this log; the elements of the line are separated by spaces, and the element at position k is the execution time of the matrix with index k.
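A log in this layout can be post-processed with a few lines of Python. The sketch below computes the mean time per matrix over the main iterations; the function name `per_matrix_means` and the sample values are made up, and no time unit is assumed.

```python
def per_matrix_means(log_text):
    """Mean execution time per matrix index.

    Each non-empty line is one main iteration; the k-th space-separated
    value on a line is the time of the matrix with index k.
    """
    rows = [
        [float(value) for value in line.split()]
        for line in log_text.splitlines()
        if line.strip()
    ]
    # zip(*rows) regroups the values by matrix index (column-wise).
    return [sum(column) / len(rows) for column in zip(*rows)]


# Made-up log: 2 main iterations, 3 matrices.
sample_log = "1.0 2.0 4.0\n3.0 2.0 8.0\n"
means = per_matrix_means(sample_log)
```

For the sample above, the matrix with index 2 is the slowest on average (6.0), which is the kind of per-matrix breakdown the stdout summary does not show.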
Parameters
The transfer matrix should be written in a file where each row contains the elements separated by spaces; each element is a number of bytes. Megabyte and gigabyte units are supported for convenience, in the format 1G or 5M.
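The file format can be illustrated with a small Python parser. This is a sketch, not the parser from the PR: it assumes that M and G are binary multiples (2^20 and 2^30 bytes), which the description does not specify, and that the matrix is square with one row and one column per rank.

```python
# Assumption: binary multiples; the PR text does not say 2^20 vs 10^6.
UNITS = {"M": 1 << 20, "G": 1 << 30}


def parse_size(token):
    """Convert '512', '5M' or '1G' to a number of bytes."""
    suffix = token[-1]
    if suffix in UNITS:
        return int(token[:-1]) * UNITS[suffix]
    return int(token)


def parse_matrix(text):
    """Parse a transfer matrix: one row per line, space-separated sizes."""
    rows = [
        [parse_size(token) for token in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]
    # Assumption: one row and one column per rank, so the matrix is square.
    if any(len(row) != len(rows) for row in rows):
        raise ValueError("transfer matrix must be square")
    return rows


# Made-up 3-rank matrix file content.
example = "0 1M 512\n2M 0 1G\n4096 5M 0\n"
matrix = parse_matrix(example)
```

In this example rank 1 sends 1G to rank 2 and rank 0 sends 512 bytes to rank 2, showing that plain byte counts and unit suffixes can be mixed in the same file.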
Multiple transfer matrices are supported; they are executed one after the other. All the transfer matrices should be in one directory, and the path to this directory needs to be passed in the environment variable `UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_DIR`. The directory should contain only the matrix files, and the number of matrices should be passed in the environment variable `UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT`.
Each transfer matrix should be in a file named after the index of the matrix, so the file name should be a number between 0 and `UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT-1` (the directory should contain files named `0`, `1`, ...). Note that if `UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT` doesn't match the number of files in the directory, an error is raised.
The `-b` (`min_count`) and `-e` (`max_count`) arguments of ucc perftest are not relevant here and should be set to 0 (`-b 0 -e 0`), because they control the message size, which is controlled here by the matrices.
The `-j` (`n_inner_iter`) argument should be equal to `UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT`.
The absolute path of the inner loop log should be provided in the environment variable `UCC_PT_COLL_INNER_LOG_FILE`.
Notes
This PR extends #967.
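Returning to the Parameters section, the following Python sketch prepares a matrices directory and the matching settings. The helper `write_matrices`, the temporary directory layout and the log file name are hypothetical; only the environment variable names and the `-b`, `-e` and `-j` arguments come from the description above. The launcher command itself (for example through mpirun) is deliberately left out.

```python
import os
import tempfile


def write_matrices(directory, matrices):
    """Write each matrix to a file named after its index (0, 1, ...)."""
    for index, matrix in enumerate(matrices):
        path = os.path.join(directory, str(index))
        with open(path, "w") as f:
            for row in matrix:
                f.write(" ".join(str(value) for value in row) + "\n")
    return len(matrices)


# Made-up 2-rank matrices; sizes may be plain bytes or use M/G suffixes.
matrices = [
    [[0, "1M"], ["2M", 0]],
    [[0, 4096], [8192, 0]],
]

workdir = tempfile.mkdtemp()  # returns an absolute path
matrices_dir = os.path.join(workdir, "matrices")
os.mkdir(matrices_dir)  # must contain only the matrix files
count = write_matrices(matrices_dir, matrices)

# Environment for the perftest processes.
env = {
    "UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_DIR": matrices_dir,
    "UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT": str(count),
    "UCC_PT_COLL_INNER_LOG_FILE": os.path.join(workdir, "inner_loop.log"),
}

# Message-size arguments are zeroed; -j matches the number of matrices.
perftest_args = ["-b", "0", "-e", "0", "-j", str(count)]
```

Keeping the log file outside the matrices directory matters in this sketch, since the directory is expected to contain only the matrix files and the file count is checked against `UCC_PT_COLL_ALLTOALLV_TRANSFER_MATRICES_COUNT`.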