-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Topic/tl ucp kn algs #361
base: master
Are you sure you want to change the base?
Topic/tl ucp kn algs #361
Conversation
622a074
to
e9fdc15
Compare
e9fdc15
to
ff59aa2
Compare
broot = coll_args->args.root; | ||
rank = VRANK(rank, broot, size); | ||
if (coll_args->args.coll_type == UCC_COLL_TYPE_ALLGATHER) { | ||
ucc_kn_ag_pattern_init(size, rank, radix, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern init is also part of ucc_tl_ucp_allgather_knomial_start function, do we really need to do it twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks you are right - not needed here. copy-pasted from reduce-scatter knomial. there pattern init is called there because i need to understand which ranks are extra/proxy - this info is part of kn pattern
} else { | ||
count = args->dst.info.count; | ||
start = NULL; | ||
ucc_kn_agx_pattern_init(size, rank, radix, count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't you need to use VRANK(rank) here for sag bcast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true - it is done above at line 201.
prevblock = | ||
ucc_ep_map_eval(task->reduce_scatter_ring.inv_map, prevblock); | ||
if (task->recv_completed == size - 1) { | ||
ucc_assert(prevblock == UCC_TL_TEAM_RANK(team)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of this assert? first checking prevblock is rank and assign to rank in next line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct - removed (part of debug)
team, task), | ||
task, out); | ||
|
||
recv_data_from = (rank - 2 - step + size) % size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this value is negative for rank 0 on last step
last step recv_data_from = (rank - 2 - (size - 1) + size) % size = (rank - 1) % size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it never gets there. on last step the loop will break at line 132.
} | ||
|
||
max_segcount = ucc_buffer_block_count(count, size, 0); | ||
//TODO need less mem - dvivide by n_frags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not to divide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
max_segcount = ucc_buffer_block_count(count, size, 0); | ||
//TODO need less mem - dvivide by n_frags | ||
to_alloc = max_segcount + (size > 2 ? 2 * max_segcount : 0); | ||
status = ucc_mc_alloc(&task->reduce_scatter_ring.scratch_mc_header, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to allocate one chunk of memory and give portion of it to each subset rather than allow each subset allocate it's own scratch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree. done.
"else - in result", | ||
ucc_offsetof(ucc_tl_ucp_lib_config_t, reduce_avg_pre_op), | ||
UCC_CONFIG_TYPE_BOOL}, | ||
|
||
{"REDUCE_SCATTER_RING_BIDIRECTIONAL", "y", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have some perf data of unidirectional vs bidirectional? should we add bidirectional implementation for other ring algorithms e.g. allgather?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added to PR desc. Maybe it worth extending ag as well at some point. It is not always beneficial (e.g., if ring covers CPU SHM then there is no gain, since shm is not bibw).
} | ||
for (i = 0; i < map.ep_num; i++) { | ||
r = (ucc_rank_t)ucc_ep_map_eval(map, i); | ||
*((ucc_rank_t *)PTR_OFFSET(inv.array.map, sizeof(ucc_rank_t) * r)) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just inv.array.map[r] = i ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map is void* inside array (since we support it to be 32 and 64 bits)
test/mpi/test_reduce_scatterv.cc
Outdated
return status; | ||
} | ||
} | ||
if (inplace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. remove currently. will add back, when we add MPI-like inplace support for RS and RSv.
_buf; \ | ||
}) | ||
|
||
#define GET_DT(_args) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these functions seem repetitive to what you did in allgather_knomial.c - perhaps make a more general function in tl_ucp_coll.h?
b28fb4e
to
3d1bd26
Compare
adjust count depending on comm size add reduce_scatterV
3d1bd26
to
99d3a68
Compare
bot:retest |
99d3a68
to
9213fae
Compare
i will split PR into smaller pieces |
What
TL/UCP
Adds implementation of knomial reduce-scatter, allgather, reduce-scatterv, allgatherv
Adds ring reduce_scatter
Selection between algs
Tests
Why ?
Generalize existing knomial pattern for other use-cases - expose to user (not only for sra kn allreduce)
ReduceScatter Ring Bi- vs Uni- directional data: host mem, 8 jazz nodes, ppn=1: