Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Topic/tl ucp kn algs #361

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
Open

Conversation

vspetrov
Copy link
Collaborator

@vspetrov vspetrov commented Dec 6, 2021

What

TL/UCP
Adds implementation of knomial reduce-scatter, allgather, reduce-scatterv, allgatherv
Adds ring reduce_scatter
Selection between algs
Tests

Why ?

Generalize existing knomial pattern for other use-cases - expose to user (not only for sra kn allreduce)

ReduceScatter Ring Bi- vs Uni- directional data: host mem, 8 jazz nodes, ppn=1:

[~/workspace/ucc/build_rel]
[valentinp@hpchead]> mpirun -x UCX_NET_DEVICES=mlx5_0:1 -x UCC_TL_UCP_REDUCE_SCATTER_RING_BIDIRECTIONAL=y -x UCC_TL_UCP_TUNE=reduce_scatter:@ring -x UCC_TLS=ucp -np 8 --map-by node ./install/bin/ucc_perftest -c reduce_scatter -b 8192 -e $((2**20))
Collective:             ReduceScatter
Memory type:            host
Datatype:               float32
Reduction:              sum
Inplace:                0
Warmup:
  small                 100
  large                 20
Iterations:
  small                 1000
  large                 200

       Count        Size                Time, us
                                 avg         min         max
        8192       32768       53.07       52.87       53.31
       16384       65536       86.33       85.85       87.37
       32768      131072      147.08      146.04      149.74
       65536      262144      271.16      270.32      272.31
      131072      524288      544.31      543.22      546.50
      262144     1048576     1133.39     1121.06     1140.96
      524288     2097152     2328.61     2311.57     2344.70
     1048576     4194304     4800.29     4766.70     4815.62

[~/workspace/ucc/build_rel]
[valentinp@hpchead]> mpirun -x UCX_NET_DEVICES=mlx5_0:1 -x UCC_TL_UCP_REDUCE_SCATTER_RING_BIDIRECTIONAL=n -x UCC_TL_UCP_TUNE=reduce_scatter:@ring -x UCC_TLS=ucp -np 8 --map-by node ./install/bin/ucc_perftest -c reduce_scatter -b 8192 -e $((2**20))
Collective:             ReduceScatter
Memory type:            host
Datatype:               float32
Reduction:              sum
Inplace:                0
Warmup:
  small                 100
  large                 20
Iterations:
  small                 1000
  large                 200

       Count        Size                Time, us
                                 avg         min         max
        8192       32768       66.90       66.21       67.46
       16384       65536      103.88      100.91      108.14
       32768      131072      175.62      167.65      183.85
       65536      262144      305.82      304.03      307.39
      131072      524288      626.65      601.01      637.28
      262144     1048576     1254.95     1228.15     1288.78
      524288     2097152     2546.16     2505.00     2576.32
     1048576     4194304     5181.85     5131.23     5211.40

broot = coll_args->args.root;
rank = VRANK(rank, broot, size);
if (coll_args->args.coll_type == UCC_COLL_TYPE_ALLGATHER) {
ucc_kn_ag_pattern_init(size, rank, radix,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pattern init is also part of ucc_tl_ucp_allgather_knomial_start function, do we really need to do it twice?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks you are right - not needed here. copy-pasted from reduce-scatter knomial. there pattern init is called there because i need to understand which ranks are extra/proxy - this info is part of kn pattern

src/components/tl/ucp/tl_ucp_coll.c Outdated Show resolved Hide resolved
} else {
count = args->dst.info.count;
start = NULL;
ucc_kn_agx_pattern_init(size, rank, radix, count,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't you need to use VRANK(rank) here for sag bcast?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true - it is done above at line 201.

prevblock =
ucc_ep_map_eval(task->reduce_scatter_ring.inv_map, prevblock);
if (task->recv_completed == size - 1) {
ucc_assert(prevblock == UCC_TL_TEAM_RANK(team));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this assert? first checking prevblock is rank and assign to rank in next line

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct - removed (part of debug)

team, task),
task, out);

recv_data_from = (rank - 2 - step + size) % size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this value is negative for rank 0 on last step
last step recv_data_from = (rank - 2 - (size - 1) + size) % size = (rank - 1) % size

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it never gets there. on last step the loop will break at line 132.

}

max_segcount = ucc_buffer_block_count(count, size, 0);
//TODO need less mem - dvivide by n_frags
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not to divide?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

max_segcount = ucc_buffer_block_count(count, size, 0);
//TODO need less mem - dvivide by n_frags
to_alloc = max_segcount + (size > 2 ? 2 * max_segcount : 0);
status = ucc_mc_alloc(&task->reduce_scatter_ring.scratch_mc_header,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to allocate one chunk of memory and give portion of it to each subset rather than allow each subset allocate it's own scratch

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. done.

"else - in result",
ucc_offsetof(ucc_tl_ucp_lib_config_t, reduce_avg_pre_op),
UCC_CONFIG_TYPE_BOOL},

{"REDUCE_SCATTER_RING_BIDIRECTIONAL", "y",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have some perf data of unidirectional vs bidirectional? should we add bidirectional implementation for other ring algorithms e.g. allgather?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to PR desc. Maybe it worth extending ag as well at some point. It is not always beneficial (e.g., if ring covers CPU SHM then there is no gain, since shm is not bibw).

}
for (i = 0; i < map.ep_num; i++) {
r = (ucc_rank_t)ucc_ep_map_eval(map, i);
*((ucc_rank_t *)PTR_OFFSET(inv.array.map, sizeof(ucc_rank_t) * r)) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just inv.array.map[r] = i ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map is void* inside array (since we support it to be 32 and 64 bits)

return status;
}
}
if (inplace) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. remove currently. will add back, when we add MPI-like inplace support for RS and RSv.

src/coll_patterns/recursive_knomial.h Outdated Show resolved Hide resolved
src/components/tl/ucp/reduce_scatter/reduce_scatter_ring.c Outdated Show resolved Hide resolved
src/coll_patterns/sra_knomial.h Show resolved Hide resolved
_buf; \
})

#define GET_DT(_args) \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these functions seem repetitive to what you did in allgather_knomial.c - perhaps make a more general function in tl_ucp_coll.h?

@vspetrov vspetrov force-pushed the topic/tl_ucp_kn_algs branch 6 times, most recently from b28fb4e to 3d1bd26 Compare January 13, 2022 15:20
@vspetrov
Copy link
Collaborator Author

bot:retest

@vspetrov
Copy link
Collaborator Author

vspetrov commented Feb 1, 2022

i will split PR into smaller pieces

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants