Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/UCP: Add Sliding Window allreduce implementation #958

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nsarka
Copy link
Collaborator

@nsarka nsarka commented Apr 10, 2024

This PR is the second in a series of two that will add Sliding Window allreduce to UCC. This one implements the function stubs left by the first PR (#902)

@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch 2 times, most recently from 97fa01b to 24b3d51 Compare April 10, 2024 14:56
@swx-jenkins3
Copy link

Can one of the admins verify this patch?

@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch 3 times, most recently from e6b3294 to 0802106 Compare April 11, 2024 14:37
@Sergei-Lebedev
Copy link
Contributor

ok to test

@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch 3 times, most recently from 139ee8d to f46c19d Compare April 23, 2024 14:20
@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch from f46c19d to 288ec3d Compare May 6, 2024 14:20
@nsarka
Copy link
Collaborator Author

nsarka commented May 8, 2024

@Sergei-Lebedev It seems like the same ucc test is failing as the active_set. However, this one fails because of wrong cuda versions: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown

@janjust
Copy link
Collaborator

janjust commented May 9, 2024

@artemry-nv Seems like we have CI issues on this PR, unrelated to the PR

@artemry-nv
Copy link
Collaborator

@B-a-S please take a look

@B-a-S
Copy link
Collaborator

B-a-S commented May 10, 2024

@B-a-S please take a look

I've rerun the job. Take a look http://hpc-master.lab.mtl.com:8080/job/ucc/3282/

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks nice and readable, thanks! I left some comments, most importantly about memory leaks

src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch 6 times, most recently from f7ff83a to 63789d9 Compare May 16, 2024 16:50
@nsarka
Copy link
Collaborator Author

nsarka commented May 16, 2024

The ucc gtest failed on

[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4
[2024-05-16T18:13:52.173Z] [swx-clx01:196  :0:196] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch 2 times, most recently from 33f6db6 to 9625f72 Compare May 23, 2024 17:38
@nsarka nsarka requested a review from janjust May 29, 2024 16:05
@nsarka
Copy link
Collaborator Author

nsarka commented May 29, 2024

I updated the PR. The get/reduce/put phase and the barrier part of the algorithm are now run via schedule. I left the allgather phase the way it was inside of the get/reduce/put phase because once Ferrol's PR goes in, I will be removing the allgather and using that instead for key exchange.

Also, I moved some code to two new files, tl_ucp_dpu_offload.c and tl_ucp_dpu_offload.h. These have common code that will be used for the rest of the collectives we're planning on implementing.

@nsarka nsarka marked this pull request as draft May 29, 2024 16:16
@nsarka
Copy link
Collaborator Author

nsarka commented May 29, 2024

Please wait to review, there are some failures I should fix first.

@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch from 930e828 to 1d084bb Compare May 29, 2024 21:07
@artemry-nv
Copy link
Collaborator

bot:retest

@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch 2 times, most recently from 7953c46 to 91391f4 Compare May 30, 2024 15:25
@nsarka nsarka marked this pull request as ready for review May 30, 2024 20:39
@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch from 91391f4 to 417f5aa Compare May 31, 2024 20:59
@nsarka
Copy link
Collaborator Author

nsarka commented May 31, 2024

@samnordmann The PR is ready for review, thank you. Please note that the allgather task is still part of the algorithm. Once Ferrol's PR goes in I will convert the code to use that for the import/allgather phase of the algorithm. I also left the reduction as part of the algorithm, since it would involve splitting the gets and puts into tasks of their own as well. There would be thousands of these tasks for large message sizes.

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for addressing the comments. I still have left a couple of new minor comments

@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch 2 times, most recently from 8f3ed31 to 8e0381f Compare June 4, 2024 17:54
@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch 2 times, most recently from e755399 to 355fa4f Compare June 4, 2024 18:10
src/components/tl/ucp/allreduce/allreduce.c Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp_dpu_offload.c Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp_dpu_offload.c Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp_dpu_offload.c Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch from 355fa4f to 30b21f9 Compare June 5, 2024 15:21
src/components/tl/ucp/allreduce/allreduce.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce.h Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch 2 times, most recently from 3f13f77 to 2cfdab7 Compare June 6, 2024 17:56
@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch from 2cfdab7 to 7609fbd Compare June 7, 2024 17:06
Copy link
Contributor

@Sergei-Lebedev Sergei-Lebedev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow up changes:

  1. Use single allocation instead of multiple mallocs
  2. Get rid of explicit ucp request objects and use callbacks
  3. Optimize ucp_worker calls

src/components/tl/ucp/allreduce/allreduce.h Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allreduce/allreduce_sliding_window.c Outdated Show resolved Hide resolved
@janjust
Copy link
Collaborator

janjust commented Jun 10, 2024

@Sergei-Lebedev When you say follow up changes - are you talking in a separate PR after this is merged? If so, I'll open up an issue and tag this so we don't forget.

@nsarka nsarka force-pushed the nsarka/sliding-window-allreduce-final branch from 7609fbd to bed397d Compare June 10, 2024 16:47
@nsarka
Copy link
Collaborator Author

nsarka commented Jun 10, 2024

@Sergei-Lebedev @samnordmann The PR is ready to be merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants