Extending LBANN Distconv Interface #2133

Open
szaman19 opened this issue Aug 9, 2022 · 0 comments
szaman19 commented Aug 9, 2022

The LBANN Distconv adapter for layers mandates that only the first input tensor to a distconv-enabled layer can be a non-DiHydrogen tensor; an error is raised if any other input tensor requires a copy into a DiHydrogen tensor. The following checks enforce this:

```cpp
if (index != 0) LBANN_ERROR("Copyin of non-first tensor not supported yet");

LBANN_ERROR("Copyin non-first tensor not supported");

LBANN_ERROR(layer().get_name(), ": copyin of non-first tensor not supported");

LBANN_ERROR(layer().get_name(), ": Copyout of non-first tensor not supported");

LBANN_ERROR(layer().get_name(), ": copyin of non-first tensor not supported");

LBANN_ERROR(layer().get_name(), ": Copyout of non-first tensor not supported");
```

While these checks worked for the original DC layers (Convolution, MSE, ReLU), newer DC layers such as Scatter, Gather, and MatMul generally have more than one input that may need to be copied into DiHydrogen tensors, so ideally we should support multiple parent tensors requiring a copy. Simply removing the checks resulted in failing CI tests.
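To illustrate the intended direction, here is a minimal self-contained sketch of relaxing the first-tensor-only restriction. All names here (`Tensor`, `copy_in_first_only`, `copy_in_all`, the `in_dc_format` flag) are hypothetical stand-ins, not the real LBANN/DiHydrogen API; the point is only the control-flow change from erroring on `index != 0` to looping over every parent tensor that needs a copy:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an input tensor: it only tracks whether the
// data has already been copied into the DiHydrogen-backed representation.
struct Tensor {
  std::vector<float> data;
  bool in_dc_format = false;
};

// Current behavior (mirrors the checks above): only the first parent
// tensor may require a copy-in; any other index is an error.
void copy_in_first_only(std::vector<Tensor>& parents, std::size_t index) {
  if (index != 0) {
    throw std::runtime_error("Copyin of non-first tensor not supported yet");
  }
  parents[index].in_dc_format = true;  // pretend we copied into DiHydrogen
}

// Proposed extension: copy in every parent tensor that is not yet in the
// DiHydrogen format, instead of erroring when index != 0.
void copy_in_all(std::vector<Tensor>& parents) {
  for (auto& t : parents) {
    if (!t.in_dc_format) {
      t.in_dc_format = true;  // copy this non-DiHydrogen input as well
    }
  }
}
```

As the failing CI runs suggest, the real change is more involved than this loop (buffer setup and synchronization for each copied tensor would also need handling), but the per-parent iteration is the shape of the interface extension being requested.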

A possible workaround that uses an Identity layer as a copy layer also has issues: #2126
