Hang when convolution layers have unused bias weights #2074

Open
timmoon10 opened this issue Mar 8, 2022 · 0 comments
timmoon10 commented Mar 8, 2022

This issue is mitigated by #2073, which makes it take active effort to create unused bias weights in the convolution and fully-connected layers. It is kept as a record in case we run into a similar problem in the future or when we refactor the weights class.

Description

@samadejacobs has observed hangs when training models with many convolution layers. On Lassen, he sees that either all even ranks or all odd ranks get stuck in an asynchronous allreduce in model::reconcile_weights:

for (auto& req : reqs) { m_comm->wait(req); }

@benson31 has reproduced the hang on Pascal (edit: it was on Lassen), although without the even/odd rank behavior. He observes that several non-blocking allreduces on CPU data return invalid request objects instead of the expected MPI_REQUEST_NULL.

Minimal reproducer on Lassen
import lbann
import lbann.modules
import lbann.contrib.launcher

# ----------------------------------
# Construct layer graph
# ----------------------------------

num_layers = 128
x = lbann.Reshape(
    lbann.Input(data_field='samples'),
    dims='1 1 1'
)
for i in range(num_layers):
    conv = lbann.modules.Convolution2dModule(
        1,
        1,
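        # Deliberately create an unused bias weights object: bias is
        # disabled, so the convolution layer never configures the
        # second weights object passed below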
        bias=False,
        weights=[lbann.Weights(), lbann.Weights()],
    )
    x = conv(x)

# ----------------------------------
# Dummy input
# ----------------------------------

reader = lbann.reader_pb2.DataReader()
_reader = reader.reader.add()
_reader.name = 'synthetic'
_reader.role = 'train'
_reader.num_samples = 4
_reader.num_labels = 1
_reader.synth_dimensions = '1'
_reader.percent_of_data_to_use = 1.0
_reader = reader.reader.add()
_reader.name = 'synthetic'
_reader.role = 'validate'
_reader.num_samples = 4
_reader.num_labels = 1
_reader.synth_dimensions = '1'
_reader.percent_of_data_to_use = 1.0

# ----------------------------------
# Setup experiment
# ----------------------------------

# Setup model
model = lbann.Model(
    2,
    layers=lbann.traverse_layer_graph([x]),
    objective_function=x,
    callbacks=[
        lbann.CallbackPrint(),
        lbann.CallbackTimer(),
    ],
)

# Setup optimizer
opt = lbann.SGD()

# Setup trainer
trainer = lbann.Trainer(
    mini_batch_size=4,
)

# ----------------------------------
# Run experiment
# ----------------------------------

lbann.contrib.launcher.run(
    trainer, model, reader, opt,
)

Interestingly, the hang shows up with num_layers=128 but not with num_layers=127.

Proposal

I think this is happening because we are constructing weights objects that are not properly configured before the setup stage. In particular, the weights dims and data distribution are set in layer::setup_data. The convolution layer can accept two weights objects, but it doesn't configure the second one if bias is disabled.
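
For comparison, here is a sketch of the layer construction that presumably avoids the problem, since only the kernel weights are passed and no unused bias weights object is ever created:

conv = lbann.modules.Convolution2dModule(
    1,
    1,
    bias=False,
    weights=[lbann.Weights()],  # kernel weights only; no unused bias weights
)
x = conv(x)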

I think the best solution would be to force the user (or the Python front-end) to fully and explicitly configure the dims and distribution of any weights objects they create. If the user doesn't provide weights, a layer can just create its own with the right configuration, so this is not much less convenient than our current approach. This would remove the messy two-way interaction between weights and layers, allow for weights that are not owned by any layer, and be more convenient for importing weights and for sub-grid/sub-graph parallelism.
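
As a rough illustration only, fully explicit construction might look something like this in the Python front-end. The dims and distribution arguments are hypothetical and do not exist in the current lbann.Weights API; they are just meant to show the weights object being completely configured at construction time:

# Hypothetical sketch: dims and distribution are not real lbann.Weights
# arguments today; under the proposal they would be required up front
kernel = lbann.Weights(
    initializer=lbann.ConstantInitializer(value=1),
    name='conv_kernel',
    dims=[1, 1, 1, 1],         # hypothetical: output/input channels, kernel dims
    distribution='STAR_STAR',  # hypothetical: data distribution
)
conv = lbann.modules.Convolution2dModule(
    1,
    1,
    bias=False,
    weights=[kernel],
)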

timmoon10 added the bug label on Mar 8, 2022