Hang when convolution layers have unused bias weights #2074

Open
timmoon10 opened this issue Mar 8, 2022 · 0 comments
timmoon10 commented Mar 8, 2022

This issue is mitigated by #2073, which makes it take active effort to create unused bias weights in the convolution and fully-connected layers. It is kept as a record in case we run into a similar problem in the future or when we refactor the weights class.

Description

@samadejacobs has observed hangs when training models with many convolution layers. On Lassen, he sees that either all even ranks or all odd ranks get stuck in an asynchronous allreduce in model::reconcile_weights:

for (auto& req : reqs) { m_comm->wait(req); }

@benson31 has reproduced the hang on Pascal (edit: it was on Lassen), although without the even/odd rank behavior. He observes that several non-blocking allreduces on CPU data return invalid request objects instead of the expected MPI_REQUEST_NULL.

Minimal reproducer on Lassen
import lbann
import lbann.modules
import lbann.contrib.launcher

# ----------------------------------
# Construct layer graph
# ----------------------------------

num_layers = 128
x = lbann.Reshape(
    lbann.Input(data_field='samples'),
    dims='1 1 1'
)
for i in range(num_layers):
    conv = lbann.modules.Convolution2dModule(
        1,
        1,
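        # Deliberately create an unused bias weights object: bias is
        # disabled, so the convolution layer never configures the
        # second weights object passed below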
        bias=False,
        weights=[lbann.Weights(), lbann.Weights()],
    )
    x = conv(x)

# ----------------------------------
# Dummy input
# ----------------------------------

reader = lbann.reader_pb2.DataReader()
_reader = reader.reader.add()
_reader.name = 'synthetic'
_reader.role = 'train'
_reader.num_samples = 4
_reader.num_labels = 1
_reader.synth_dimensions = '1'
_reader.percent_of_data_to_use = 1.0
_reader = reader.reader.add()
_reader.name = 'synthetic'
_reader.role = 'validate'
_reader.num_samples = 4
_reader.num_labels = 1
_reader.synth_dimensions = '1'
_reader.percent_of_data_to_use = 1.0

# ----------------------------------
# Setup experiment
# ----------------------------------

# Setup model
model = lbann.Model(
    2,
    layers=lbann.traverse_layer_graph([x]),
    objective_function=x,
    callbacks=[
        lbann.CallbackPrint(),
        lbann.CallbackTimer(),
    ],
)

# Setup optimizer
opt = lbann.SGD()

# Setup trainer
trainer = lbann.Trainer(
    mini_batch_size=4,
)

# ----------------------------------
# Run experiment
# ----------------------------------

lbann.contrib.launcher.run(
    trainer, model, reader, opt,
)

Interestingly, the hang shows up with num_layers=128 but not with num_layers=127.

Proposal

I think this is happening because we are constructing weights objects that are not properly configured before the setup stage. In particular, the weights dims and data distribution are set in layer::setup_data. The convolution layer can accept two weights objects, but it doesn't configure the second one if bias is disabled.
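
For comparison, here is a sketch of the layer construction that presumably avoids the problem, since only the kernel weights are passed and no unused bias weights object is ever created:

conv = lbann.modules.Convolution2dModule(
    1,
    1,
    bias=False,
    weights=[lbann.Weights()],  # kernel weights only; no unused bias weights
)
x = conv(x)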

I think the best solution would be to force the user (or the Python front-end) to fully and explicitly configure the dims and distribution of any weights objects they create. If the user doesn't provide weights, a layer can just create its own with the right configuration, so this is not much less convenient than our current approach. This would remove the messy two-way interaction between weights and layers, allow for weights that are not owned by any layer, and be more convenient for importing weights and for sub-grid/sub-graph parallelism.
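
As a rough illustration only, fully explicit construction might look something like this in the Python front-end. The dims and distribution arguments are hypothetical and do not exist in the current lbann.Weights API; they are just meant to show the weights object being completely configured at construction time:

# Hypothetical sketch: dims and distribution are not real lbann.Weights
# arguments today; under the proposal they would be required up front
kernel = lbann.Weights(
    initializer=lbann.ConstantInitializer(value=1),
    name='conv_kernel',
    dims=[1, 1, 1, 1],         # hypothetical: output/input channels, kernel dims
    distribution='STAR_STAR',  # hypothetical: data distribution
)
conv = lbann.modules.Convolution2dModule(
    1,
    1,
    bias=False,
    weights=[kernel],
)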

timmoon10 added the bug label on Mar 8, 2022