Fix memory leak with CTC training script on Chinese languages #30358
This PR fixes a memory leak that occurs during the evaluation phase of the CTC training script when training a CTC model on a language with a very large vocabulary. I ran into this issue while training on Chinese, but it may affect other languages as well. The problem is that, even though evaluation is batched, the logits from every batch are kept on the GPU, so the logits for the entire evaluation set accumulate in GPU memory. For most languages this is manageable, but it quickly runs out of CUDA memory when the vocabulary is as large as it is for Chinese.
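At a high level, the idea is to shrink what gets accumulated during evaluation so that it no longer scales with the vocabulary size. Below is a minimal sketch of that approach using the Trainer's `preprocess_logits_for_metrics` hook; the names `model`, `training_args`, `train_dataset`, `eval_dataset`, and `compute_metrics` stand in for the objects already defined in the training script, and the exact change is in this PR's diff rather than this snippet:

```python
import torch
from transformers import Trainer

def preprocess_logits_for_metrics(logits, labels):
    # CTC models emit per-frame logits of shape (batch, time, vocab_size).
    # Reducing them to predicted token IDs here keeps only a (batch, time)
    # tensor on the GPU, which is all that compute_metrics needs for decoding.
    if isinstance(logits, tuple):
        logits = logits[0]
    return torch.argmax(logits, dim=-1)

# model, training_args, datasets and compute_metrics are assumed to be set up
# earlier in the script, as in the existing CTC example.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
```

With this hook in place, only small integer tensors of predicted IDs are gathered across evaluation batches instead of the full vocabulary-sized logits.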
Some community discussion of this problem here: https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941
The bug may be reproduced using this command:
cc @sanchit-gandhi