Fix memory leak with CTC training script on Chinese languages #30358

Merged: 2 commits merged into huggingface:main on May 2, 2024

Conversation

@lucky-bai (Contributor) commented Apr 19, 2024

This PR fixes a memory leak that occurs during the evaluation phase of the CTC training script when training a CTC model on a language with a very large vocabulary. I ran into this issue while training for Chinese, but it may affect other languages as well. The problem arises because, even though evaluation is batched, the logits from every batch are kept on the GPU, so the logits for the entire evaluation set accumulate in GPU memory. This is not much of a problem for most languages, but it quickly runs out of CUDA memory when the vocabulary is as large as it is for Chinese.

Some community discussion of this problem here: https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941
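For context (the diff itself is not shown in this thread), the usual mitigation for this class of OOM is the `preprocess_logits_for_metrics` hook on `transformers.Trainer`, which reduces the logits to argmax token ids while they are still on the GPU, before the Trainer gathers them across evaluation steps. Below is a minimal sketch of that pattern; the shapes and dummy tensors are illustrative, and the exact change in this PR may differ:

import torch

def preprocess_logits_for_metrics(logits, labels):
    # Collapse the (batch, time, vocab) CTC logits to (batch, time) argmax ids
    # while still on the GPU, so only the small id tensor is accumulated
    # across evaluation steps instead of the full vocabulary-sized logits.
    return torch.argmax(logits, dim=-1)

# Rough size check with dummy shapes (batch=16, frames=500, vocab ~21k, as for
# a Chinese tokenizer): the fp16 logits take ~336 MB per batch, the ids ~0.06 MB.
logits = torch.zeros(16, 500, 21000, dtype=torch.float16)
labels = torch.zeros(16, 100, dtype=torch.long)
pred_ids = preprocess_logits_for_metrics(logits, labels)
print(logits.nelement() * logits.element_size() / 1e6, "MB ->",
      pred_ids.nelement() * pred_ids.element_size() / 1e6, "MB")

The hook is wired up by passing preprocess_logits_for_metrics=preprocess_logits_for_metrics to the Trainer constructor; compute_metrics then receives the argmax ids rather than the raw logits.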

The bug may be reproduced using this command:

python run_speech_recognition_ctc.py \
    --dataset_name="mozilla-foundation/common_voice_11_0" \
    --model_name_or_path="facebook/wav2vec2-xls-r-300m" \
    --tokenizer_name_or_path="jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn" \
    --dataset_config_name="yue" \
    --output_dir="./wav2vec2-common_voice-tr-demo" \
    --overwrite_output_dir \
    --num_train_epochs="15" \
    --per_device_train_batch_size="16" \
    --gradient_accumulation_steps="2" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --text_column_name="sentence" \
    --length_column_name="input_length" \
    --save_steps="400" \
    --eval_steps="100" \
    --layerdrop="0.0" \
    --save_total_limit="3" \
    --freeze_feature_encoder \
    --gradient_checkpointing \
    --chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
    --fp16 \
    --group_by_length \
    --push_to_hub \
    --do_eval

cc @sanchit-gandhi

lucky-bai and others added 2 commits April 19, 2024 15:10

@sanchit-gandhi (Contributor) left a comment


LGTM - thanks @lucky-bai! cc @kamilakesbi

@sanchit-gandhi requested a review from @amyeroberts May 1, 2024 17:21
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment


Thanks for digging into this and fixing!

@amyeroberts merged commit 12c5544 into huggingface:main May 2, 2024
8 checks passed