SIGKILL on save steps #2722

Closed

emillykkejensen opened this issue Apr 29, 2024 · 3 comments

@emillykkejensen

System Info

- `Accelerate` version: 0.27.2
- Platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 186.70 GB
- GPU type: NVIDIA A10G
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I'm trying to fine-tune Llama-3-8B on an AWS g5.12xlarge with 4 A10Gs. However, when the model is saved as part of save_steps in the training config, training breaks. It looks like it runs out of memory. How can I resolve that?
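
Roughly, the setup looks like the following. This is a minimal sketch rather than my exact script: the model id, dataset and hyperparameters are placeholders, and the relevant parts are save_steps=100 and launching finetune.py with accelerate launch using the DeepSpeed config above.

```python
# finetune.py -- minimal sketch (placeholders, not the exact script).
# Launched with: accelerate launch --config_file <accelerate_config.yaml> finetune.py
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Illustrative model id and dataset; only the overall shape matters here.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder dataset standing in for the real instruction data.
train_dataset = Dataset.from_dict({"text": ["Dette er en test."] * 512}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

training_args = TrainingArguments(
    output_dir="./Llama-3-instruct-dansk",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    bf16=True,
    logging_steps=20,
    save_steps=100,  # the SIGKILL hits while this checkpoint is being written
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```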

Error log:

[INFO|configuration_utils.py:471] 2024-04-29 10:02:05,317 >> Configuration saved in /tmp/tmpx8kp8fha/config.json                                                          
[INFO|configuration_utils.py:705] 2024-04-29 10:02:05,317 >> Configuration saved in /tmp/tmpx8kp8fha/generation_config.json                                               
[INFO|modeling_utils.py:2592] 2024-04-29 10:02:05,331 >> Model weights saved in /tmp/tmpx8kp8fha/model.safetensors                                                        
{'loss': 1.4653, 'grad_norm': 5.406085014343262, 'learning_rate': 3.2520325203252037e-06, 'epoch': 0.032679738562091505}                                                  
{'loss': 1.25, 'grad_norm': 4.077666282653809, 'learning_rate': 6.504065040650407e-06, 'epoch': 0.06535947712418301}                                                      
{'loss': 1.1815, 'grad_norm': 4.065021514892578, 'learning_rate': 9.756097560975611e-06, 'epoch': 0.09803921568627451}                                                    
{'loss': 1.1615, 'grad_norm': 3.875983953475952, 'learning_rate': 1.3008130081300815e-05, 'epoch': 0.13071895424836602}                                                   
{'loss': 1.1675, 'grad_norm': 4.254099369049072, 'learning_rate': 1.6260162601626018e-05, 'epoch': 0.16339869281045752}                                                   
[INFO|trainer.py:3351] 2024-04-29 10:37:40,150 >> Saving model checkpoint to ./Llama-3-instruct-dansk/checkpoint-100
[INFO|configuration_utils.py:471] 2024-04-29 10:37:40,153 >> Configuration saved in ./Llama-3-instruct-dansk/checkpoint-100/config.json
[INFO|configuration_utils.py:705] 2024-04-29 10:37:40,153 >> Configuration saved in ./Llama-3-instruct-dansk/checkpoint-100/generation_config.json
[2024-04-29 10:37:48,005] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 17528 closing signal SIGTERM
[2024-04-29 10:37:48,005] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 17529 closing signal SIGTERM
[2024-04-29 10:37:48,005] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 17530 closing signal SIGTERM
[2024-04-29 10:37:51,589] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 17527) of binary: /opt/conda/envs/pytorch/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    deepspeed_launcher(args)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-29_10:37:48
  host      : ip-142-222-58-21.eu-central-1.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 17527)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 17527
============================================================

Expected behavior

The model checkpoint is saved during training (at each save step) without the process being killed.

@muellerzr
Collaborator

cc @pacman100

@emillykkejensen
Author

emillykkejensen commented Apr 30, 2024

So it's due to system memory running out: when the checkpoint is written, RAM is exhausted and the process gets killed.

Setting offload_param_device to none didn't help, but setting zero3_init_flag to false did. Running the above with zero3_init_flag: false apparently frees just enough RAM to be able to save the model. The working config is below, followed by a sketch of why the save step needs so much host memory.

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
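
For what it's worth, my working theory (an assumption based on the Accelerate/DeepSpeed docs, not something I've verified in the Trainer source) is that with zero3_save_16bit_model: true the full bf16 state dict (~16 GB for an 8B model) is gathered onto the main process at save time, on top of the CPU-offloaded optimizer state, and that is what pushes system RAM over the edge. The equivalent manual save in a no_trainer-style script looks roughly like this (save_zero3_checkpoint is a hypothetical helper, not an Accelerate API):

```python
from accelerate import Accelerator
from transformers import PreTrainedModel


def save_zero3_checkpoint(accelerator: Accelerator, model: PreTrainedModel, output_dir: str) -> None:
    """Sketch of a ZeRO-3-aware save for a no_trainer-style loop (hypothetical helper)."""
    # Consolidate the sharded 16-bit weights onto the main process.
    # For an 8B model in bf16 this needs roughly 16 GB of extra host memory.
    state_dict = accelerator.get_state_dict(model)

    # Write the checkpoint from the main process only, using the gathered weights.
    accelerator.unwrap_model(model).save_pretrained(
        output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=state_dict,
    )
```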

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 6, 2024