You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
2024-02-23 03:10:18.814529: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-23 03:10:18.814595: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-23 03:10:18.816594: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-23 03:10:20.392356: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Traceback (most recent call last):
File "/content/Chinese-LLaMA-Alpaca-2-4.1/scripts/training/run_clm_pt_with_peft.py", line 720, in
main()
File "/content/Chinese-LLaMA-Alpaca-2-4.1/scripts/training/run_clm_pt_with_peft.py", line 375, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/usr/local/lib/python3.10/dist-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 129, in init
File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1442, in post_init
and (self.device.type != "cuda")
File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1887, in device
return self._setup_devices
File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 54, in get
cached = self.fget(obj)
File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1813, in _setup_devices
self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 171, in init
assert (
AssertionError: DeepSpeed is not available => install it using pip3 install deepspeed or build it from source
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 60462) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
-
2024-02-23 03:10:18.814529: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-23 03:10:18.814595: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-23 03:10:18.816594: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-23 03:10:20.392356: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Traceback (most recent call last):
File "/content/Chinese-LLaMA-Alpaca-2-4.1/scripts/training/run_clm_pt_with_peft.py", line 720, in
main()
File "/content/Chinese-LLaMA-Alpaca-2-4.1/scripts/training/run_clm_pt_with_peft.py", line 375, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/usr/local/lib/python3.10/dist-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 129, in init
File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1442, in post_init
and (self.device.type != "cuda")
File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1887, in device
return self._setup_devices
File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 54, in get
cached = self.fget(obj)
File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1813, in _setup_devices
self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 171, in init
assert (
AssertionError: DeepSpeed is not available => install it using
pip3 install deepspeed
or build it from sourceERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 60462) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run_clm_pt_with_peft.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-02-23_03:10:28
host : cc2c86357935
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 60462)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Beta Was this translation helpful? Give feedback.
All reactions