Zero progress at training #1090

sapozhnikov · 2024-02-13T20:03:57Z

Describe the bug

I am trying to run SVC locally with GPU acceleration on a Radeon 5700 XT. Training gets stuck at the very beginning: the GPU runs at 100%, but hours later the output still shows only 'Epoch 0/9999'.

To Reproduce

Installed into a fresh Conda environment, as described. Python 3.10.13

python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6
pip install -U so-vits-svc-fork

Additional context

radeontop shows 100% on the 'Graphics pipe' and 'Shader Interpolator' bars during the hubert and train stages.
I tried different versions of PyTorch and got the same behavior with the latest one. Older versions fail to load the model, I think.
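
One way to tell a hung GPU kernel apart from merely slow training is to run a tiny GPU operation in a child process with a timeout. This is a minimal sketch, not part of so-vits-svc-fork: `PROBE` and `run_probe` are hypothetical names, and the 60-second limit is an arbitrary assumption.

```python
import subprocess
import sys

# Hypothetical probe script: create a small tensor on the GPU, multiply it,
# and wait for the kernel to finish. ROCm builds of PyTorch expose the GPU
# through the torch.cuda API, so "cuda" is the device name here as well.
PROBE = """
import torch
print("torch", torch.__version__, "gpu available:", torch.cuda.is_available())
x = torch.randn(256, 256, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("matmul ok:", tuple(y.shape))
"""


def run_probe(code: str = PROBE, timeout_s: float = 60.0) -> tuple[str, str]:
    """Run `code` in a fresh interpreter and return (status, output).

    status is "ok", "error" (non-zero exit code), or "timeout".
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, (proc.stdout + proc.stderr).strip()


if __name__ == "__main__":
    status, output = run_probe()
    print(status)
    print(output)
```

A "timeout" result on such a small matrix multiply would suggest the kernel itself never completes, rather than the training loop being slow.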

Added to ~/.bashrc

export ROCM_PATH=/opt/rocm
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export PYTORCH_ROCM_ARCH="gfx1010"
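
As a cross-check on the variables above: `HSA_OVERRIDE_GFX_VERSION=10.3.0` asks the ROCm runtime to treat the card as gfx1030, while `PYTORCH_ROCM_ARCH` names gfx1010. As far as I know, `PYTORCH_ROCM_ARCH` is read when building PyTorch or its extensions from source, so it probably has no effect on a prebuilt wheel. The sketch below prints what the current environment actually exports; `override_to_gfx` and `report` are hypothetical helper names, and the version-to-target conversion assumes the usual gfx naming scheme.

```python
import os


def override_to_gfx(version: str) -> str:
    """Convert an override such as '10.3.0' to a target name such as 'gfx1030'.

    Assumes the usual naming scheme: decimal major version, then minor and
    stepping as single hexadecimal digits.
    """
    major, minor, stepping = (int(part) for part in version.split("."))
    return f"gfx{major}{minor:x}{stepping:x}"


def report(env=os.environ) -> list[str]:
    """Collect the ROCm-related variables from this issue as printable lines."""
    lines = []
    for name in ("ROCM_PATH", "HSA_OVERRIDE_GFX_VERSION", "PYTORCH_ROCM_ARCH"):
        lines.append(f"{name}={env.get(name, '<unset>')}")
    override = env.get("HSA_OVERRIDE_GFX_VERSION")
    if override:
        lines.append(f"override presents the GPU as {override_to_gfx(override)}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Running this inside the same shell that launches `svc` shows whether the `~/.bashrc` exports are visible to the training process.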

System:
Kernel: 6.7.4-arch1-1 arch: x86_64 bits: 64 compiler: gcc v: 13.2.1 clocksource: tsc
Desktop: Cinnamon v: 6.0.4 tk: GTK v: 3.24.41 wm: Muffin v: 6.0.1 vt: 7 dm: LightDM v: 1.32.0
Distro: EndeavourOS base: Arch Linux
CPU:
Info: 14-core model: Intel Xeon E5-2680 v4 bits: 64 type: MT MCP smt: enabled arch: Broadwell
rev: 1 cache: L1: 896 KiB L2: 3.5 MiB L3: 35 MiB
Graphics:
Device-1: AMD Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] driver: amdgpu v: kernel
arch: RDNA-1 pcie: speed: 16 GT/s lanes: 16 ports: active: HDMI-A-1 empty: DP-1,DP-2,DP-3
bus-ID: 05:00.0 chip-ID: 1002:731f class-ID: 0300
Info:
Memory: total: 32 GiB note: est. available: 31.2 GiB used: 3.16 GiB (10.1%)

svc output:

[22:59:42] INFO [22:59:42] Using strategy: auto train.py:98
INFO: GPU available: True (cuda), used: True
INFO [22:59:42] GPU available: True (cuda), used: True rank_zero.py:64
INFO: TPU available: False, using: 0 TPU cores
INFO [22:59:42] TPU available: False, using: 0 TPU cores rank_zero.py:64
INFO: IPU available: False, using: 0 IPUs
INFO [22:59:42] IPU available: False, using: 0 IPUs rank_zero.py:64
INFO: HPU available: False, using: 0 HPUs
INFO [22:59:42] HPU available: False, using: 0 HPUs rank_zero.py:64
WARNING [22:59:42] warnings.py:109
/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/so_vits_svc_fork/modules/synthesizers.py:8
1: UserWarning: Unused arguments: {'n_layers_q': 3, 'use_spectral_norm': False, 'pretrained': {'D_0.pth':
'https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/c
lean_D_320000.pth', 'G_0.pth':
'https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/c
lean_G_320000.pth'}}
warnings.warn(f"Unused arguments: {kwargs}")
       INFO     [22:59:42] Decoder type: hifi-gan                                                                       synthesizers.py:100
       WARNING  [22:59:42]                                                                                                  warnings.py:109
                /home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:28:                         
                UserWarning: torch.nn.utils.weight_norm is deprecated in favor of                                                          
                torch.nn.utils.parametrizations.weight_norm.                                                                               
                  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of                                                      
                torch.nn.utils.parametrizations.weight_norm.")
[22:59:44] WARNING [22:59:44] /home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/so_vits_svc_fork/utils.py:246: warnings.py:109
UserWarning: Keys not found in checkpoint state dict:['emb_g.weight']
warnings.warn(f"Keys not found in checkpoint state dict:" f"{not_in_from}")
       WARNING  [22:59:44] /home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/so_vits_svc_fork/utils.py:264:  warnings.py:109
                UserWarning: Shape mismatch: ['dec.cond.weight: torch.Size([512, 256, 1]) -> torch.Size([512, 768, 1])',                   
                'enc_q.enc.cond_layer.weight_v: torch.Size([6144, 256, 1]) -> torch.Size([6144, 768, 1])',                                 
                'flow.flows.0.enc.cond_layer.weight_v: torch.Size([1536, 256, 1]) -> torch.Size([1536, 768, 1])',                          
                'flow.flows.2.enc.cond_layer.weight_v: torch.Size([1536, 256, 1]) -> torch.Size([1536, 768, 1])',                          
                'flow.flows.4.enc.cond_layer.weight_v: torch.Size([1536, 256, 1]) -> torch.Size([1536, 768, 1])',                          
                'flow.flows.6.enc.cond_layer.weight_v: torch.Size([1536, 256, 1]) -> torch.Size([1536, 768, 1])',                          
                'f0_decoder.cond.weight: torch.Size([192, 256, 1]) -> torch.Size([192, 768, 1])']                                          
                  warnings.warn(                                                                                                           
                                                                                                                                           
       INFO     [22:59:44] Loaded checkpoint 'logs/44k/G_0.pth' (epoch 0)                                                      utils.py:307
       INFO     [22:59:44] Loaded checkpoint 'logs/44k/D_0.pth' (epoch 0)                                                      utils.py:307
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO [22:59:44] LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] cuda.py:61
┏━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ ┃ Name ┃ Type ┃ Params ┃
┡━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ net_g │ SynthesizerTrn │ 45.6 M │
│ 1 │ net_d │ MultiPeriodDiscriminator │ 46.7 M │
└───┴───────┴──────────────────────────┴────────┘
Trainable params: 92.4 M
Non-trainable params: 0
Total params: 92.4 M
Total estimated model params size (MB): 369
WARNING [22:59:44] warnings.py:109
/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_
connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider
increasing the value of the `num_workers` argument to `num_workers=27` in the `DataLoader` to improve
performance.

[22:59:45] INFO [22:59:45] Setting current epoch to 0 train.py:311
INFO [22:59:45] Setting total batch idx to 0 train.py:327
INFO [22:59:45] Setting global step to 0 train.py:317
Epoch 0/9999 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/156 0:00:00 • -:--:-- 0.00it/s v_num: 0.000

Version

4.1.47

Platform

EndeavourOS (Arch Linux)

Code of Conduct

I agree to follow this project's Code of Conduct.

No Duplicate

I have checked existing issues to avoid duplicates.

sapozhnikov · 2024-02-13T23:01:15Z

Well, it looks like PyTorch doesn't work with my GPU. The last compatible version of ROCm was 5.2, which doesn't work with SVC and produces:

INFO [02:10:15] Decoder type: hifi-gan synthesizers.py:100
Traceback (most recent call last):
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/so_vits_svc_fork/train.py", line 347, in load
_, _, _, epoch = utils.load_checkpoint(
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/so_vits_svc_fork/utils.py", line 288, in load_checkpoint
checkpoint_dict = torch.load(f, map_location="cpu", weights_only=True)
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
_pickle.UnpicklingError: Weights only load failed. Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution.Do it only if you get the file from a trusted source. WeightsUnpickler error: Unsupported operand 71

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/user01/miniconda3/envs/sovits/bin/svc", line 8, in <module>
sys.exit(cli())
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/so_vits_svc_fork/__main__.py", line 128, in train
train(
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/so_vits_svc_fork/train.py", line 119, in train
model = VitsLightning(reset_optimizer=reset_optimizer, **hparams)
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/so_vits_svc_fork/train.py", line 186, in __init__
self.load(reset_optimizer)
File "/home/user01/miniconda3/envs/sovits/lib/python3.10/site-packages/so_vits_svc_fork/train.py", line 363, in load
raise RuntimeError("Failed to load checkpoint") from e
RuntimeError: Failed to load checkpoint
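
The `Unsupported operand 71` part of the message can be decoded if the number is a raw pickle opcode byte. That is an assumption; the unpickler source for this PyTorch version was not checked here. Under that assumption, the standard library's opcode table identifies it (`opcode_name` is a hypothetical helper name):

```python
import pickletools


def opcode_name(byte_value: int) -> str:
    """Return the pickle opcode name for a raw opcode byte value."""
    char = chr(byte_value)
    for op in pickletools.opcodes:
        # Each entry's `code` is the one-character opcode, e.g. '.' for STOP.
        if op.code == char:
            return op.name
    raise KeyError(f"no pickle opcode with byte value {byte_value}")


if __name__ == "__main__":
    print(opcode_name(71))  # prints BINFLOAT
```

If the assumption holds, the checkpoint contains a plain Python float that the weights-only unpickler in this older PyTorch does not accept, which would be consistent with newer versions loading the same file without this error.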

sapozhnikov added the bug Something isn't working label Feb 13, 2024