ESC: High-Fidelity Speech Coding with Efficient Cross-Scale Vector Quantized Transformers

[arXiv] This is the code repository for ESC, the neural speech codec presented in the paper "ESC: High-Fidelity Speech Coding with Efficient Cross-Scale Vector Quantized Transformers".

Our neural speech codec, at only 30MB, efficiently compresses 16kHz speech to 1.5, 3, 4.5, 6, 7.5 and 9kbps while maintaining reconstruction quality comparable to Descript's audio codec.
We provide Model Checkpoints and a Demo Page.

Usage

Install Dev Dependencies

pip install -r requirements.txt

To compress and decompress audio

python -m scripts.compress --input /path/to/input.wav --save_path /path/to/output --model_path /path/to/model --num_streams 6 --device cpu

This creates a .pth file (the codes) and a .wav file (the reconstructed audio) under save_path. The codec supports num_streams from 1 to 6, corresponding to bitrates from 1.5 to 9.0kbps in 1.5kbps steps.
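The mapping between num_streams and bitrate can be sketched in a few lines. This is only an illustration, assuming each stream contributes a fixed 1.5kbps (which is consistent with the six bitrates listed above). The helper names streams_to_kbps and payload_bytes are hypothetical and not part of this repository, and the size estimate ignores any overhead of the saved .pth file.

```python
KBPS_PER_STREAM = 1.5  # assumption: 1..6 streams map to 1.5, 3, ..., 9 kbps
MAX_STREAMS = 6


def streams_to_kbps(num_streams: int) -> float:
    """Return the nominal bitrate in kbps for a given number of streams."""
    if not 1 <= num_streams <= MAX_STREAMS:
        raise ValueError(f"num_streams must be in [1, {MAX_STREAMS}], got {num_streams}")
    return KBPS_PER_STREAM * num_streams


def payload_bytes(num_streams: int, duration_s: float) -> float:
    """Approximate size of the codes for a clip, excluding file overhead."""
    bits = streams_to_kbps(num_streams) * 1000 * duration_s
    return bits / 8


if __name__ == "__main__":
    for n in range(1, MAX_STREAMS + 1):
        print(n, streams_to_kbps(n), payload_bytes(n, duration_s=3.0))
```

Under these assumptions, a 3-second clip at 6 streams needs roughly 3.4kB of codes.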

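import torch
import torchaudio
from models import ESC

# `config` must hold the ESC constructor arguments that match the checkpoint
model = ESC(**config)
model.load_state_dict(
    torch.load("model.pth", map_location="cpu")["model_state_dict"],
)
model = model.to("cuda")
# the codec targets 16kHz speech
x, _ = torchaudio.load("input.wav")
# Tensor.to is not in-place, so keep the returned tensor
x = x.to("cuda")
# encode to codes
codes, pshape = model.encode(x, num_streams=6)
# decode to audio
recon_x = model.decode(codes, pshape)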

The snippet above shows programmatic usage of ESC: it loads a checkpoint, reads an audio file into a tensor with torchaudio, and encodes and decodes that tensor.

Training

We provide our development training and evaluation datasets on Hugging Face.

accelerate launch main.py --exp_name esc9kbps --config_path ./configs/9kbps_final.yaml --wandb_project efficient-speech-codec --lr 1.0e-4 --num_epochs 80 --num_pretraining_epochs 15 --num_devices 4 --dropout_rate 0.75 --save_path /path/to/output --seed 53

We use the accelerate library to handle distributed training and the wandb library for logging. With 4 NVIDIA RTX 4090 GPUs, training an ESC codec takes ~12h for 250k training steps on 180k 3-second audio clips with a batch size of 36. For detailed configurations, please refer to the ./configs/ folder.
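As a rough sanity check on these figures, the arithmetic can be worked out directly. The numbers are taken from the paragraph above; the variable names are only for illustration, and the throughput is an average over the approximate 12h, not a measured value.

```python
num_clips = 180_000      # training clips
clip_seconds = 3         # length of each clip
train_steps = 250_000    # total optimizer steps
train_hours = 12         # approximate wall-clock time on 4 GPUs

# Total amount of training audio, in hours
dataset_hours = num_clips * clip_seconds / 3600

# Average training throughput implied by the reported wall-clock time
steps_per_second = train_steps / (train_hours * 3600)

print(f"{dataset_hours:.0f} h of audio, ~{steps_per_second:.1f} steps/s")
```

This gives about 150 hours of training audio and an average of roughly 5.8 steps per second.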

Evaluation

python -m scripts.test --eval_folder_path path/to/data --batch_size 12 --model_path /path/to/model --device cuda

This runs codec evaluation at all bandwidths on a test set folder. We report four metrics: PESQ, Mel Distance, SI-SDR and Bitrate Utilization Rate. The evaluation statistics are saved into model_path by default.
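For readers unfamiliar with SI-SDR, the following is a minimal sketch of its commonly used definition, written in plain Python for clarity. It is not the implementation used by scripts.test, which may differ (for example by operating on batched tensors). The function name si_sdr and the eps value are assumptions made for this illustration.

```python
import math


def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")

    # Remove the mean of each signal so a DC offset does not affect the score
    ref_mean = sum(reference) / len(reference)
    est_mean = sum(estimate) / len(estimate)
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]

    # Project the estimate onto the reference to get the scaled target
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Whatever is left after removing the target counts as distortion
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10 * math.log10((target_energy + eps) / (noise_energy + eps))


if __name__ == "__main__":
    clean = [math.sin(0.1 * n) for n in range(100)]
    noisy = [c + 0.1 * math.cos(0.37 * n) for n, c in enumerate(clean)]
    print(si_sdr(clean, noisy))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the reconstructed audio does not change the score, which is why the metric is called scale-invariant.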

Results

We provide a performance comparison with Descript's audio codec (DAC) at different scales of model sizes.
