Temporal Shift for Speech Emotion Recognition [arXiv]

Code for ICASSP 2023 paper "Mingling or Misalignment? Temporal Shift for Speech Emotion Recognition with Pre-trained Representations".

Libraries and Dependencies

pytorch
torchaudio (For pretrained wav2vec2 and HuBERT)
transformers (For SpecAugment)
timm (For optimizer and scheduler)
yacs (For config)

Data Preparation

Download our preprocessed wavfeature_7.5.tar.gz directly and extract it into the /dataset/IEMOCAP directory.

Or obtain IEMOCAP from USC and run

cd dataset/IEMOCAP
python Preprocess.py --path $path to directory of IEMOCAP$
cd ../..

You will then have wavfeature_7.5.pkl, in which each processed audio clip is clipped to 7.5s and sampled at 16kHz.
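As a rough illustration of what "clipped to 7.5s at 16kHz" means for the length of each example, here is a minimal plain-Python sketch. The function name `clip_or_pad` is hypothetical, and zero-padding short clips is an assumption; the actual behaviour of Preprocess.py may differ.

```python
SAMPLE_RATE = 16_000   # Hz, as stated above
CLIP_SECONDS = 7.5     # clip length, as stated above


def clip_or_pad(samples, sample_rate=SAMPLE_RATE, seconds=CLIP_SECONDS):
    """Return exactly sample_rate * seconds samples.

    Longer inputs are truncated. Shorter inputs are zero-padded at the end
    (the padding strategy is an assumption, not taken from Preprocess.py).
    """
    target_len = int(sample_rate * seconds)  # 16000 * 7.5 = 120000 samples
    clipped = list(samples[:target_len])
    return clipped + [0.0] * (target_len - len(clipped))


long_clip = clip_or_pad([0.1] * 200_000)   # 12.5 s of audio, gets truncated
short_clip = clip_or_pad([0.1] * 40_000)   # 2.5 s of audio, gets padded
print(len(long_clip), len(short_clip))     # 120000 120000
```

Under these assumptions every example has 120000 samples.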

Training

We train the model specified in our paper with the same placement/proportion of shift. It should be noted that the placement/proportion of shift and other hyperparameters (see config.py) can be adjusted flexibly.

Key arguments for easy config modification in main.py are as follows:

--model: the model chosen for training.
- cnn: convnext-like 1D CNN with 2 blocks.
- rnn: 1-layer bidirectional LSTM.
- transformer: 2-block Transformer with relative positional embedding by default.
--shift: whether to use temporal shift. The placement of the shift is hard-coded as described in our paper: residual shift for CNN, replacement of MHSA for Transformer, and in-place shift for LSTM. Optional placements are provided in the code comments.
--ndiv: proportion of shift, namely 1/ndiv of channels will be shifted while the others remain unchanged.
--stride: the step to be shifted, kept as 1 by default.
--bidirectional: whether to use bidirectional temporal shift.
--finetune: whether to fine-tune the pretrained model or use it as a fixed feature extractor. By default, we use wav2vec2 for fine-tuning and HuBERT for feature extraction.
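To make the roles of --ndiv, --stride and --bidirectional concrete, the following is a framework-free sketch of a temporal shift on a sequence stored as nested lists (T time steps, each with C channels). It is not the repository's PyTorch implementation. The function name `temporal_shift`, the zero-padding of vacated positions, and the way the shifted group is split in half in bidirectional mode are all assumptions made for illustration.

```python
def temporal_shift(frames, ndiv=4, stride=1, bidirectional=False):
    """Shift 1/ndiv of the channels along the time axis, zero-filling the gap.

    frames: list of T time steps, each a list of C channel values.
    """
    num_steps = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // ndiv  # number of channels that get shifted
    # Assumption: in bidirectional mode the shifted group is split in half,
    # the first half reading from the past, the second half from the future.
    forward = fold // 2 if bidirectional else fold

    out = [list(step) for step in frames]  # unshifted channels are copied as-is
    for t in range(num_steps):
        for c in range(fold):
            # Channels [0, forward) read from t - stride, the rest from t + stride.
            src = t - stride if c < forward else t + stride
            out[t][c] = frames[src][c] if 0 <= src < num_steps else 0.0
    return out


frames = [
    [1, 10, 100, 1000],
    [2, 20, 200, 2000],
    [3, 30, 300, 3000],
]
one_way = temporal_shift(frames, ndiv=4)                      # shifts 1 of 4 channels
two_way = temporal_shift(frames, ndiv=2, bidirectional=True)  # shifts 2 of 4 channels
```

In `one_way`, only the first channel is delayed by one step, so each frame sees that channel's value from the previous frame. In `two_way`, the first channel is delayed and the second is advanced, while the remaining channels are untouched.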

For CNN and ShiftCNN

# Feature extraction for basic convnext
python main.py --model cnn
# Finetuning for basic convnext
python main.py --model cnn --finetune
# Feature extraction for ShiftCNN
python main.py --model cnn --shift --ndiv 16
# Finetuning for ShiftCNN
python main.py --model cnn --shift --ndiv 16 --finetune

For Transformer and Shiftformer

# Feature extraction for transformer
python main.py --model transformer
# Finetuning for transformer
python main.py --model transformer --finetune
# Feature extraction for Shiftformer
python main.py --model transformer --shift --ndiv 4 --bidirectional
# Finetuning for Shiftformer
python main.py --model transformer --shift --ndiv 4 --bidirectional --finetune

For LSTM and ShiftLSTM

# Feature extraction for LSTM
python main.py --model rnn
# Finetuning for lstm
python main.py --model rnn --finetune
# Feature extraction for ShiftLSTM
python main.py --model rnn --shift --ndiv 4
# Finetuning for Shiftlstm
python main.py --model rnn --shift --ndiv 4 --finetune
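If you want to run every configuration listed above in one go, a small driver script can generate the command lines. The sketch below only builds and prints them. The helper name `build_commands` is hypothetical and not part of the repository. The flag values are copied from the commands above.

```python
# Shift settings copied from the commands above: model name -> extra flags.
SHIFT_FLAGS = {
    "cnn": ["--shift", "--ndiv", "16"],
    "transformer": ["--shift", "--ndiv", "4", "--bidirectional"],
    "rnn": ["--shift", "--ndiv", "4"],
}


def build_commands():
    """Return 12 commands: 3 models x (plain, shift) x (extraction, fine-tuning)."""
    commands = []
    for model, shift_flags in SHIFT_FLAGS.items():
        for use_shift in (False, True):
            for finetune in (False, True):
                cmd = ["python", "main.py", "--model", model]
                if use_shift:
                    cmd += shift_flags
                if finetune:
                    cmd.append("--finetune")
                commands.append(cmd)
    return commands


for cmd in build_commands():
    print(" ".join(cmd))
    # To actually launch a run, one could call subprocess.run(cmd, check=True)
    # from the repository root; it is left out so this sketch has no side effects.
```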

The final results of the 5 folds are expected to be in /log.

Citation

@inproceedings{shen2023mingling,
  title={Mingling or misalignment? temporal shift for speech emotion recognition with pre-trained representations},
  author={Shen, Siyuan and Liu, Feng and Zhou, Aimin},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
