SpeechFlow is an advanced speech-to-text API that offers exceptional accuracy for businesses of all sizes and industries. With SpeechFlow, users can transcribe audio and video content into text with high precision, making it an ideal solution for companies that need to quickly and accurately convert speech into text for various purposes, such as captioning, transcription, and analysis. With support for multiple languages and dialects, SpeechFlow is a versatile tool that can cater to a wide range of businesses and industries.

Spoken_language_identification

Objective

Spoken Language Identification (LID) is the task of detecting the language spoken in an audio clip by an unknown speaker, regardless of the speaker's gender, age, or manner of speaking. It has numerous applications in speech recognition, multilingual machine translation, and speech-to-speech translation.

Our model currently supports 13 languages: English, Spanish, Italian, French, German, Portuguese, Russian, Turkish, Vietnamese, Indonesian, Chinese, Japanese, and Korean.

Technology

The model uses convolutional and recurrent neural networks trained on roughly two thousand hours of private speech data, with approximately 150 hours of supervised speech per language.


Available models and languages

The figure below shows the accuracy (ACC) breakdown by language on the FLEURS test set using the pretrained model.
The FLEURS dataset can be downloaded here: Downloads

Environment Setup

The models are implemented in TensorFlow. To use all of the functionality of the library, you should have:
tensorflow==2.4.1
tensorflow-gpu==2.4.1
tensorflow-addons==0.15.0
matplotlib==3.5.0
numpy==1.19.5
scikit-learn==1.0.1
librosa==0.8.1
SoundFile==0.10.3.post1
PyYAML==6.0

Download the codebase and open a terminal in the root directory. Make sure Python 3.7 is installed in the current environment, then execute:

pip install -r requirements.txt

Code Implementation

Audio Format

Input WAV files must have a 16 kHz sampling rate, a single channel (mono), and 16-bit signed integer PCM encoding.
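As a quick sanity check before training or inference, the header of a WAV file can be compared against these three requirements. The sketch below uses only Python's standard `wave` module; the function name `check_wav_format` is hypothetical and not part of this repository. Note that `wave` only reads uncompressed PCM, so a non-PCM file raises `wave.Error` instead of returning a list.

```python
import wave

EXPECTED_RATE = 16000      # 16 kHz sampling rate
EXPECTED_CHANNELS = 1      # single channel (mono)
EXPECTED_SAMPLE_WIDTH = 2  # 16-bit samples are 2 bytes wide


def check_wav_format(path):
    """Return a list of format problems found in the WAV header (empty if none).

    Hypothetical helper, not part of the repository.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
    if rate != EXPECTED_RATE:
        problems.append(f"sampling rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"{8 * width}-bit samples, expected {8 * EXPECTED_SAMPLE_WIDTH}-bit")
    return problems
```

An empty list means the header matches the expected format; otherwise each entry names one mismatch.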

Features

As speech features, 80-dimensional log mel-filterbank outputs are computed from a 25 ms window shifted every 10 ms. These log mel-filterbank features are then normalized to have zero mean and unit variance over the training partition of the dataset.
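To make the numbers concrete: at 16 kHz, a 25 ms window is 400 samples and a 10 ms shift is 160 samples. The sketch below works through that arithmetic and shows one way to apply per-dimension normalization with statistics taken from the training partition only. It is not the repository's implementation: the filterbank computation itself is omitted, all function names are hypothetical, and the frame count assumes no padding at the clip edges, which may differ from what the actual code does.

```python
import numpy as np

SAMPLE_RATE = 16000                     # Hz, from the Audio Format section
WIN_LENGTH = SAMPLE_RATE * 25 // 1000   # 25 ms window -> 400 samples
HOP_LENGTH = SAMPLE_RATE * 10 // 1000   # 10 ms shift  -> 160 samples
N_MELS = 80                             # feature dimension per frame


def num_frames(num_samples):
    """Count the full windows that fit in a clip (assumes no edge padding)."""
    if num_samples < WIN_LENGTH:
        return 0
    return 1 + (num_samples - WIN_LENGTH) // HOP_LENGTH


def fit_normalizer(train_features):
    """Compute per-dimension mean and std over the training partition.

    train_features: list of arrays, each shaped (frames, N_MELS).
    """
    stacked = np.concatenate(train_features, axis=0)
    mean = stacked.mean(axis=0)
    std = np.maximum(stacked.std(axis=0), 1e-8)  # guard against division by zero
    return mean, std


def normalize(features, mean, std):
    """Apply the training-set statistics to any (frames, N_MELS) array."""
    return (features - mean) / std
```

The same `mean` and `std` would be reused at inference time, so that test audio is scaled exactly as the training data was.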

Prepare your input data

Prepare your own data before training the model, using the 'data/demo_txt/demo_train.txt' file as a reference for the expected format.

Train model

To get started, edit the 'congfigs/config.yml' file, then simply run this command in the console:

python train.py

This trains the Spoken_language_identification model on the data listed in 'data/demo_txt/demo_train.txt', stores the model in the saved_weights folder, runs inference on 'demo_txt/demo_test.txt', prints the inference results, and saves the average accuracy to a text file.

Inference

A pretrained model is provided in this project. To use it, simply run this command:

python predict_by_pb.py test_audios/chinese.wav

or

python predict_by_weights.py test_audios/chinese.wav

Input audio must meet the requirements in the Audio Format section, as the provided chinese.wav does. If your audio file is not in WAV format (e.g., MP3), you can convert it with ffmpeg by running the following command in your audio directory:

ffmpeg -i audio.mp3 -ab 256k -ar 16000 -ac 1 -f wav audio.wav

If ffmpeg is not installed, install it first (the commands below are for Debian/Ubuntu):

sudo apt-get update
sudo apt-get install ffmpeg
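When many files need converting, the ffmpeg call above can be wrapped in a small Python helper. This is a sketch under stated assumptions, not part of the repository: the function names are hypothetical, and the command differs slightly from the one above. It adds `-c:a pcm_s16le` to request 16-bit signed PCM explicitly and `-y` to overwrite existing output, and it omits `-ab 256k`, because for uncompressed PCM the bitrate is already fixed at 16000 samples/s × 16 bits × 1 channel = 256 kbit/s.

```python
import subprocess


def build_ffmpeg_command(src, dst):
    """Build an ffmpeg command that writes 16 kHz, mono, 16-bit PCM WAV.

    Hypothetical helper, not part of the repository.
    """
    return [
        "ffmpeg",
        "-y",                 # overwrite dst if it already exists
        "-i", src,            # input file (e.g. an MP3)
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single channel
        "-c:a", "pcm_s16le",  # 16-bit signed little-endian PCM
        "-f", "wav",          # WAV container
        dst,
    ]


def convert_to_wav(src, dst):
    """Run ffmpeg; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

For example, `convert_to_wav("audio.mp3", "audio.wav")` produces a file that can then be passed to either prediction script.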

LICENSE

Spoken_language_identification is released under the Apache License, version 2.0. The Apache license is a popular BSD-like license. Spoken_language_identification can be redistributed for free, even for commercial purposes, although you cannot remove the license headers (and under some circumstances, you may have to distribute a license document).