Transcribe, translate, diarize, annotate and subtitle audio and video files with Whisper ... fast!
whisply combines faster-whisper, insanely-fast-whisper and batch processing of files (with mixed languages). It also enables speaker detection and annotation via pyannote.
Supported output formats:
- .json
- .txt
- .srt
- .rttm
Requirements:
- FFmpeg
- python3.11
If you want to use a GPU:
- nvidia GPU (CUDA) (see the quick check after this list)
- Metal Performance Shaders (MPS) → Mac M1-M3
If you want to use speaker detection / diarization:
- HuggingFace access token
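If you plan to run on --device gpu, you can confirm that a CUDA-capable GPU is visible before installing. This sketch uses NVIDIA's standard driver utility, which is not part of whisply:
nvidia-smi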
1. Install ffmpeg
--- macOS ---
brew install ffmpeg
--- Linux ---
sudo apt-get update
sudo apt-get install ffmpeg
--- Windows ---
Download a build from https://ffmpeg.org/download.html
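Whichever route you take, you can verify that FFmpeg is on your PATH before continuing; it should print its version:
ffmpeg -version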
2. Clone this repository and change to project folder
git clone https://github.com/th-schmidt/whisply.git
cd whisply
3. Create a Python virtual environment
python3.11 -m venv venv
4. Activate the Python virtual environment
source venv/bin/activate
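Note that the activation script path differs on Windows; in cmd.exe the equivalent is:
venv\Scripts\activate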
5. Install whisply with pip
pip install .
>>> whisply --help
Usage: whisply [OPTIONS]
WHISPLY 🗿 Transcribe, translate, diarize, annotate and subtitle audio and
video files with Whisper ... fast!
Options:
--files PATH Path to file, folder, URL or .list to process.
--output_dir DIRECTORY Folder where transcripts should be saved. Default:
"./transcriptions"
--device [cpu|gpu|mps] Select the computation device: CPU, GPU (nvidia
CUDA), or MPS (Metal Performance Shaders).
--lang TEXT Specifies the language of the file you are
providing (en, de, fr ...). Default: auto-detection.
--detect_speakers Enable speaker diarization to identify and separate
different speakers. Creates .rttm file.
--hf_token TEXT HuggingFace Access token required for speaker
diarization.
--translate Translate transcription to English.
--srt Create .srt subtitles from the transcription.
--txt Create .txt with the transcription.
--config FILE Path to configuration file.
--list_formats List supported audio and video formats.
--verbose Print text chunks during transcription.
--help Show this message and exit.
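As a minimal example of a plain transcription run, the following sketch processes a single local file and writes .srt and .txt output (interview.mp4 is a hypothetical filename):
whisply --files interview.mp4 --lang en --srt --txt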
To use the --detect_speakers option, you need to provide a valid HuggingFace access token by using the --hf_token option. Additionally, you must accept the terms and conditions for both version 3.0 and version 3.1 of the pyannote segmentation model. For detailed instructions, refer to the Requirements section on the pyannote model page on HuggingFace.
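A diarization run might then look like this sketch, where interview.mp4 and the token value are placeholders for your own file and HuggingFace access token:
whisply --files interview.mp4 --detect_speakers --hf_token hf_your_token_here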
You can provide a .json config file by using the --config option, which makes processing more user-friendly. An example config looks like this:
{
"files": "path/to/files",
"output_dir": "./transcriptions",
"device": "cpu",
"lang": null,
"detect_speakers": false,
"hf_token": "Hugging Face Access Token",
"translate": true,
"txt": true,
"srt": false,
"verbose": true
}
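Assuming you saved the configuration above as config.json (a hypothetical filename), a run then only needs the one option:
whisply --config config.json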
Instead of providing a file, folder or URL by using the --files option, you can pass a .list file containing a mix of files, folders and URLs for processing. Example:
cat my_files.list
video_01.mp4
video_02.mp4
./my_files/
https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo
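The list is then passed like any other input:
whisply --files my_files.list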