This project lets you bulk-transcribe audio files using a cloud provider of your choice. It uses terraform
to create a number of instances and ansible
to configure them and transcribe the files in parallel using whisper.
You should use a cloud provider that supports GPUs. Even on instances with 16 CPUs the transcription process is painfully slow.
Alternatively, you could use a service like Replicate; I still have to check what bulk transcriptions would cost there and compare.
Also, some general remarks can be found in testing.
This project has been tested with the following versions:
- Terraform 1.5.7
- Ansible 9.0.x (ansible-core 2.16.0)
- Python 3.11.6
- OpenStack client 6.0.0
In order to use this project, first create your config files as described in the section below.
Usage: ./whisper-autotranscription.sh [-f CONFIGFILE] [-n NUMVMS] [-m MODE] [-h]
  -f CONFIGFILE  Specify a config file (optional; defaults to config/config.sh)
  -n NUMVMS      Specify the number of VMs to create (optional; defaults to 1)
  -m MODE        Specify the mode, whisper|whisperx (optional; defaults to whisper)
  -h             Display this help message
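The flag handling above can be sketched with POSIX `getopts`. This is a minimal illustration using the defaults from the usage text, not the actual internals of whisper-autotranscription.sh:

```shell
#!/bin/sh
# Minimal sketch of the CLI flag handling described above.
# Defaults mirror the usage text; the real script does more.
parse_args() {
  CONFIGFILE="config/config.sh"  # -f default
  NUMVMS=1                       # -n default
  MODE="whisper"                 # -m default: whisper|whisperx
  OPTIND=1                       # reset so the function can be called repeatedly
  while getopts "f:n:m:h" opt; do
    case "$opt" in
      f) CONFIGFILE="$OPTARG" ;;
      n) NUMVMS="$OPTARG" ;;
      m) MODE="$OPTARG" ;;
      h) echo "Usage: $0 [-f CONFIGFILE] [-n NUMVMS] [-m MODE] [-h]"; return 0 ;;
    esac
  done
}

parse_args -n 4 -m whisperx
echo "$CONFIGFILE $NUMVMS $MODE"   # -> config/config.sh 4 whisperx
```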
There is also the possibility to use whisperX instead of whisper.
For this you need a Hugging Face account and a token.
You will also need to accept the terms and conditions for the speaker-diarization and segmentation models.
whisperX currently only works with .wav files because of a bug in python-soundfile.
Copy config/ansible_secrets.yaml_example
to config/ansible_secrets.yaml
and add your token to config/ansible_secrets.yaml.
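The secrets file might then look like the following. The key name here is an assumption for illustration; use whatever key config/ansible_secrets.yaml_example defines:

```yaml
# config/ansible_secrets.yaml - illustrative only; keep the key name
# from config/ansible_secrets.yaml_example
hf_token: "hf_xxxxxxxxxxxxxxxxxxxx"
```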
Files that need to be processed must be placed in the files_upload
directory ($SRC_DIR).
After transcription, the files are first downloaded to the files_download
directory ($DST_DIR)
and then copied back to the originating directories in files_upload
($SRC_DIR).
If the variable CLEANUP
is set to true,
the files in files_download
will be deleted afterwards.
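The download/copy-back/cleanup flow described above can be sketched roughly like this. It is a simplification under an assumed flat directory layout, not the actual script logic (the real script restores files to their original subdirectories):

```shell
#!/bin/sh
# Sketch of the post-transcription flow: copy results from $DST_DIR
# back to $SRC_DIR, then optionally clean up the downloaded copies.
copy_back() {
  SRC_DIR="$1"; DST_DIR="$2"; CLEANUP="$3"
  for f in "$DST_DIR"/*; do
    [ -e "$f" ] || continue     # skip if the glob matched nothing
    cp "$f" "$SRC_DIR/"         # real script restores original subdirectories
  done
  if [ "$CLEANUP" = "true" ]; then
    rm -f "$DST_DIR"/*          # delete downloaded copies
  fi
}
```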
The file config/config.sh_example needs to be copied to config/config.sh:
cp config/config.sh_example config/config.sh
Adjust the values according to your needs.
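A config/config.sh could then contain something like the following. The values are illustrative assumptions; the authoritative variable set comes from config/config.sh_example:

```shell
# config/config.sh - illustrative values only
SRC_DIR="files_upload"     # where input audio files are placed
DST_DIR="files_download"   # where transcripts are downloaded to
CLEANUP="true"             # delete files_download contents after copy-back
```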
The terraform tfvars file config/variables.tfvars_example needs to be copied to config/variables.tfvars:
cp config/variables.tfvars_example config/variables.tfvars
Adjust the values according to your needs.
The file templates/ansible_vars.yaml_example needs to be copied to templates/ansible_vars.yaml:
cp templates/ansible_vars.yaml_example templates/ansible_vars.yaml
In the templates/ansible_vars.yaml
file you can set the model size as well as the path the files are downloaded to. This path needs to be the same as DST_DIR
in config/config.sh.
DO NOT CHANGE the THREADS variable; the whisper-autotranscription.sh
script sets its value according to instance_type.
Change whisper_parameters
if you want to tune your whisper settings.
instance_threads: THREADS
whisper_model_size: "medium"
whisper_retry_count: 3
whisper_retry_delay: 10
file_directory: "/pathto/whisper-autotranscription/files_download"
whisper_parameters: "--language de --extend_duration 0.1"
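The THREADS placeholder above is filled in by the script at run time. A hypothetical sketch of that substitution follows; the real mapping from instance_type to a thread count lives in whisper-autotranscription.sh:

```shell
#!/bin/sh
# Hypothetical sketch: replace the THREADS placeholder in the template
# with a concrete value derived from the instance type.
render_threads() {
  template="$1"; threads="$2"
  sed "s/^instance_threads: THREADS/instance_threads: $threads/" "$template"
}
```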
If you are unsure which parameters whisper
supports, install whisper on a system and run whisper --help.
The file config/secrets.sh_example needs to be copied to config/secrets.sh:
cp config/secrets.sh_example config/secrets.sh
Edit the file and add the API token(s) of your cloud provider:
DO_TOKEN=
HCLOUD_TOKEN=
LINODE_TOKEN=
OVH_APPLICATION_KEY=
OVH_APPLICATION_SECRET=
OVH_CONSUMER_KEY=
For GCP,
run gcloud auth login
so that terraform can authenticate.
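A small pre-flight check (hypothetical, not part of the project) can source the secrets file and fail early if the token for the chosen provider is empty:

```shell
#!/bin/sh
# Hypothetical pre-flight check: source the secrets file and verify
# that the token variable for the chosen provider is non-empty.
check_token() {
  secrets_file="$1"; var_name="$2"
  . "$secrets_file"                 # load e.g. HCLOUD_TOKEN=..., DO_TOKEN=...
  eval "val=\$$var_name"            # indirect lookup of the named variable
  if [ -z "$val" ]; then
    echo "ERROR: $var_name is not set in $secrets_file" >&2
    return 1
  fi
}
```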
- Provision multiple VMs for parallel processing
- Supported Cloud Providers
  - Hetzner Cloud (mostly used for testing)
  - OVH (GPU)
  - GCP (GPU) (using spot instances)
- Use OpenAI Whisper
- Upload/download files from/to the local filesystem
- Autodetect language
- GPU instance support with Nvidia CUDA
- use whisperX instead of whisper
- more CLI script parameters to reduce the config file mess
- option for maximum number of files to transcribe
- Obsidian audio-notes plugin support
- automatic translation with DeepL to a specified language for transcripts
- upload only files from files_upload that have not been transcribed
- use rclone directly on the remote system without any local files
- automatically create summaries for transcripts
- whisperX diarization (currently not so great)
- Speaker Identification
- Supported Cloud Providers
  - Azure (GPU) (won't implement - feel free to fork)
  - Linode (GPU) (not fully tested since I did not get any GPU instance access) (won't implement - feel free to fork)
  - AWS (GPU) (won't implement - feel free to fork)
- Use DeepL Write API to automatically correct grammar
- Create Cloud Images with Packer for faster deployment
Feel free to fork and open a pull request, either to fix errors or to add functionality.