Feature request: Replace Scorer.KenLM with Scorer.Transform #2348

Open
wasertech opened this issue Mar 2, 2023 · 18 comments
Labels
enhancement New feature or request

Comments

@wasertech
Collaborator

wasertech commented Mar 2, 2023

Let's face it: KenLM has served us well, but it has its limitations. It hasn't aged well as a language model architecture.

The first order of business is to compute a bidirectional vector representation of words, so we can go from an audio representation to a character representation.

For example, word2vec lets you take any word and get its vector relative to all the others.

Nowadays we can use a small transformer to achieve this.

Let's train a transformer on the raw output of our acoustic models and teach it to produce an accurate character representation of our spoken words.

This is much smarter than using KenLM and doesn’t need to be more computationally expensive if we scale our transformer accordingly.
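
Very roughly, the idea is a small Transformer encoder that takes the acoustic model's per-frame output and re-decodes it into characters, in the role the KenLM scorer plays today. A minimal PyTorch sketch (all layer sizes and the number of acoustic units are placeholders, not a proposal for the final architecture):

import torch
import torch.nn as nn

class TransformerScorer(nn.Module):
    # Maps per-frame acoustic outputs (e.g. CTC logits) to character logits.
    def __init__(self, n_acoustic_units=29, n_chars=29, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.proj_in = nn.Linear(n_acoustic_units, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, n_chars)

    def forward(self, acoustic_frames):          # (batch, time, n_acoustic_units)
        x = self.proj_in(acoustic_frames)
        x = self.encoder(x)
        return self.proj_out(x)                  # per-frame character logits, trainable with a CTC loss

# e.g.: logits = TransformerScorer()(torch.randn(1, 200, 29))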

@wasertech wasertech added the enhancement New feature or request label Mar 2, 2023
@wasertech
Collaborator Author

wasertech commented Mar 2, 2023

The transformer should be small enough that you can train one from scratch with the amount of data required to build a scorer with KenLM.

This will allow us to load it faster for RTU.

We can use computationally expensive but accurate TTS models (like VCTK/VITS) to produce a baseline audio representation of our LM sentences, feed it to the acoustic model, and train our transformer that way.

This will keep the separation needed between the acoustic model and the scorer.

A good example would be wav2vec2 (Facebook 2020).

https://arxiv.org/pdf/2006.11477.pdf
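
For the TTS-synthesis idea above, a sketch of what the data generation could look like with 🐸TTS and its VCTK/VITS model (the model and speaker names are just examples; the sentence file stands in for whatever you would feed to KenLM today):

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/vctk/vits")
with open("lm_sentences.txt") as f:                        # the scorer sentences
    for i, sentence in enumerate(f):
        tts.tts_to_file(text=sentence.strip(),
                        speaker="p225",                    # any VCTK speaker id
                        file_path=f"synth/{i:08d}.wav")

The synthesized clips would then be run through the frozen acoustic model, and the (acoustic output, sentence) pairs become the training data for the transformer, keeping the acoustic model and the scorer separate.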

@wasertech
Collaborator Author

wasertech commented Mar 2, 2023

Here is my first draft for TranScorerLM!

Transcorer using Wav2Vec2.

❯ transcorer -f 'audio.wav'
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading TranScorer......Took 2.2624789050023537 second(s).
Loading audio.wav......Took 0.00048493899521417916 second(s).
Tokenizing......Took 0.0008750690030865371 second(s).
Decoding speech......Took 0.21528533000673633 second(s).
CAN I TEST YOU
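
Under the hood this draft is essentially Wav2Vec2 CTC inference with the transformers library; a simplified sketch of roughly what the transcorer call above boils down to (an approximation, not the actual script):

import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, _ = librosa.load("audio.wav", sr=16000)                    # resample to the model's 16 kHz
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])                         # e.g. "CAN I TEST YOU"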

Training interface

❯ trainscorer --help
usage: trainscorer [-h] --model_name_or_path MODEL_NAME_OR_PATH [--tokenizer_name_or_path TOKENIZER_NAME_OR_PATH] [--cache_dir CACHE_DIR]
                   [--freeze_feature_encoder [FREEZE_FEATURE_ENCODER]] [--no_freeze_feature_encoder] [--attention_dropout ATTENTION_DROPOUT] [--activation_dropout ACTIVATION_DROPOUT]
                   [--feat_proj_dropout FEAT_PROJ_DROPOUT] [--hidden_dropout HIDDEN_DROPOUT] [--final_dropout FINAL_DROPOUT] [--mask_time_prob MASK_TIME_PROB]
                   [--mask_time_length MASK_TIME_LENGTH] [--mask_feature_prob MASK_FEATURE_PROB] [--mask_feature_length MASK_FEATURE_LENGTH] [--layerdrop LAYERDROP]
                   [--ctc_loss_reduction CTC_LOSS_REDUCTION] --dataset_name DATASET_NAME [--dataset_config_name DATASET_CONFIG_NAME] [--train_split_name TRAIN_SPLIT_NAME]
                   [--eval_split_name EVAL_SPLIT_NAME] [--audio_column_name AUDIO_COLUMN_NAME] [--text_column_name TEXT_COLUMN_NAME] [--overwrite_cache [OVERWRITE_CACHE]]
                   [--preprocessing_num_workers PREPROCESSING_NUM_WORKERS] [--max_train_samples MAX_TRAIN_SAMPLES] [--max_eval_samples MAX_EVAL_SAMPLES]
                   [--chars_to_ignore CHARS_TO_IGNORE [CHARS_TO_IGNORE ...]] [--eval_metrics EVAL_METRICS [EVAL_METRICS ...]] [--max_duration_in_seconds MAX_DURATION_IN_SECONDS]
                   [--min_duration_in_seconds MIN_DURATION_IN_SECONDS] [--preprocessing_only [PREPROCESSING_ONLY]] [--use_auth_token [USE_AUTH_TOKEN]] [--unk_token UNK_TOKEN]
                   [--pad_token PAD_TOKEN] [--word_delimiter_token WORD_DELIMITER_TOKEN] [--phoneme_language PHONEME_LANGUAGE] --output_dir OUTPUT_DIR
                   [--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]] [--do_train [DO_TRAIN]] [--do_eval [DO_EVAL]] [--do_predict [DO_PREDICT]] [--evaluation_strategy {no,steps,epoch}]
                   [--prediction_loss_only [PREDICTION_LOSS_ONLY]] [--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE] [--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE]
                   [--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE] [--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                   [--eval_accumulation_steps EVAL_ACCUMULATION_STEPS] [--eval_delay EVAL_DELAY] [--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY] [--adam_beta1 ADAM_BETA1]
                   [--adam_beta2 ADAM_BETA2] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM] [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS]
                   [--lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}] [--warmup_ratio WARMUP_RATIO] [--warmup_steps WARMUP_STEPS]
                   [--log_level {debug,info,warning,error,critical,passive}] [--log_level_replica {debug,info,warning,error,critical,passive}] [--log_on_each_node [LOG_ON_EACH_NODE]]
                   [--no_log_on_each_node] [--logging_dir LOGGING_DIR] [--logging_strategy {no,steps,epoch}] [--logging_first_step [LOGGING_FIRST_STEP]] [--logging_steps LOGGING_STEPS]
                   [--logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]] [--no_logging_nan_inf_filter] [--save_strategy {no,steps,epoch}] [--save_steps SAVE_STEPS]
                   [--save_total_limit SAVE_TOTAL_LIMIT] [--save_on_each_node [SAVE_ON_EACH_NODE]] [--no_cuda [NO_CUDA]] [--use_mps_device [USE_MPS_DEVICE]] [--seed SEED]
                   [--data_seed DATA_SEED] [--jit_mode_eval [JIT_MODE_EVAL]] [--use_ipex [USE_IPEX]] [--bf16 [BF16]] [--fp16 [FP16]] [--fp16_opt_level FP16_OPT_LEVEL]
                   [--half_precision_backend {auto,cuda_amp,apex,cpu_amp}] [--bf16_full_eval [BF16_FULL_EVAL]] [--fp16_full_eval [FP16_FULL_EVAL]] [--tf32 TF32] [--local_rank LOCAL_RANK]
                   [--xpu_backend {mpi,ccl,gloo}] [--tpu_num_cores TPU_NUM_CORES] [--tpu_metrics_debug [TPU_METRICS_DEBUG]] [--debug DEBUG] [--dataloader_drop_last [DATALOADER_DROP_LAST]]
                   [--eval_steps EVAL_STEPS] [--dataloader_num_workers DATALOADER_NUM_WORKERS] [--past_index PAST_INDEX] [--run_name RUN_NAME] [--disable_tqdm DISABLE_TQDM]
                   [--remove_unused_columns [REMOVE_UNUSED_COLUMNS]] [--no_remove_unused_columns] [--label_names LABEL_NAMES [LABEL_NAMES ...]]
                   [--load_best_model_at_end [LOAD_BEST_MODEL_AT_END]] [--metric_for_best_model METRIC_FOR_BEST_MODEL] [--greater_is_better GREATER_IS_BETTER]
                   [--ignore_data_skip [IGNORE_DATA_SKIP]] [--sharded_ddp SHARDED_DDP] [--fsdp FSDP] [--fsdp_min_num_params FSDP_MIN_NUM_PARAMS]
                   [--fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP] [--deepspeed DEEPSPEED] [--label_smoothing_factor LABEL_SMOOTHING_FACTOR]
                   [--optim {adamw_hf,adamw_torch,adamw_torch_xla,adamw_apex_fused,adafactor,adamw_bnb_8bit,adamw_anyprecision,sgd,adagrad}] [--optim_args OPTIM_ARGS]
                   [--adafactor [ADAFACTOR]] [--group_by_length [GROUP_BY_LENGTH]] [--length_column_name LENGTH_COLUMN_NAME] [--report_to REPORT_TO [REPORT_TO ...]]
                   [--ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS] [--ddp_bucket_cap_mb DDP_BUCKET_CAP_MB] [--dataloader_pin_memory [DATALOADER_PIN_MEMORY]]
                   [--no_dataloader_pin_memory] [--skip_memory_metrics [SKIP_MEMORY_METRICS]] [--no_skip_memory_metrics] [--use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]]
                   [--push_to_hub [PUSH_TO_HUB]] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--hub_model_id HUB_MODEL_ID]
                   [--hub_strategy {end,every_save,checkpoint,all_checkpoints}] [--hub_token HUB_TOKEN] [--hub_private_repo [HUB_PRIVATE_REPO]]
                   [--gradient_checkpointing [GRADIENT_CHECKPOINTING]] [--include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]] [--fp16_backend {auto,cuda_amp,apex,cpu_amp}]
                   [--push_to_hub_model_id PUSH_TO_HUB_MODEL_ID] [--push_to_hub_organization PUSH_TO_HUB_ORGANIZATION] [--push_to_hub_token PUSH_TO_HUB_TOKEN]
                   [--mp_parameters MP_PARAMETERS] [--auto_find_batch_size [AUTO_FIND_BATCH_SIZE]] [--full_determinism [FULL_DETERMINISM]]
                   [--torchdynamo {eager,aot_eager,inductor,nvfuser,aot_nvfuser,aot_cudagraphs,ofi,fx2trt,onnxrt,ipex}] [--ray_scope RAY_SCOPE] [--ddp_timeout DDP_TIMEOUT]
                   [--torch_compile [TORCH_COMPILE]] [--torch_compile_backend {eager,aot_eager,inductor,nvfuser,aot_nvfuser,aot_cudagraphs,ofi,fx2trt,onnxrt,ipex}]
                   [--torch_compile_mode {default,reduce-overhead,max-autotune}]

options:
  -h, --help            show this help message and exit
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to pretrained model or model identifier from huggingface.co/models (default: None)
  --tokenizer_name_or_path TOKENIZER_NAME_OR_PATH
                        Path to pretrained tokenizer or tokenizer identifier from huggingface.co/models (default: None)
  --cache_dir CACHE_DIR
                        Where do you want to store the pretrained models downloaded from huggingface.co (default: None)
  --freeze_feature_encoder [FREEZE_FEATURE_ENCODER]
                        Whether to freeze the feature encoder layers of the model. (default: True)
  --no_freeze_feature_encoder
                        Whether to freeze the feature encoder layers of the model. (default: False)
  --attention_dropout ATTENTION_DROPOUT
                        The dropout ratio for the attention probabilities. (default: 0.0)
  --activation_dropout ACTIVATION_DROPOUT
                        The dropout ratio for activations inside the fully connected layer. (default: 0.0)
  --feat_proj_dropout FEAT_PROJ_DROPOUT
                        The dropout ratio for the projected features. (default: 0.0)
  --hidden_dropout HIDDEN_DROPOUT
                        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. (default: 0.0)
  --final_dropout FINAL_DROPOUT
                        The dropout probability for the final projection layer. (default: 0.0)
  --mask_time_prob MASK_TIME_PROB
                        Probability of each feature vector along the time axis to be chosen as the start of the vectorspan to be masked. Approximately ``mask_time_prob * sequence_length
                        // mask_time_length`` featurevectors will be masked along the time axis. (default: 0.05)
  --mask_time_length MASK_TIME_LENGTH
                        Length of vector span to mask along the time axis. (default: 10)
  --mask_feature_prob MASK_FEATURE_PROB
                        Probability of each feature vector along the feature axis to be chosen as the start of the vectorspan to be masked. Approximately ``mask_feature_prob *
                        sequence_length // mask_feature_length`` feature bins will be masked along the time axis. (default: 0.0)
  --mask_feature_length MASK_FEATURE_LENGTH
                        Length of vector span to mask along the feature axis. (default: 10)
  --layerdrop LAYERDROP
                        The LayerDrop probability. (default: 0.0)
  --ctc_loss_reduction CTC_LOSS_REDUCTION
                        The way the ctc loss should be reduced. Should be one of 'mean' or 'sum'. (default: mean)
  --dataset_name DATASET_NAME
                        The configuration name of the dataset to use (via the datasets library). (default: None)
  --dataset_config_name DATASET_CONFIG_NAME
                        The configuration name of the dataset to use (via the datasets library). (default: None)
  --train_split_name TRAIN_SPLIT_NAME
                        The name of the training data set split to use (via the datasets library). Defaults to 'train+validation' (default: train+validation)
  --eval_split_name EVAL_SPLIT_NAME
                        The name of the evaluation data set split to use (via the datasets library). Defaults to 'test' (default: test)
  --audio_column_name AUDIO_COLUMN_NAME
                        The name of the dataset column containing the audio data. Defaults to 'audio' (default: audio)
  --text_column_name TEXT_COLUMN_NAME
                        The name of the dataset column containing the text data. Defaults to 'text' (default: text)
  --overwrite_cache [OVERWRITE_CACHE]
                        Overwrite the cached preprocessed datasets or not. (default: False)
  --preprocessing_num_workers PREPROCESSING_NUM_WORKERS
                        The number of processes to use for the preprocessing. (default: None)
  --max_train_samples MAX_TRAIN_SAMPLES
                        For debugging purposes or quicker training, truncate the number of training examples to this value if set. (default: None)
  --max_eval_samples MAX_EVAL_SAMPLES
                        For debugging purposes or quicker training, truncate the number of validation examples to this value if set. (default: None)
  --chars_to_ignore CHARS_TO_IGNORE [CHARS_TO_IGNORE ...]
                        A list of characters to remove from the transcripts. (default: None)
  --eval_metrics EVAL_METRICS [EVAL_METRICS ...]
                        A list of metrics the model should be evaluated on. E.g. `'wer cer'` (default: ['wer'])
  --max_duration_in_seconds MAX_DURATION_IN_SECONDS
                        Filter audio files that are longer than `max_duration_in_seconds` seconds to 'max_duration_in_seconds` (default: 20.0)
  --min_duration_in_seconds MIN_DURATION_IN_SECONDS
                        Filter audio files that are shorter than `min_duration_in_seconds` seconds (default: 0.0)
  --preprocessing_only [PREPROCESSING_ONLY]
                        Whether to only do data preprocessing and skip training. This is especially useful when data preprocessing errors out in distributed training due to timeout. In
                        this case, one should run the preprocessing in a non-distributed setup with `preprocessing_only=True` so that the cached datasets can consequently be loaded in
                        distributed training (default: False)
  --use_auth_token [USE_AUTH_TOKEN]
                        If :obj:`True`, will use the token generated when running:obj:`huggingface-cli login` as HTTP bearer authorization for remote files. (default: False)
  --unk_token UNK_TOKEN
                        The unk token for the tokenizer (default: [UNK])
  --pad_token PAD_TOKEN
                        The padding token for the tokenizer (default: [PAD])
  --word_delimiter_token WORD_DELIMITER_TOKEN
                        The word delimiter token for the tokenizer (default: |)
  --phoneme_language PHONEME_LANGUAGE
                        The target language that should be used be passed to the tokenizer for tokenization. Note that this is only relevant if the model classifies the input audio to a
                        sequence of phoneme sequences. (default: None)
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written. (default: None)
  --overwrite_output_dir [OVERWRITE_OUTPUT_DIR]
                        Overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory. (default: False)
  --do_train [DO_TRAIN]
                        Whether to run training. (default: False)
  --do_eval [DO_EVAL]   Whether to run eval on the dev set. (default: False)
  --do_predict [DO_PREDICT]
                        Whether to run predictions on the test set. (default: False)
  --evaluation_strategy {no,steps,epoch}
                        The evaluation strategy to use. (default: no)
  --prediction_loss_only [PREDICTION_LOSS_ONLY]
                        When performing evaluation and predictions, only returns the loss. (default: False)
  --per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE
                        Batch size per GPU/TPU core/CPU for training. (default: 8)
  --per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE
                        Batch size per GPU/TPU core/CPU for evaluation. (default: 8)
  --per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE
                        Deprecated, the use of `--per_device_train_batch_size` is preferred. Batch size per GPU/TPU core/CPU for training. (default: None)
  --per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE
                        Deprecated, the use of `--per_device_eval_batch_size` is preferred. Batch size per GPU/TPU core/CPU for evaluation. (default: None)
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of updates steps to accumulate before performing a backward/update pass. (default: 1)
  --eval_accumulation_steps EVAL_ACCUMULATION_STEPS
                        Number of predictions steps to accumulate before moving the tensors to the CPU. (default: None)
  --eval_delay EVAL_DELAY
                        Number of epochs or steps to wait for before the first evaluation can be performed, depending on the evaluation_strategy. (default: 0)
  --learning_rate LEARNING_RATE
                        The initial learning rate for AdamW. (default: 5e-05)
  --weight_decay WEIGHT_DECAY
                        Weight decay for AdamW if we apply some. (default: 0.0)
  --adam_beta1 ADAM_BETA1
                        Beta1 for AdamW optimizer (default: 0.9)
  --adam_beta2 ADAM_BETA2
                        Beta2 for AdamW optimizer (default: 0.999)
  --adam_epsilon ADAM_EPSILON
                        Epsilon for AdamW optimizer. (default: 1e-08)
  --max_grad_norm MAX_GRAD_NORM
                        Max gradient norm. (default: 1.0)
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Total number of training epochs to perform. (default: 3.0)
  --max_steps MAX_STEPS
                        If > 0: set total number of training steps to perform. Override num_train_epochs. (default: -1)
  --lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}
                        The scheduler type to use. (default: linear)
  --warmup_ratio WARMUP_RATIO
                        Linear warmup over warmup_ratio fraction of total steps. (default: 0.0)
  --warmup_steps WARMUP_STEPS
                        Linear warmup over warmup_steps. (default: 0)
  --log_level {debug,info,warning,error,critical,passive}
                        Logger log level to use on the main node. Possible choices are the log levels as strings: 'debug', 'info', 'warning', 'error' and 'critical', plus a 'passive'
                        level which doesn't set anything and lets the application set the level. Defaults to 'passive'. (default: passive)
  --log_level_replica {debug,info,warning,error,critical,passive}
                        Logger log level to use on replica nodes. Same choices and defaults as ``log_level`` (default: passive)
  --log_on_each_node [LOG_ON_EACH_NODE]
                        When doing a multinode distributed training, whether to log once per node or just once on the main node. (default: True)
  --no_log_on_each_node
                        When doing a multinode distributed training, whether to log once per node or just once on the main node. (default: False)
  --logging_dir LOGGING_DIR
                        Tensorboard log dir. (default: None)
  --logging_strategy {no,steps,epoch}
                        The logging strategy to use. (default: steps)
  --logging_first_step [LOGGING_FIRST_STEP]
                        Log the first global_step (default: False)
  --logging_steps LOGGING_STEPS
                        Log every X updates steps. (default: 500)
  --logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]
                        Filter nan and inf losses for logging. (default: True)
  --no_logging_nan_inf_filter
                        Filter nan and inf losses for logging. (default: False)
  --save_strategy {no,steps,epoch}
                        The checkpoint save strategy to use. (default: steps)
  --save_steps SAVE_STEPS
                        Save checkpoint every X updates steps. (default: 500)
  --save_total_limit SAVE_TOTAL_LIMIT
                        Limit the total amount of checkpoints. Deletes the older checkpoints in the output_dir. Default is unlimited checkpoints (default: None)
  --save_on_each_node [SAVE_ON_EACH_NODE]
                        When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on the main one (default: False)
  --no_cuda [NO_CUDA]   Do not use CUDA even when it is available (default: False)
  --use_mps_device [USE_MPS_DEVICE]
                        Whether to use Apple Silicon chip based `mps` device. (default: False)
  --seed SEED           Random seed that will be set at the beginning of training. (default: 42)
  --data_seed DATA_SEED
                        Random seed to be used with data samplers. (default: None)
  --jit_mode_eval [JIT_MODE_EVAL]
                        Whether or not to use PyTorch jit trace for inference (default: False)
  --use_ipex [USE_IPEX]
                        Use Intel extension for PyTorch when it is available, installation: 'https://github.com/intel/intel-extension-for-pytorch' (default: False)
  --bf16 [BF16]         Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA architecture or using CPU (no_cuda). This is an experimental API and it
                        may change. (default: False)
  --fp16 [FP16]         Whether to use fp16 (mixed) precision instead of 32-bit (default: False)
  --fp16_opt_level FP16_OPT_LEVEL
                        For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details at https://nvidia.github.io/apex/amp.html (default: O1)
  --half_precision_backend {auto,cuda_amp,apex,cpu_amp}
                        The backend to be used for half precision. (default: auto)
  --bf16_full_eval [BF16_FULL_EVAL]
                        Whether to use full bfloat16 evaluation instead of 32-bit. This is an experimental API and it may change. (default: False)
  --fp16_full_eval [FP16_FULL_EVAL]
                        Whether to use full float16 evaluation instead of 32-bit (default: False)
  --tf32 TF32           Whether to enable tf32 mode, available in Ampere and newer GPU architectures. This is an experimental API and it may change. (default: None)
  --local_rank LOCAL_RANK
                        For distributed training: local_rank (default: -1)
  --xpu_backend {mpi,ccl,gloo}
                        The backend to be used for distributed training on Intel XPU. (default: None)
  --tpu_num_cores TPU_NUM_CORES
                        TPU: Number of TPU cores (automatically passed by launcher script) (default: None)
  --tpu_metrics_debug [TPU_METRICS_DEBUG]
                        Deprecated, the use of `--debug tpu_metrics_debug` is preferred. TPU: Whether to print debug metrics (default: False)
  --debug DEBUG         Whether or not to enable debug mode. Current options: `underflow_overflow` (Detect underflow and overflow in activations and weights), `tpu_metrics_debug` (print
                        debug metrics on TPU). (default: )
  --dataloader_drop_last [DATALOADER_DROP_LAST]
                        Drop the last incomplete batch if it is not divisible by the batch size. (default: False)
  --eval_steps EVAL_STEPS
                        Run an evaluation every X steps. (default: None)
  --dataloader_num_workers DATALOADER_NUM_WORKERS
                        Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process. (default: 0)
  --past_index PAST_INDEX
                        If >=0, uses the corresponding part of the output as the past state for next step. (default: -1)
  --run_name RUN_NAME   An optional descriptor for the run. Notably used for wandb logging. (default: None)
  --disable_tqdm DISABLE_TQDM
                        Whether or not to disable the tqdm progress bars. (default: None)
  --remove_unused_columns [REMOVE_UNUSED_COLUMNS]
                        Remove columns not required by the model when using an nlp.Dataset. (default: True)
  --no_remove_unused_columns
                        Remove columns not required by the model when using an nlp.Dataset. (default: False)
  --label_names LABEL_NAMES [LABEL_NAMES ...]
                        The list of keys in your dictionary of inputs that correspond to the labels. (default: None)
  --load_best_model_at_end [LOAD_BEST_MODEL_AT_END]
                        Whether or not to load the best model found during training at the end of training. (default: False)
  --metric_for_best_model METRIC_FOR_BEST_MODEL
                        The metric to use to compare two different models. (default: None)
  --greater_is_better GREATER_IS_BETTER
                        Whether the `metric_for_best_model` should be maximized or not. (default: None)
  --ignore_data_skip [IGNORE_DATA_SKIP]
                        When resuming training, whether or not to skip the first epochs and batches to get to the same training data. (default: False)
  --sharded_ddp SHARDED_DDP
                        Whether or not to use sharded DDP training (in distributed training only). The base option should be `simple`, `zero_dp_2` or `zero_dp_3` and you can add CPU-
                        offload to `zero_dp_2` or `zero_dp_3` like this: zero_dp_2 offload` or `zero_dp_3 offload`. You can add auto-wrap to `zero_dp_2` or `zero_dp_3` with the same
                        syntax: zero_dp_2 auto_wrap` or `zero_dp_3 auto_wrap`. (default: )
  --fsdp FSDP           Whether or not to use PyTorch Fully Sharded Data Parallel (FSDP) training (in distributed training only). The base option should be `full_shard`, `shard_grad_op`
                        or `no_shard` and you can add CPU-offload to `full_shard` or `shard_grad_op` like this: full_shard offload` or `shard_grad_op offload`. You can add auto-wrap to
                        `full_shard` or `shard_grad_op` with the same syntax: full_shard auto_wrap` or `shard_grad_op auto_wrap`. (default: )
  --fsdp_min_num_params FSDP_MIN_NUM_PARAMS
                        FSDP's minimum number of parameters for Default Auto Wrapping. (useful only when `fsdp` field is passed). (default: 0)
  --fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP
                        Transformer layer class name (case-sensitive) to wrap ,e.g, `BertLayer`, `GPTJBlock`, `T5Block` .... (useful only when `fsdp` flag is passed). (default: None)
  --deepspeed DEEPSPEED
                        Enable deepspeed and pass the path to deepspeed json config file (e.g. ds_config.json) or an already loaded json file as a dict (default: None)
  --label_smoothing_factor LABEL_SMOOTHING_FACTOR
                        The label smoothing epsilon to apply (zero means no label smoothing). (default: 0.0)
  --optim {adamw_hf,adamw_torch,adamw_torch_xla,adamw_apex_fused,adafactor,adamw_bnb_8bit,adamw_anyprecision,sgd,adagrad}
                        The optimizer to use. (default: adamw_hf)
  --optim_args OPTIM_ARGS
                        Optional arguments to supply to optimizer. (default: None)
  --adafactor [ADAFACTOR]
                        Whether or not to replace AdamW by Adafactor. (default: False)
  --group_by_length [GROUP_BY_LENGTH]
                        Whether or not to group samples of roughly the same length together when batching. (default: False)
  --length_column_name LENGTH_COLUMN_NAME
                        Column name with precomputed lengths to use when grouping by length. (default: length)
  --report_to REPORT_TO [REPORT_TO ...]
                        The list of integrations to report the results and logs to. (default: None)
  --ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS
                        When using distributed training, the value of the flag `find_unused_parameters` passed to `DistributedDataParallel`. (default: None)
  --ddp_bucket_cap_mb DDP_BUCKET_CAP_MB
                        When using distributed training, the value of the flag `bucket_cap_mb` passed to `DistributedDataParallel`. (default: None)
  --dataloader_pin_memory [DATALOADER_PIN_MEMORY]
                        Whether or not to pin memory for DataLoader. (default: True)
  --no_dataloader_pin_memory
                        Whether or not to pin memory for DataLoader. (default: False)
  --skip_memory_metrics [SKIP_MEMORY_METRICS]
                        Whether or not to skip adding of memory profiler reports to metrics. (default: True)
  --no_skip_memory_metrics
                        Whether or not to skip adding of memory profiler reports to metrics. (default: False)
  --use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]
                        Whether or not to use the legacy prediction_loop in the Trainer. (default: False)
  --push_to_hub [PUSH_TO_HUB]
                        Whether or not to upload the trained model to the model hub after training. (default: False)
  --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                        The path to a folder with a valid checkpoint for your model. (default: None)
  --hub_model_id HUB_MODEL_ID
                        The name of the repository to keep in sync with the local `output_dir`. (default: None)
  --hub_strategy {end,every_save,checkpoint,all_checkpoints}
                        The hub strategy to use when `--push_to_hub` is activated. (default: every_save)
  --hub_token HUB_TOKEN
                        The token to use to push to the Model Hub. (default: None)
  --hub_private_repo [HUB_PRIVATE_REPO]
                        Whether the model repository is private or not. (default: False)
  --gradient_checkpointing [GRADIENT_CHECKPOINTING]
                        If True, use gradient checkpointing to save memory at the expense of slower backward pass. (default: False)
  --include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]
                        Whether or not the inputs will be passed to the `compute_metrics` function. (default: False)
  --fp16_backend {auto,cuda_amp,apex,cpu_amp}
                        Deprecated. Use half_precision_backend instead (default: auto)
  --push_to_hub_model_id PUSH_TO_HUB_MODEL_ID
                        The name of the repository to which push the `Trainer`. (default: None)
  --push_to_hub_organization PUSH_TO_HUB_ORGANIZATION
                        The name of the organization in with to which push the `Trainer`. (default: None)
  --push_to_hub_token PUSH_TO_HUB_TOKEN
                        The token to use to push to the Model Hub. (default: None)
  --mp_parameters MP_PARAMETERS
                        Used by the SageMaker launcher to send mp-specific args. Ignored in Trainer (default: )
  --auto_find_batch_size [AUTO_FIND_BATCH_SIZE]
                        Whether to automatically decrease the batch size in half and rerun the training loop again each time a CUDA Out-of-Memory was reached (default: False)
  --full_determinism [FULL_DETERMINISM]
                        Whether to call enable_full_determinism instead of set_seed for reproducibility in distributed training (default: False)
  --torchdynamo {eager,aot_eager,inductor,nvfuser,aot_nvfuser,aot_cudagraphs,ofi,fx2trt,onnxrt,ipex}
                        This argument is deprecated, use `--torch_compile_backend` instead. (default: None)
  --ray_scope RAY_SCOPE
                        The scope to use when doing hyperparameter search with Ray. By default, `"last"` will be used. Ray will then use the last checkpoint of all trials, compare those,
                        and select the best one. However, other options are also available. See the Ray documentation
                        (https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial) for more options. (default: last)
  --ddp_timeout DDP_TIMEOUT
                        Overrides the default timeout for distributed training (value should be given in seconds). (default: 1800)
  --torch_compile [TORCH_COMPILE]
                        If set to `True`, the model will be wrapped in `torch.compile`. (default: False)
  --torch_compile_backend {eager,aot_eager,inductor,nvfuser,aot_nvfuser,aot_cudagraphs,ofi,fx2trt,onnxrt,ipex}
                        Which backend to use with `torch.compile`, passing one will trigger a model compilation. (default: None)
  --torch_compile_mode {default,reduce-overhead,max-autotune}
                        Which mode to use with `torch.compile`, passing one will trigger a model compilation. (default: None)

I wanted to use Coqui's trainer, but it was literally quicker to use transformers.Trainer.

There is still a lot to do but I feel it's a solid jump start.
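
For reference, a rough sketch of the transformers.Trainer wiring (dataset preparation is omitted; train_ds/eval_ds are assumed to be datasets already mapped to input_values and labels, and the collator is a simplified version of the usual CTC padding collator):

from dataclasses import dataclass
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, TrainingArguments, Trainer

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_encoder()                       # mirrors --freeze_feature_encoder above

@dataclass
class CTCCollator:
    processor: Wav2Vec2Processor
    def __call__(self, features):
        # pad audio inputs and label ids independently, then mask label padding for the CTC loss
        audio = [{"input_values": f["input_values"]} for f in features]
        labels = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.feature_extractor.pad(audio, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(labels, padding=True, return_tensors="pt")
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100)
        return batch

args = TrainingArguments(output_dir="transcorer-out", per_device_train_batch_size=8,
                         evaluation_strategy="steps", num_train_epochs=3, fp16=True)
trainer = Trainer(model=model, args=args, data_collator=CTCCollator(processor),
                  train_dataset=train_ds, eval_dataset=eval_ds)    # prepared datasets assumed
trainer.train()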

@ftyers
Contributor

ftyers commented Mar 24, 2023

This would be interesting, but only if it doesn't require further dependencies!

@wasertech
Collaborator Author

wasertech commented Mar 24, 2023

Yes. I'm just going to run a test with the transformers library, training on the data I used for my latest STT model, to compare Wav2Vec2 and STT on the same data. Then, if the results are conclusive, we can spend more time and energy implementing it inside STT ourselves.

@wasertech
Collaborator Author

You would still need dependencies other than KenLM, and the acoustic model itself might not have everything we need, so specific dependencies to train and run inference with the CTC decoder are unavoidable. We probably don't need the transformers or datasets libraries in the long run, but they are already implemented and let me train right now if I want to.

Since I don't dabble that close to the metal, I'm not fluent enough in C and its derivatives to make an end-to-end PR, but I welcome any help towards a better CTC decoder for everyone.

@wasertech
Collaborator Author

wasertech commented May 2, 2023

[screenshot: training run started (image_20230502111955.png)]

I managed to get training started on the French data used for my latest STT models. It will give us a nice way to compare performance improvements.

@wasertech
Collaborator Author

wasertech commented May 7, 2023

Nothing definitive yet but I'm having some nice results on evaluation.

***** Running Evaluation *****
  Num examples = 35556
  Batch size = 16
{'eval_loss': 0.19231227040290833, 'eval_wer': 0.12087098972350516, 'eval_runtime': 1700.0609, 'eval_samples_per_second': 20.915, 'eval_steps_per_second': 1.308, 'epoch': 1.07}

I don't see any signs of convergence yet, so I'll let it keep looking for lower ground. Then I'll compute some metrics on my test sets.

I can tell already it's blowing KenLM out of the water as a CTC decoder for open domain speech.
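
For the test-set metrics, a WER figure like the eval_wer above can be reproduced with the jiwer package (the same library the usual wer metric is built on); the reference and hypothesis lists below are placeholders:

import jiwer

references = ["peux tu me tester"]
hypotheses = ["peux tu tester"]
print(jiwer.wer(references, hypotheses))     # word error rate: 0.25 for this toy pair, ~0.12 on the run above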

@FrontierDK

Nothing definitive yet but I'm having some nice results on evaluation.

***** Running Evaluation *****
  Num examples = 35556
  Batch size = 16
{'eval_loss': 0.19231227040290833, 'eval_wer': 0.12087098972350516, 'eval_runtime': 1700.0609, 'eval_samples_per_second': 20.915, 'eval_steps_per_second': 1.308, 'epoch': 1.07}

I don't see any signs of convergence yet, so I'll let it keep looking for lower ground. Then I'll compute some metrics on my test sets.

I can tell already it's blowing KenLM out of the water as a CTC decoder for open domain speech.

Any chance you would write a walk-through for the alternative scorer, so that more people can use it / create a need? :)

@wasertech
Collaborator Author

wasertech commented May 7, 2023

Any chance you would write a walk-through for the alternative scorer, so that more people can use it / create a need? :)

Most likely, yes, but it's important to understand that Wav2Vec2 is another architecture, not just an alternative implementation of a CTC decoder.

In its current state it doesn't work with STT acoustic models; I had to train new acoustic and language models. I'm sure we could replace KenLM with a similar transformer trained on the raw output of our STT acoustic models so that we don't have to retrain them, but I lack the energy and time to undertake such an upgrade of STT.

For now I just want a nice model for open-domain French transcription. Still, I find myself missing some pretty convenient features available in STT. I am currently reimplementing some of them inside my training container, loosely modeled on what we had in the STT Docker image.

In time we could have a complete pipeline for training, testing and inference so that proper documentation can be written.

P.S.: @FrontierDK If you want more context about fine-tuning Wav2Vec2 on your data, you can check this lovely article.

@FrontierDK

Well, I'm also training from scratch with my own models, and KenLM does seem to improve the recognized text a lot (far fewer errors). But since everything is installed in VMs, I can try anything new without consequences. So please share the knowledge? :)

@fquirin

fquirin commented May 8, 2023

Interesting, is this a similar idea to the neural scorer from Nvidia NeMo?
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html#neural-rescoring

@wasertech
Collaborator Author

Interesting, is this a similar idea to the neural scorer from Nvidia NeMo?
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html#neural-rescoring

Yep, it's really similar. In fact I think Wav2Vec2 also uses a BERT transformer to jump-start the network. In terms of accuracy I think Alphabet's is slightly better, but at the cost of size and inference time. They tend to release bigger models for their massive infrastructure, whereas Meta AI (Facebook R&D) has fewer resources, so they tend to produce smaller yet effective models that can run on our local machines.
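
For comparison, a sketch of that neural-rescoring idea: score each n-best hypothesis from the acoustic model with a causal LM and re-rank (the gpt2 checkpoint and the interpolation weight are placeholders):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def lm_score(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss          # mean negative log-likelihood per token
    return -loss.item() * ids.size(1)            # total log-likelihood (higher is better)

def rescore(nbest, alpha=0.5):
    # nbest: list of (hypothesis_text, acoustic_score) pairs from the beam search
    return max(nbest, key=lambda h: h[1] + alpha * lm_score(h[0]))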

@wasertech
Collaborator Author

wasertech commented May 10, 2023

I have managed to successfully train a small model on 1K hours, with promising results.

wasertech/TranScorerLM#2

I'm now in the process of training on all my datasets. 🤞

@fquirin

fquirin commented May 10, 2023

I had to train new acoustic and language models

So is this technically still Coqui-STT or rather a completely new ASR system? 😅

@wasertech
Collaborator Author

wasertech commented May 10, 2023

I had to train new acoustic and language models

So is this technically still Coqui-STT or rather a completely new ASR system? 😅

it's important to understand that Wav2Vec2 is another architecture, not just an alternative implementation of a CTC decoder [for STT].

Completely new based on the paper I shared above.

https://arxiv.org/pdf/2006.11477.pdf

@fquirin

fquirin commented May 10, 2023

This was the foundation of Meta AI's wav2letter right? (https://github.com/flashlight/wav2letter)

btw maybe you should change the headline of this topic, since it suggests you want to replace the scorer LM of Coqui-STT, but you are actually building a new ASR system 😅

@HarikalarKutusu

My thoughts exactly :)

@wasertech
Collaborator Author

btw maybe you should change the headline of this topic, since it suggests you want to replace the scorer LM of Coqui-STT, but you are actually building a new ASR system 😅

It's a ticket opened for a feature improvement to STT, not a PR. The goal was originally, and still is, to replace KenLM with a transformer network to improve accuracy.

As I said multiple times already, I'm currently only interested in providing actionable metrics that paint a clearer comparison between the two architectures, so as to incentivize skilled people to help me implement it in STT.
