Improvement: `NotFoundError`: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for `best_dev_checkpoint` #2338

wasertech · 2023-01-22T23:56:42Z

I'm trying to optimize my LM, but lm_optimizer.py throws a NotFoundError, seemingly because the environment has CuDNN disabled.

Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.

I want to use my GPU --'

FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:~optuna.trial.Trial.suggest_float instead.

Related?

NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133

I have a bad feeling about this one.

+ python -u /home/trainer/lm_optimizer.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/models/fr/alphabet.txt --scorer_path /mnt/lm/fr/kenlm.scorer --feature_cache /mnt/sources/fr/feature_cache --test_files /mnt/extracted/fr/data/Assistant/train_test.csv --test_batch_size 64 --n_hidden 2048 --lm_alpha_max 2 --lm_beta_max 4 --n_trials 50 --checkpoint_dir /transfer-checkpoint
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
[I 2023-01-22 23:18:04,503] A new study created in memory with name: no-name-0f421b63-297c-468c-b30d-8aa59857a843
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:30: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
  Config.lm_alpha = trial.suggest_uniform("lm_alpha", 0, Config.lm_alpha_max)
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:31: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
  Config.lm_beta = trial.suggest_uniform("lm_beta", 0, Config.lm_beta_max)
I Loading best validating checkpoint from /mnt/checkpoints/best_dev-221133
W Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.
[W 2023-01-22 23:18:05,201] Trial 0 failed with parameters: {'lm_alpha': 0.26985826312830485, 'lm_beta': 1.3371065634850314} because of the following error: NotFoundError().
Traceback (most recent call last):
  File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 121, in _load_checkpoint
    return _load_checkpoint_impl(
  File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 21, in _load_checkpoint_impl
    ckpt = tfv1.train.load_checkpoint(checkpoint_path)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 66, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 873, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 885, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133

To Reproduce
Steps to reproduce the behavior:
Full logs

Expected behavior
A study should start on the GPU for 50 trials.

Environment (please complete the following information): Docker

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Docker
TensorFlow installed from (our builds, or upstream TensorFlow): 22.02-tf1
TensorFlow version (use command below): 22.02-tf1
Python version: 3.8
Bazel version (if compiling from source): 5.0
GCC/Compiler version (if compiling from source): 10
CUDA/cuDNN version: 11.6.0.021
GPU model and memory: RTX 3060 12 GB
Exact command to reproduce: python -u /home/trainer/lm_optimizer.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/models/fr/alphabet.txt --scorer_path /mnt/lm/fr/kenlm.scorer --feature_cache /mnt/sources/fr/feature_cache --test_files /mnt/extracted/fr/data/Assistant/train_test.csv --test_batch_size 64 --n_hidden 2048 --lm_alpha_max 2 --lm_beta_max 4 --n_trials 50 --checkpoint_dir /transfer-checkpoint

Additional context
Built using the Training Wizard for STT


wasertech · 2023-01-22T23:58:19Z

I think /mnt/checkpoints/best_dev-221133 doesn't exist, but I can't seem to find where it comes from... The checkpoint file is in /transfer-checkpoint.

wasertech · 2023-01-23T00:40:46Z

Yes it's /transfer-checkpoint/best_dev_checkpoint pointing to /mnt/checkpoints/best_dev-221133:

# /transfer-checkpoint/best_dev_checkpoint
model_checkpoint_path: "/mnt/checkpoints/best_dev-221133"
all_model_checkpoint_paths: "/mnt/checkpoints/best_dev-221133"
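
A quick way to confirm this kind of mismatch before launching a study is to read the path out of the state file and check whether anything exists at that location. The following is only a minimal sketch using the standard library: `read_checkpoint_prefix` and `checkpoint_exists` are hypothetical helpers (not part of STT), and the sketch assumes the state file uses the `model_checkpoint_path: "..."` line format shown above and that checkpoint data is stored in files named `<prefix>.<suffix>`.

```python
import glob
import re


def read_checkpoint_prefix(state_file):
    """Return the model_checkpoint_path recorded in a state file, or None."""
    with open(state_file, encoding="utf-8") as f:
        for line in f:
            # Anchored at line start, so all_model_checkpoint_paths is skipped.
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"\s*$', line)
            if match:
                return match.group(1)
    return None


def checkpoint_exists(prefix):
    """Return True if at least one file named <prefix>.<suffix> is on disk."""
    # Assumption: checkpoint files share the prefix plus a dotted suffix.
    return bool(glob.glob(glob.escape(prefix) + ".*"))
```

For the setup in this issue, `read_checkpoint_prefix("/transfer-checkpoint/best_dev_checkpoint")` would be expected to return the `/mnt/checkpoints/...` path, and `checkpoint_exists` on that path would then show whether the files are really there.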

lm_optimizer should probably expect tensorflow.python.framework.errors_impl.NotFoundError here:

STT/training/coqui_stt_training/util/lm_optimize.py

Line 39 in a694187

current_samples = evaluate([test_file], create_model)

Or directly when computing results, in main:

STT/training/coqui_stt_training/util/lm_optimize.py

Lines 86 to 93 in a694187

    
    results = compute_lm_optimization()
    print(
        "Best params: lm_alpha={} and lm_beta={} with WER={}".format(
            results.get("lm_alpha"),
            results.get("lm_beta"),
            results.get("wer"),
        )
    )

Something like:

import sys
...
from tensorflow.python.framework.errors_impl import NotFoundError
...
try:
    results = compute_lm_optimization()
    print(
        "Best params: lm_alpha={} and lm_beta={} with WER={}".format(
            results.get("lm_alpha"),
            results.get("lm_beta"),
            results.get("wer"),
        )
    )
except NotFoundError:
    # The state file exists, but the checkpoint it points to cannot be found.
    print("Your checkpoint /transfer-checkpoint/best_dev_checkpoint points to a missing checkpoint file /mnt/checkpoints/best_dev-221133\nMake sure you give a valid --checkpoint_dir path.")
    sys.exit(1)

Note: need to find variables holding /transfer-checkpoint/best_dev_checkpoint and /mnt/checkpoints/best_dev-221133. (filename and checkpoint_path?)
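
Once those two values are available, the hard-coded paths in the message could be replaced by parameters. Below is a rough, self-contained sketch of that idea, not the actual STT code: `run_or_explain` and its arguments are hypothetical names, and the exception type is passed in so the snippet runs without TensorFlow. In lm_optimize.py one would presumably pass the `NotFoundError` class from the traceback above as `not_found_error`.

```python
import sys


def run_or_explain(compute, state_file, checkpoint_path,
                   not_found_error=FileNotFoundError):
    """Run compute(); if the checkpoint is missing, print a hint and exit 1."""
    try:
        return compute()
    except not_found_error:
        # state_file and checkpoint_path are supplied by the caller; where
        # STT keeps these values is still an open question in this thread.
        print(
            "Your checkpoint {} points to a missing checkpoint file {}\n"
            "Make sure you give a valid --checkpoint_dir path.".format(
                state_file, checkpoint_path
            ),
            file=sys.stderr,
        )
        sys.exit(1)
```

Passing the exception class as an argument keeps the wrapper independent of the TensorFlow version in use; any other exception still propagates unchanged.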

wasertech added the bug Something isn't working label Jan 22, 2023

wasertech added the enhancement New feature or request label Jan 23, 2023

wasertech mentioned this issue Feb 1, 2023

Fix best_dev_checkpoint path common-voice/commonvoice-fr#167

Open
