Newcomer asking for help: when training a model, training hangs after printing the log 'During the training process, after the 0th iteration, an evaluation is run every 400 iterations' #12083

Closed
Lijian500 opened this issue May 9, 2024 · 3 comments
@Lijian500

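Problem: When training a model, after the log line 'During the training process, after the 0th iteration, an evaluation is run every 400 iterations' is printed, nothing else happens (no further log output for more than half an hour). How should I troubleshoot this?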

  • System Environment: Windows 11
  • Version: Paddle 2.6.1, PaddleOCR 2.7. Related components:
  • Command Code: python tools/train.py -c pretrain_models/ch_PP-OCRv3_det_cml.yml
  • Complete Error Message:

[2024/05/09 14:55:16] ppocr INFO: Architecture :
[2024/05/09 14:55:16] ppocr INFO: Models :
[2024/05/09 14:55:16] ppocr INFO: Student :
[2024/05/09 14:55:16] ppocr INFO: Backbone :
[2024/05/09 14:55:16] ppocr INFO: disable_se : True
[2024/05/09 14:55:16] ppocr INFO: model_name : large
[2024/05/09 14:55:16] ppocr INFO: name : MobileNetV3
[2024/05/09 14:55:16] ppocr INFO: scale : 0.5
[2024/05/09 14:55:16] ppocr INFO: Head :
[2024/05/09 14:55:16] ppocr INFO: k : 50
[2024/05/09 14:55:16] ppocr INFO: name : DBHead
[2024/05/09 14:55:16] ppocr INFO: Neck :
[2024/05/09 14:55:16] ppocr INFO: name : RSEFPN
[2024/05/09 14:55:16] ppocr INFO: out_channels : 96
[2024/05/09 14:55:16] ppocr INFO: shortcut : True
[2024/05/09 14:55:16] ppocr INFO: Transform : None
[2024/05/09 14:55:16] ppocr INFO: algorithm : DB
[2024/05/09 14:55:16] ppocr INFO: model_type : det
[2024/05/09 14:55:16] ppocr INFO: pretrained : None
[2024/05/09 14:55:16] ppocr INFO: Student2 :
[2024/05/09 14:55:16] ppocr INFO: Backbone :
[2024/05/09 14:55:16] ppocr INFO: disable_se : True
[2024/05/09 14:55:16] ppocr INFO: model_name : large
[2024/05/09 14:55:16] ppocr INFO: name : MobileNetV3
[2024/05/09 14:55:16] ppocr INFO: scale : 0.5
[2024/05/09 14:55:16] ppocr INFO: Head :
[2024/05/09 14:55:16] ppocr INFO: k : 50
[2024/05/09 14:55:16] ppocr INFO: name : DBHead
[2024/05/09 14:55:16] ppocr INFO: Neck :
[2024/05/09 14:55:16] ppocr INFO: name : RSEFPN
[2024/05/09 14:55:16] ppocr INFO: out_channels : 96
[2024/05/09 14:55:16] ppocr INFO: shortcut : True
[2024/05/09 14:55:16] ppocr INFO: Transform : None
[2024/05/09 14:55:16] ppocr INFO: algorithm : DB
[2024/05/09 14:55:16] ppocr INFO: model_type : det
[2024/05/09 14:55:16] ppocr INFO: pretrained : None
[2024/05/09 14:55:16] ppocr INFO: Teacher :
[2024/05/09 14:55:16] ppocr INFO: Backbone :
[2024/05/09 14:55:16] ppocr INFO: in_channels : 3
[2024/05/09 14:55:16] ppocr INFO: layers : 50
[2024/05/09 14:55:16] ppocr INFO: name : ResNet_vd
[2024/05/09 14:55:16] ppocr INFO: Head :
[2024/05/09 14:55:16] ppocr INFO: k : 50
[2024/05/09 14:55:16] ppocr INFO: kernel_list : [7, 2, 2]
[2024/05/09 14:55:16] ppocr INFO: name : DBHead
[2024/05/09 14:55:16] ppocr INFO: Neck :
[2024/05/09 14:55:16] ppocr INFO: name : LKPAN
[2024/05/09 14:55:16] ppocr INFO: out_channels : 256
[2024/05/09 14:55:16] ppocr INFO: algorithm : DB
[2024/05/09 14:55:16] ppocr INFO: freeze_params : True
[2024/05/09 14:55:16] ppocr INFO: model_type : det
[2024/05/09 14:55:16] ppocr INFO: return_all_feats : False
[2024/05/09 14:55:16] ppocr INFO: algorithm : Distillation
[2024/05/09 14:55:16] ppocr INFO: model_type : det
[2024/05/09 14:55:16] ppocr INFO: name : DistillationModel
[2024/05/09 14:55:16] ppocr INFO: Eval :
[2024/05/09 14:55:16] ppocr INFO: dataset :
[2024/05/09 14:55:16] ppocr INFO: data_dir : ./train_data/
[2024/05/09 14:55:16] ppocr INFO: label_file_list : ['./train_data/det/val.txt']
[2024/05/09 14:55:16] ppocr INFO: name : SimpleDataSet
[2024/05/09 14:55:16] ppocr INFO: transforms :
[2024/05/09 14:55:16] ppocr INFO: DecodeImage :
[2024/05/09 14:55:16] ppocr INFO: channel_first : False
[2024/05/09 14:55:16] ppocr INFO: img_mode : BGR
[2024/05/09 14:55:16] ppocr INFO: DetLabelEncode : None
[2024/05/09 14:55:16] ppocr INFO: DetResizeForTest : None
[2024/05/09 14:55:16] ppocr INFO: NormalizeImage :
[2024/05/09 14:55:16] ppocr INFO: mean : [0.485, 0.456, 0.406]
[2024/05/09 14:55:16] ppocr INFO: order : hwc
[2024/05/09 14:55:16] ppocr INFO: scale : 1./255.
[2024/05/09 14:55:16] ppocr INFO: std : [0.229, 0.224, 0.225]
[2024/05/09 14:55:16] ppocr INFO: ToCHWImage : None
[2024/05/09 14:55:16] ppocr INFO: KeepKeys :
[2024/05/09 14:55:16] ppocr INFO: keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
[2024/05/09 14:55:16] ppocr INFO: loader :
[2024/05/09 14:55:16] ppocr INFO: batch_size_per_card : 1
[2024/05/09 14:55:16] ppocr INFO: drop_last : False
[2024/05/09 14:55:16] ppocr INFO: num_workers : 2
[2024/05/09 14:55:16] ppocr INFO: shuffle : False
[2024/05/09 14:55:16] ppocr INFO: Global :
[2024/05/09 14:55:16] ppocr INFO: amp_dtype : bfloat16
[2024/05/09 14:55:16] ppocr INFO: cal_metric_during_train : False
[2024/05/09 14:55:16] ppocr INFO: checkpoints : None
[2024/05/09 14:55:16] ppocr INFO: d2s_train_image_shape : [3, -1, -1]
[2024/05/09 14:55:16] ppocr INFO: debug : False
[2024/05/09 14:55:16] ppocr INFO: distributed : False
[2024/05/09 14:55:16] ppocr INFO: epoch_num : 500
[2024/05/09 14:55:16] ppocr INFO: eval_batch_step : [0, 400]
[2024/05/09 14:55:16] ppocr INFO: infer_img : doc/imgs_en/img_10.jpg
[2024/05/09 14:55:16] ppocr INFO: log_smooth_window : 20
[2024/05/09 14:55:16] ppocr INFO: pretrained_model : ./pretrain_models/ch_PP-OCRv3_det_distill_train/ch_PP-OCRv3_det_distill_train/best_accuracy
[2024/05/09 14:55:16] ppocr INFO: print_batch_step : 10
[2024/05/09 14:55:16] ppocr INFO: save_epoch_step : 100
[2024/05/09 14:55:16] ppocr INFO: save_inference_dir : None
[2024/05/09 14:55:16] ppocr INFO: save_model_dir : ./output/ch_PP-OCR_v3_det/
[2024/05/09 14:55:16] ppocr INFO: save_res_path : ./checkpoints/det_db/predicts_db.txt
[2024/05/09 14:55:16] ppocr INFO: use_gpu : False
[2024/05/09 14:55:16] ppocr INFO: use_visualdl : False
[2024/05/09 14:55:16] ppocr INFO: Loss :
[2024/05/09 14:55:16] ppocr INFO: loss_config_list :
[2024/05/09 14:55:16] ppocr INFO: DistillationDilaDBLoss :
[2024/05/09 14:55:16] ppocr INFO: alpha : 5
[2024/05/09 14:55:16] ppocr INFO: balance_loss : True
[2024/05/09 14:55:16] ppocr INFO: beta : 10
[2024/05/09 14:55:16] ppocr INFO: key : maps
[2024/05/09 14:55:16] ppocr INFO: main_loss_type : DiceLoss
[2024/05/09 14:55:16] ppocr INFO: model_name_pairs : [['Student', 'Teacher'], ['Student2', 'Teacher']]
[2024/05/09 14:55:16] ppocr INFO: ohem_ratio : 3
[2024/05/09 14:55:16] ppocr INFO: weight : 1.0
[2024/05/09 14:55:16] ppocr INFO: DistillationDMLLoss :
[2024/05/09 14:55:16] ppocr INFO: key : maps
[2024/05/09 14:55:16] ppocr INFO: maps_name : thrink_maps
[2024/05/09 14:55:16] ppocr INFO: model_name_pairs : ['Student', 'Student2']
[2024/05/09 14:55:16] ppocr INFO: weight : 1.0
[2024/05/09 14:55:16] ppocr INFO: DistillationDBLoss :
[2024/05/09 14:55:16] ppocr INFO: alpha : 5
[2024/05/09 14:55:16] ppocr INFO: balance_loss : True
[2024/05/09 14:55:16] ppocr INFO: beta : 10
[2024/05/09 14:55:16] ppocr INFO: main_loss_type : DiceLoss
[2024/05/09 14:55:16] ppocr INFO: model_name_list : ['Student', 'Student2']
[2024/05/09 14:55:16] ppocr INFO: ohem_ratio : 3
[2024/05/09 14:55:16] ppocr INFO: weight : 1.0
[2024/05/09 14:55:16] ppocr INFO: name : CombinedLoss
[2024/05/09 14:55:16] ppocr INFO: Metric :
[2024/05/09 14:55:16] ppocr INFO: base_metric_name : DetMetric
[2024/05/09 14:55:16] ppocr INFO: key : Student
[2024/05/09 14:55:16] ppocr INFO: main_indicator : hmean
[2024/05/09 14:55:16] ppocr INFO: name : DistillationMetric
[2024/05/09 14:55:16] ppocr INFO: Optimizer :
[2024/05/09 14:55:16] ppocr INFO: beta1 : 0.9
[2024/05/09 14:55:16] ppocr INFO: beta2 : 0.999
[2024/05/09 14:55:16] ppocr INFO: lr :
[2024/05/09 14:55:16] ppocr INFO: learning_rate : 0.001
[2024/05/09 14:55:16] ppocr INFO: name : Cosine
[2024/05/09 14:55:16] ppocr INFO: warmup_epoch : 2
[2024/05/09 14:55:16] ppocr INFO: name : Adam
[2024/05/09 14:55:16] ppocr INFO: regularizer :
[2024/05/09 14:55:16] ppocr INFO: factor : 5e-05
[2024/05/09 14:55:16] ppocr INFO: name : L2
[2024/05/09 14:55:16] ppocr INFO: PostProcess :
[2024/05/09 14:55:16] ppocr INFO: box_thresh : 0.6
[2024/05/09 14:55:16] ppocr INFO: key : head_out
[2024/05/09 14:55:16] ppocr INFO: max_candidates : 1000
[2024/05/09 14:55:16] ppocr INFO: model_name : ['Student']
[2024/05/09 14:55:16] ppocr INFO: name : DistillationDBPostProcess
[2024/05/09 14:55:16] ppocr INFO: thresh : 0.3
[2024/05/09 14:55:16] ppocr INFO: unclip_ratio : 1.5
[2024/05/09 14:55:16] ppocr INFO: Train :
[2024/05/09 14:55:16] ppocr INFO: dataset :
[2024/05/09 14:55:16] ppocr INFO: data_dir : ./train_data/
[2024/05/09 14:55:16] ppocr INFO: label_file_list : ['./train_data/det/train.txt']
[2024/05/09 14:55:16] ppocr INFO: name : SimpleDataSet
[2024/05/09 14:55:16] ppocr INFO: ratio_list : [1.0]
[2024/05/09 14:55:16] ppocr INFO: transforms :
[2024/05/09 14:55:16] ppocr INFO: DecodeImage :
[2024/05/09 14:55:16] ppocr INFO: channel_first : False
[2024/05/09 14:55:16] ppocr INFO: img_mode : BGR
[2024/05/09 14:55:16] ppocr INFO: DetLabelEncode : None
[2024/05/09 14:55:16] ppocr INFO: CopyPaste : None
[2024/05/09 14:55:16] ppocr INFO: IaaAugment :
[2024/05/09 14:55:16] ppocr INFO: augmenter_args :
[2024/05/09 14:55:16] ppocr INFO: args :
[2024/05/09 14:55:16] ppocr INFO: p : 0.5
[2024/05/09 14:55:16] ppocr INFO: type : Fliplr
[2024/05/09 14:55:16] ppocr INFO: args :
[2024/05/09 14:55:16] ppocr INFO: rotate : [-10, 10]
[2024/05/09 14:55:16] ppocr INFO: type : Affine
[2024/05/09 14:55:16] ppocr INFO: args :
[2024/05/09 14:55:16] ppocr INFO: size : [0.5, 3]
[2024/05/09 14:55:16] ppocr INFO: type : Resize
[2024/05/09 14:55:16] ppocr INFO: EastRandomCropData :
[2024/05/09 14:55:16] ppocr INFO: keep_ratio : True
[2024/05/09 14:55:16] ppocr INFO: max_tries : 50
[2024/05/09 14:55:16] ppocr INFO: size : [960, 960]
[2024/05/09 14:55:16] ppocr INFO: MakeBorderMap :
[2024/05/09 14:55:16] ppocr INFO: shrink_ratio : 0.4
[2024/05/09 14:55:16] ppocr INFO: thresh_max : 0.7
[2024/05/09 14:55:16] ppocr INFO: thresh_min : 0.3
[2024/05/09 14:55:16] ppocr INFO: MakeShrinkMap :
[2024/05/09 14:55:16] ppocr INFO: min_text_size : 8
[2024/05/09 14:55:16] ppocr INFO: shrink_ratio : 0.4
[2024/05/09 14:55:16] ppocr INFO: NormalizeImage :
[2024/05/09 14:55:16] ppocr INFO: mean : [0.485, 0.456, 0.406]
[2024/05/09 14:55:16] ppocr INFO: order : hwc
[2024/05/09 14:55:16] ppocr INFO: scale : 1./255.
[2024/05/09 14:55:16] ppocr INFO: std : [0.229, 0.224, 0.225]
[2024/05/09 14:55:16] ppocr INFO: ToCHWImage : None
[2024/05/09 14:55:16] ppocr INFO: KeepKeys :
[2024/05/09 14:55:16] ppocr INFO: keep_keys : ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask']
[2024/05/09 14:55:16] ppocr INFO: loader :
[2024/05/09 14:55:16] ppocr INFO: batch_size_per_card : 8
[2024/05/09 14:55:16] ppocr INFO: drop_last : False
[2024/05/09 14:55:16] ppocr INFO: num_workers : 4
[2024/05/09 14:55:16] ppocr INFO: shuffle : True
[2024/05/09 14:55:16] ppocr INFO: profiler_options : None
[2024/05/09 14:55:16] ppocr INFO: train with paddle 2.6.1 and device Place(cpu)
[2024/05/09 14:55:16] ppocr INFO: Initialize indexs of datasets:['./train_data/det/train.txt']
[2024/05/09 14:55:16] ppocr INFO: Initialize indexs of datasets:['./train_data/det/val.txt']
[2024/05/09 14:55:18] ppocr INFO: train dataloader has 3 iters
[2024/05/09 14:55:18] ppocr INFO: valid dataloader has 6 iters
[2024/05/09 14:55:18] ppocr INFO: load pretrain successful from ./pretrain_models/ch_PP-OCRv3_det_distill_train/ch_PP-OCRv3_det_distill_train/best_accuracy
[2024/05/09 14:55:18] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 400 iterations
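
One generic way to check whether a Python process like this is genuinely stuck or just slow is the standard-library `faulthandler` module, which can dump the tracebacks of all threads after a timeout. The sketch below is only an illustration: the helper names `watch_for_hang`, `stop_watching`, and `demo` are hypothetical and not part of PaddleOCR, and where to call them in a training script is left as an assumption.

```python
import faulthandler
import sys
import time


def watch_for_hang(timeout_s, repeat=True, file=None):
    """Ask faulthandler to dump all thread tracebacks once timeout_s seconds pass.

    With repeat=True the dump is written again every timeout_s seconds,
    so a long silent phase shows where the process is spending its time.
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=target)


def stop_watching():
    """Cancel the pending dump once the slow phase is over."""
    faulthandler.cancel_dump_traceback_later()


def demo(log_path="hang_traceback.log"):
    """Simulate a silent step and return what faulthandler wrote to log_path."""
    with open(log_path, "w+") as log_file:
        watch_for_hang(0.2, repeat=False, file=log_file)
        time.sleep(0.6)  # stand-in for a step that appears stuck
        stop_watching()
        log_file.seek(0)
        return log_file.read()
```

Calling something like `watch_for_hang(300)` early in the training entry point would, under these assumptions, print a traceback every five minutes; if the stack keeps moving through data loading or forward/backward code, the run is slow rather than deadlocked.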

@UserWangZz
Collaborator

Hello, are you using the CPU version or the GPU version of Paddle?

@Lijian500
Author

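> Hello, are you using the CPU version or the GPU version of Paddle?

Hello, it is the CPU version.

Later I force-closed the training program, but when I looked in the output directory I found that the corresponding model files had been generated. Perhaps my dataset was so small (30 images) that training finished almost instantly, and that is why no logs were produced?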

@UserWangZz
Collaborator

With batch size 8 and 30 images, one epoch is 4 iterations. log_smooth_window is 20, so a log line should only be printed once every 5 epochs. That may be related.
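
The arithmetic in the comment above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not PaddleOCR code: the function names are hypothetical, and the logging interval is passed as a parameter because this thread does not confirm which setting (`log_smooth_window : 20` or `print_batch_step : 10`, both in the printed config) actually controls how often a training log line appears. Note also that the log reports `train dataloader has 3 iters`, so the real per-epoch count in this run was 3 rather than 4.

```python
import math


def iters_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches one pass over the dataset produces."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def epochs_until_step(step_interval, iters):
    """Epochs needed before the global step count first reaches step_interval."""
    return math.ceil(step_interval / iters)


# Figures from the comment: 30 images, batch size 8, interval of 20 steps.
iters = iters_per_epoch(30, 8)
print(iters, epochs_until_step(20, iters))  # 4 5

# Figures from the log: 3 iterations per epoch.
print(epochs_until_step(20, 3), epochs_until_step(10, 3))  # 7 4
```

Either way, on a CPU-only run with a tiny dataset, several full epochs may have to finish before the step counter reaches the interval, which is consistent with a long silent period.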
