fix dataset problems release 2.7 #11646

Open · wants to merge 3 commits into base: release/2.7
Conversation

yiakwy-xpu-ml-framework-team

PR types

Bug fixes

PR changes

Dataset, preprocessing:

  • remove duplicated data loads in SimpleDataset
  • handle the detection task for the CTW dataset, whose annotation polylines contain more than 4 points (previously this caused an infinite loop in the dataset)
  • handle inChannels scalars
  • speed up data loading
  • improve model loading

Description

Improve code stability and fix bugs when training the cn_PP_OCR_v4 detection model.

I identified an infinite dataset-loading failure while reproducing an OCR detector in a local environment. My team is using this model to generate high-quality data for a multi-modal LLM. After this fix, the detector can be trained for a few epochs very quickly:

[screenshot: training log after the fix, 2024-02-29 15:09]

SimpleDataset infinite loop:

For a dataset such as ctw1500, each points annotation contains more than 4 points in order to locate arbitrarily shaped text in the image. The points may have the shape [2 /* number of annotations in this image */, 14 /* number of points */, 2 /* 2d coordinates */].

The utility function should be adapted to process such annotation data instead of throwing it away or raising an exception.

Note that by default the "CopyPaste" imgaug method is chosen, which means two arbitrarily selected images are paired. However, get_ext_data always reads a new image from the dataset, which is very inefficient.
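As a rough sketch of the acceleration idea (a hypothetical `ExtDataCache` helper, not the PR's actual code): keep a small pool of already-decoded samples and draw CopyPaste partners from it, falling back to a disk read only when the pool is empty.

```python
import random

class ExtDataCache:
    """Hypothetical helper: reuse already-decoded samples for CopyPaste
    instead of reading and decoding a fresh image on every call."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.samples = []

    def put(self, sample):
        # keep a bounded pool of recently decoded samples
        if len(self.samples) >= self.capacity:
            self.samples.pop(0)
        self.samples.append(sample)

    def get(self):
        # return a random cached sample; caller falls back to disk I/O on None
        if self.samples:
            return random.choice(self.samples)
        return None
```

get_ext_data would then try the cache first and only touch the filesystem on a miss; whether this trade-off is acceptable depends on how much sample diversity CopyPaste needs.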

Check-list

  • This PR is pushed to the dygraph branch or cherry-picked from the dygraph branch. Otherwise, please push your changes to the dygraph branch.
  • This PR fully describes what it does, so that reviewers can review efficiently.
  • This PR is covered by existing tests or has been verified locally.

- remove the infinite loop in SimpleDataset when a polygon has more than 4 points
- speed up data loading
- improve model loading

paddle-bot bot commented Feb 29, 2024

Thanks for your contribution!


CLAassistant commented Feb 29, 2024

CLA assistant check
All committers have signed the CLA.

Collaborator

@tink2123 tink2123 left a comment


Thank you for your contribution!

Your acceleration solution is effective in some scenarios. Have you considered adding a parameter to the configuration file to make this behavior selectable?

Also, could you please submit a PR to the dygraph branch? That modification will be merged into the next stable branch.

if data is None:
    continue
if 'polys' in data.keys():
    if data['polys'].shape[1] != 4:
        continue
Collaborator


need to handle this situation

Author


Hi, this is where we need to discuss.

This condition led to an endless loop, and it discarded every annotation line with more than 4 points in the detection task.

I have checked the annotation files we generated and the official CTW1500 (800 MB) and ICDAR2015 (80 MB) datasets used for finetuning and evaluation: both contain annotation lines with more than 4 points.

The root cause is that the utility file can only process polygons with a fixed number of points:

assert len(points) == 4, "shape of points must be 4*2"

Hence we removed the unnecessary constraint (a 4-point bbox annotation is just a special case of a general closed polyline) and training proceeds smoothly again.
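As an illustration of that idea (a hypothetical `crop_polygon_region` helper, not the code in this PR), a crop can operate on the axis-aligned bounding box of an arbitrary polyline, so the 4-point bbox becomes just the simplest input:

```python
import numpy as np

def crop_polygon_region(img, points):
    """Hypothetical sketch: instead of asserting exactly 4 points, crop
    the axis-aligned bounding box of an arbitrary closed polyline."""
    points = np.asarray(points, dtype=np.float32).reshape(-1, 2)
    assert len(points) >= 2, "need at least 2 points to define a region"
    # clamp the bounding box to the image extent
    left = int(np.clip(points[:, 0].min(), 0, img.shape[1] - 1))
    right = int(np.clip(points[:, 0].max(), 1, img.shape[1]))
    top = int(np.clip(points[:, 1].min(), 0, img.shape[0] - 1))
    bottom = int(np.clip(points[:, 1].max(), 1, img.shape[0]))
    return img[top:bottom, left:right].copy()
```

The real get_rotate_crop_image additionally applies a perspective warp for 4-point quads; this sketch only shows that the point-count restriction itself is not needed for cropping.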

ppocr/data/simple_dataset.py (outdated; resolved)
ppocr/data/simple_dataset.py (outdated; resolved)
@@ -75,6 +75,9 @@ def maybe_download_params(model_path):
    else:
        url = model_path
    tmp_path = os.path.join(MODELS_DIR, url.split('/')[-1])
    if os.path.exists(tmp_path) and os.path.isfile(tmp_path):
Collaborator


can be deleted

Author


Hi, the model from model_path is downloaded to tmp_path, but the download is repeated every time we launch a task on the same machine, which takes about 3-5 minutes. This is unnecessary.

A better way, I think, would be to provide an MD5 sum in the model configuration file so we can check whether the file in tmp_path is correct.

But any change to the config in this project would require checking all the other configuration files, which is too cumbersome.
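A minimal sketch of that caching-with-checksum idea (a hypothetical `params_are_cached` helper; the MD5 field in the model config is an assumption, not an existing option):

```python
import hashlib
import os

def params_are_cached(tmp_path, expected_md5=None):
    """Hypothetical helper: skip re-downloading model params when a local
    copy exists, optionally verifying it against a configured MD5 sum."""
    if not (os.path.exists(tmp_path) and os.path.isfile(tmp_path)):
        return False
    if expected_md5 is None:
        # no checksum configured: trust the cached file (the PR's behavior)
        return True
    md5 = hashlib.md5()
    with open(tmp_path, 'rb') as f:
        # hash in 1 MiB chunks to avoid loading large params into memory
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5
```

maybe_download_params would download only when this returns False; without a checksum the check degrades to a plain existence test, which is what the diff above does.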

@@ -609,8 +610,31 @@ def get_rotate_crop_image(img, points):
    img_crop = img[top:bottom, left:right, :].copy()
    points[:, 0] = points[:, 0] - left
    points[:, 1] = points[:, 1] - top
    '''
    assert len(points) == 4, "shape of points must be 4*2"
Collaborator


There are situations where more data formats need to be handled in addition to CTW.

Author


Yes. But in CTW, ICDAR2015, and our own annotation files, we don't assume a bbox consists of a fixed number of points to form a valid polyline.

So we just treat it as a general polyline: 2 points (upper left, bottom right), 3 points (triangles), and closed polylines with more points.

@yiakwy-xpu-ml-framework-team
Author

By the way, I just found this line when scanning the dataset file:

resized_image = resized_image.astype('float32')

The image can be processed effectively as uint8, which allows 8-bit I/O workloads and better compute-I/O overlap.

It seems we haven't used DALI to accelerate I/O through device prefetching and on-chip preprocessing.
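To illustrate the I/O argument (illustrative numbers only, not the PR's code): a uint8 image moves a quarter of the bytes of its float32 counterpart through the input pipeline, so the cast could be deferred until normalization, ideally on-device.

```python
import numpy as np

# A 960x960 RGB image kept as uint8 vs. cast to float32 up front.
resized_u8 = np.zeros((960, 960, 3), dtype=np.uint8)
resized_f32 = resized_u8.astype('float32')

# float32 carries 4x the bytes per image through the loader
assert resized_f32.nbytes == 4 * resized_u8.nbytes
```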

@jzhang533
Collaborator

Can we land this one? @tink2123

@jzhang533 jzhang533 changed the base branch from release/2.7 to dygraph March 29, 2024 05:51
@jzhang533 jzhang533 changed the base branch from dygraph to release/2.7 March 29, 2024 05:52
@jzhang533
Collaborator

@yiakwy-xpu-ml-framework-team you'll need to create a PR targeting the dygraph branch. We decided not to merge this into release/2.7.

@yiakwy-xpu-ml-framework-team
Author

> @yiakwy-xpu-ml-framework-team you'll need to create a PR targeting the dygraph branch. We decided not to merge this into release/2.7.

I guess I can cherry-pick this commit onto the dygraph branch. I will do it soon.
