[Experiment] Data cleaning Apr 2024 #517

Draft · wants to merge 81 commits into main
@eu9ene (Collaborator) commented Apr 8, 2024

Experiment insights

OpusCleaner

  • legacy cleaning slightly outperforms all OpusCleaner configs (likely due to the num_mismatch filter in OpusCleaner)
  • the large FastText model significantly reduces false positives compared to the small one (see the sketch after this list)
  • FastText can remove a lot of useful data on cleaner datasets, especially short phrases
  • the alpha ratio filter can remove useful data on cleaner datasets
  • custom OpusCleaner configs slightly outperform the default one
  • custom OpusCleaner configs + bicleaner significantly outperform the default one + bicleaner (+5M useful sentences due to removing some cleaning rules)
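
FastText here refers to fastText language identification. A minimal sketch of the small vs. large LID model comparison, assuming the pretrained lid.176.ftz (small) and lid.176.bin (large) models from fasttext.cc have been downloaded locally; the example phrases are made up:

```python
import fasttext

# Pretrained fastText language-ID models (downloaded from fasttext.cc).
small = fasttext.load_model("lid.176.ftz")  # compressed model, less accurate
large = fasttext.load_model("lid.176.bin")  # full model

# Short phrases are where LID is least reliable and where a language
# filter tends to throw away useful parallel data.
phrases = ["ok", "TED Talks", "Нет.", "Спасибо большое!"]

for text in phrases:
    for name, model in (("small", small), ("large", large)):
        # predict() rejects newlines, so sanitize the input first
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        print(f"{name}: {text!r} -> {labels[0]} ({probs[0]:.2f})")
```

A cleaning filter drops a pair when the predicted language or its confidence doesn't match the expected source/target language, which is how confident-but-wrong predictions on short phrases turn into lost data.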

OpusFilter

  • an OpusFilter config similar to the OpusCleaner one, with auto-tuning, performs a lot worse than the OpusCleaner one (likely due to the difference in filters)
  • OpusFilter with LASER and auto-tuning performs better than without it but still worse than OpusCleaner (Helsinki folks pointed out that there's a bug in sampling with LASER)
  • auto-tuning with only basic OpusCleaner-like filters (no bicleaner or LASER; see the sketch after this list) performs better than the OpusCleaner-like defaults and better than auto-tuning with feature selection disabled, mostly because it trained longer and had more data
  • auto-tuning with LASER and Bicleaner AI enabled filters out way too much data and underperforms
  • neither the auto-tuned nor the defaults-based OpusCleaner-like rules outperform the OpusCleaner defaults baseline (likely due to a difference in the FastText implementation)
  • (TODO) tune LASER and bicleaner separately
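
To make "basic OpusCleaner-like filters" concrete: the rules in question are simple per-pair checks such as segment length, length ratio, and alpha ratio. An illustrative Python sketch of that kind of rule, not the actual OpusCleaner or OpusFilter implementation, with made-up thresholds:

```python
import re

ALPHA_RE = re.compile(r"[^\W\d_]")  # any alphabetic character, any script

def alpha_ratio(text: str) -> float:
    """Fraction of characters that are alphabetic."""
    return len(ALPHA_RE.findall(text)) / max(len(text), 1)

def keep_pair(src: str, trg: str,
              max_len: int = 150,
              max_len_ratio: float = 3.0,
              min_alpha: float = 0.5) -> bool:
    """Rule-based checks of the kind these toolkits apply.
    All thresholds here are illustrative, not the tuned values."""
    src_words, trg_words = src.split(), trg.split()
    if not src_words or not trg_words:
        return False                      # empty side
    if max(len(src_words), len(trg_words)) > max_len:
        return False                      # overlong segment
    ratio = len(src_words) / len(trg_words)
    if ratio > max_len_ratio or ratio < 1 / max_len_ratio:
        return False                      # length mismatch
    if min(alpha_ratio(src), alpha_ratio(trg)) < min_alpha:
        return False                      # mostly digits/punctuation
    return True

print(keep_pair("Hello world", "Привет, мир"))   # True
print(keep_pair("12345 !!!", "Привет, мир"))     # False (alpha ratio)
```

Auto-tuning then searches over thresholds like these rather than using fixed defaults.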

Bicleaner AI

  • I deployed OpusCleaner on a GPU with Bicleaner AI support; it's a little slow but works
  • it's very hard to tune bicleaner thresholds in OpusCleaner
  • manual analysis of score distributions and examples in Jupyter shows that even with a 0.9 threshold there are plenty of incorrect translations (see the sketch after this list)
  • experimented with 0.5 vs 0.8 vs 0.9 thresholds for all datasets: 0.8 slightly outperforms 0.5; 0.9 filters out too much data but is still competitive
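
The threshold analysis can be reproduced with a short notebook snippet. A minimal sketch, assuming a TSV where the Bicleaner AI score was appended as the last column (the file name and column layout are assumptions):

```python
import numpy as np

scores = []
with open("corpus.scored.tsv", encoding="utf-8") as f:
    for line in f:
        # assumed layout: src \t trg \t score
        scores.append(float(line.rstrip("\n").split("\t")[-1]))
scores = np.array(scores)

# How much data each candidate threshold would keep.
for threshold in (0.5, 0.8, 0.9):
    kept = (scores >= threshold).mean()
    print(f"threshold {threshold}: keeps {kept:.1%} of pairs")

# Coarse histogram to eyeball where the probability mass sits.
counts, edges = np.histogram(scores, bins=10, range=(0.0, 1.0))
for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"{lo:.1f}-{hi:.1f}: {count}")
```

Sampling pairs just above each candidate threshold and reading them by hand is what surfaced the incorrect translations that survive even at 0.9.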

LASER

  • also hard to tune in OpusCleaner (see the sketch after this list)
  • LASER 2/3 is slower than LASER 1 and requires a GPU
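
For reference, LASER filtering scores a pair by the cosine similarity of the two sentence embeddings in LASER's shared multilingual space; picking a cutoff over those scores is exactly the tuning problem noted above. A minimal sketch using the laserembeddings package (a LASER 1 wrapper; models must be fetched first, e.g. with `python -m laserembeddings download-models`):

```python
import numpy as np
from laserembeddings import Laser  # pip install laserembeddings

laser = Laser()

src = ["The cat sat on the mat.", "Click here to subscribe."]
trg = ["Кошка сидела на коврике.", "Сегодня хорошая погода."]

# Embed both sides into the shared multilingual space (1024-dim vectors).
src_emb = laser.embed_sentences(src, lang="en")
trg_emb = laser.embed_sentences(trg, lang="ru")

# Cosine similarity of aligned pairs; a low score suggests the pair
# is a misalignment or a mistranslation.
sims = (src_emb * trg_emb).sum(axis=1) / (
    np.linalg.norm(src_emb, axis=1) * np.linalg.norm(trg_emb, axis=1)
)

for (s, t), sim in zip(zip(src, trg), sims):
    print(f"{sim:.2f}\t{s}\t{t}")
```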

More questions to explore:

LASER embedding similarity filter:

  • What's the impact of the LASER filter?
  • Can LASER be useful together with Bicleaner-AI?
  • Does LASER 2/3 significantly outperform LASER 1?

Bicleaner AI:

  • Will customizing the thresholds for large datasets boost performance?

Setup

en-ru pair, all data except CCMatrix/NLLB, training a backward model (ru-en)

Example config:

```yaml
datasets:
  # all except ccmatrix and nllb to test filtering
  train:
    - opus_Books/v1
    - opus_CCAligned/v1
    - opus_ELRC-3075-wikipedia_health/v1
    - opus_ELRC-3855-SWPS_University_Soci/v1
    - opus_ELRC-5067-SciPar/v1
    - opus_ELRC-5183-SciPar_Ukraine/v1
    - opus_ELRC-wikipedia_health/v1
    - opus_ELRC_2922/v1
    - opus_EUbookshop/v2
    - opus_GNOME/v1
    - opus_GlobalVoices/v2018q4
    - opus_KDE4/v2
    - opus_LinguaTools-WikiTitles/v2014
    - opus_NeuLab-TedTalks/v1
    - opus_News-Commentary/v16
    - opus_OpenSubtitles/v2018
    - opus_PHP/v1
    - opus_ParaCrawl/v9
    - opus_QED/v2.0a
    - opus_TED2013/v1.1
    - opus_TED2020/v1
    - opus_Tanzil/v1
    - opus_Tatoeba/v2023-04-12
    - opus_TildeMODEL/v2018
    - opus_UNPC/v1.0
    - opus_Ubuntu/v14.10
    - opus_WikiMatrix/v1
    - opus_WikiTitles/v3
    - opus_Wikipedia/v1.0
    - opus_XLEnt/v1.2
    - opus_ada83/v1
    - opus_bible-uedin/v1
    - opus_infopankki/v1
    - opus_tico-19/v2020-10-28
    - opus_tldr-pages/v2023-08-29
    - opus_wikimedia/v20230407
    - mtdata_Statmt-commoncrawl_wmt13-1-rus-eng
    - mtdata_Statmt-news_commentary_wmt18-13-rus-eng
    - mtdata_Tilde-airbaltic-1-eng-rus
    - mtdata_Tilde-czechtourism-1-eng-rus
    - mtdata_Tilde-worldbank-1-eng-rus
    - mtdata_UN-un_dev-1-eng-rus
    - mtdata_UN-un_test-1-eng-rus
  # datasets to merge for validation while training
  devtest:
    - flores_dev
    - sacrebleu_aug-mix_wmt19
    - sacrebleu_aug-mix_wmt17
    - sacrebleu_aug-mix_wmt15
    - sacrebleu_aug-mix_wmt14
  # datasets for evaluation
  test:
    - flores_devtest
    - sacrebleu_wmt20
    - sacrebleu_wmt18
    - sacrebleu_wmt16
    - sacrebleu_wmt13
  # monolingual datasets (e.g. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
  # to be translated by the teacher model
  mono-src:
    - news-crawl_news.2008
  # to be translated by the backward model to augment teacher corpus with back-translations
  # leave empty to skip augmentation step (high resource languages)
  mono-trg:
    - news-crawl_news.2008
experiment:
  src: en
  trg: ru
  name: opuscleaner_custom_laser_bicleaner
  vocab: NOT-YET-SUPPORTED
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
  best-model: chrf
  split-length: 2000000
  backward-model: NOT-YET-SUPPORTED
  spm-sample-size: 10000000
  spm-vocab-size: 32000
  teacher-ensemble: 1
  mono-max-sentences-src: 500000000
  mono-max-sentences-trg: 500000000
  use-opuscleaner: 'true'
marian-args:
  decoding-teacher:
    precision: float16
    mini-batch-words: '4000'
  training-student:
    early-stopping: '20'
  decoding-backward:
    beam-size: '8'
    mini-batch-words: '2000'
  training-backward:
    after: 10e
  training-teacher:
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
taskcluster:
  split-chunks: 10
target-stage: train-backwards
```
