An unofficial PyTorch implementation of the Transformer for translating Korean into English.
For this project, the Korean-English translation corpus from AI Hub was used to train the Transformer.
For tokenization, I used PyTorch tokenization with spaCy for English and soynlp for Korean. Alternatively, the tokenizers module can be used to train BPE tokenizers from scratch.
* # of Korean Sentences: 60,000
* # of English Sentences: 60,000
```
[KOR]: 11장에서는 예수님이 이번엔 나사로를 무덤에서 불러내어 죽은 자 가운데서 살리셨습니다.
[ENG]: In Chapter 11 Jesus called Lazarus from the tomb and raised him from the dead.
[KOR]: 6.5, 7, 8 사이즈가 몇 개나 더 재입고 될지 제게 알려주시면 감사하겠습니다.
[ENG]: I would feel grateful to know how many stocks will be secured of size 6.5, 7, and 8.
[KOR LEN]: 800000
[ENG LEN]: 800000
soynlp tokenizer:
[KOR]: ['11', '장에서는', '예수님', '이', '이번', '엔', '나사로를', '무덤에서', '불러', '내어', '죽은', '자', '가운데', '서', '살리셨습니다.']
spacy tokenizer:
[ENG]: ['In', 'Chapter', '11', 'Jesus', 'called', 'Lazarus', 'from', 'the', 'tomb', 'and', 'raised', 'him', 'from', 'the', 'dead', '.']
sentenceBPE tokenizers:
[KOR]: ['11', '장', '에서는', '예수', '님이', '이번엔', '나', '사로', '를', '무', '에서', '불러', '내어', '죽은', '자', '가운데', '서', '살', '리', '셨습니다', '.']
[ENG]: ['In', 'Cha', 'pter', '11', 'Jesus', 'called', 'La', 'z', 'ar', 'us', 'from', 'the', 'tomb', 'and', 'raised', 'him', 'from', 'the', 'dead', '.']
```
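To give a feel for why the BPE output above splits rare words such as "Chapter" and "Lazarus" into pieces, here is a minimal pure-Python sketch of byte-pair-encoding merge learning. It is a toy illustration, not the `tokenizers` library and not the code used in this repository: the function names `learn_bpe`, `merge_pair`, and `encode` are hypothetical, and the sketch omits the word-boundary markers that real implementations normally add.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words (toy version)."""
    # Each distinct word starts as a tuple of single characters, weighted by its count.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword symbols by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = "In Chapter 11 Jesus called Lazarus from the tomb and raised him from the dead".split()
merges = learn_bpe(corpus, num_merges=10)
print(encode("Lazarus", merges))
```

Frequent character sequences get merged into single symbols first, so common words end up as whole tokens while rare words stay split into several pieces.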
The following libraries are required to run the program:
`torch==1.9.0 spacy==2.2.4 soynlp==0.0.493 tokenizers==0.10.3 torchtext==0.10.0 en-core-web-sm==2.1.0`
-
By default, the trainer uses the Korean-English dataset from AI Hub. To use your own dataset, create a folder containing your datasets and run the script.
-
You can also choose which tokenizers to use:
- soynlp tokenizer for Korean and spaCy tokenizer for English (default)
- BPE tokenizers trained from scratch for both Korean and English
-
You can also choose whether to load a pre-trained Transformer:
- False(default)
- True (Addition)
-
You can also set the hyperparameters from the command line.
Default: `python train.py`. Full usage: `train.py [-h] [--token_type {1,2}] [--file FILE] [--load {True,False}] [--num_epoch NUM_EPOCH] [--nhead NHEAD] [--emb_size EMB_SIZE] [--ffn_hid_dim FFN_HID_DIM] [--batch_size BATCH_SIZE] [--n_layers N_LAYERS] [--dropout DROPOUT] [--variation {True,False}]`
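For readers who want to see how such a command line could be declared, below is a rough `argparse` sketch that mirrors the usage line. It is not the actual contents of `train.py`: the value types, the mapping of `--token_type 1` to the soynlp/spaCy default, and the helper name `build_parser` are assumptions, and options whose defaults are not documented here are left without one.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented usage line (not the real train.py)."""
    parser = argparse.ArgumentParser(prog="train.py")
    # Assumption: 1 = soynlp + spaCy (the documented default), 2 = BPE from scratch.
    parser.add_argument("--token_type", type=int, choices=[1, 2], default=1)
    parser.add_argument("--file")
    # The usage line lists True/False as literal choices, so they are kept as strings.
    parser.add_argument("--load", choices=["True", "False"], default="False")
    # Types below are assumptions; defaults are unknown and therefore left as None.
    parser.add_argument("--num_epoch", type=int)
    parser.add_argument("--nhead", type=int)
    parser.add_argument("--emb_size", type=int)
    parser.add_argument("--ffn_hid_dim", type=int)
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--n_layers", type=int)
    parser.add_argument("--dropout", type=float)
    parser.add_argument("--variation", choices=["True", "False"])
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--token_type", "2", "--num_epoch", "10", "--dropout", "0.1"])
print(args.token_type, args.num_epoch, args.dropout, args.load)
```

If the model is built on PyTorch's multi-head attention, `--emb_size` generally needs to be divisible by `--nhead`, so it is worth picking those two values together.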
-
Predicting
python predict.py --input KOREAN_INPUT