
English - Marathi Language Translation

  • Built EN→MR and MR→EN translation models, then merged them into a bidirectional translation model by extracting the encoder and decoder components from each model and creating a new EncoderDecoderModel with the Hugging Face transformers library.
  • Leveraging Hugging Face's pretrained multilingual translation models, I developed English-to-Marathi and Marathi-to-English translation models by fine-tuning hyperparameters and using the AutoModelForSeq2SeqLM, AutoTokenizer, and AutoConfig classes (a minimal usage sketch follows this list).
  • Compared three different models: Helsinki-NLP, mBART, and AI4Bharat.
  • Achieved the best results with the Helsinki-NLP model, which reached a loss of 0.5174, surpassing the other two.
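
As a minimal sketch of the Auto-class workflow described above: the snippet below loads the public Helsinki-NLP English-to-Marathi base checkpoint and translates one sentence. The fine-tuned checkpoints are used the same way, only with a different model id.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Public base checkpoint from the Hub; fine-tuned variants differ only in weights.
model_name = "Helsinki-NLP/opus-mt-en-mr"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize an English sentence and generate its Marathi translation.
inputs = tokenizer("How are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))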

Data

The dataset is a comprehensive collection of English-Marathi translation pairs obtained from various publicly available resources. It contains a total of 3,517,283 rows (approximately 451 MB), making it a substantial dataset for language translation tasks. It can be downloaded and loaded with the Hugging Face datasets library:

from datasets import load_dataset
dataset = load_dataset("anujsahani01/English-Marathi")
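
Once loaded, the splits and a sample row can be inspected as a quick sanity check (assuming the usual layout with a "train" split):

print(dataset)              # available splits and row counts
print(dataset["train"][0])  # first English-Marathi pair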

My Models:

The models can be accessed and tested on Hugging Face using the links below.
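
Any of the fine-tuned checkpoints can be tried through the transformers translation pipeline. The Hub id below is a placeholder; substitute the actual checkpoint name:

from transformers import pipeline

# Placeholder model id; replace with one of the fine-tuned checkpoints on the Hub.
translator = pipeline("translation", model="anujsahani01/<checkpoint-name>")
print(translator("I am learning Marathi.")[0]["translation_text"])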

Hyperparameters that best suited the models

After a series of experiments and trials, I found the set of hyperparameters on which each model performed best.

| Hyperparameter              | Helsinki-NLP | mBART  | AI4Bharat |
| --------------------------- | ------------ | ------ | --------- |
| learning_rate               | 0.0005       | 0.0005 | 0.0005    |
| max_steps                   | 10000        | 10000  | 8000      |
| warmup_steps                | 50           | 50     | 50        |
| weight_decay                | 0.01         | 0.01   | 0.01      |
| per_device_train_batch_size | 64           | 12     | 12        |
| per_device_eval_batch_size  | 64           | 12     | 12        |
| evaluation_strategy         | "no"         | "no"   | "no"      |
| num_train_epochs            | 1            | 1      | 1         |
| remove_unused_columns       | False        | False  | False     |
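
As a sketch, the Helsinki-NLP column maps onto transformers training arguments like this (output_dir is a hypothetical path; model, tokenizer, and data preparation are omitted):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="helsinki-en-mr-finetuned",  # hypothetical output path
    learning_rate=0.0005,
    max_steps=10000,
    warmup_steps=50,
    weight_decay=0.01,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    evaluation_strategy="no",   # no evaluation during training, as in the table
    num_train_epochs=1,
    remove_unused_columns=False,
)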

Results:

The following losses were obtained for the English-to-Marathi translation models. The best result came from the fine-tuned Helsinki-NLP model.

| Helsinki-NLP | mBART  | AI4Bharat |
| ------------ | ------ | --------- |
| 0.5174       | 0.8225 | 0.9779    |

The following losses were obtained for the Marathi-to-English translation models. The best result came from the fine-tuned mBART model.

| Helsinki-NLP | mBART  | AI4Bharat |
| ------------ | ------ | --------- |
| 0.6818       | 0.6712 | 0.7775    |

Feedback

If you have any feedback, please reach out to me on LinkedIn.

Author: @anujsahani01