
Fine-Tune BART for Translation on WMT16 Dataset (and Train new Tokenizer)

Ala Falaki, PhD · Published in Towards AI · 5 min read · Mar 13, 2023


BART is best known as a summarization model, but it can also handle the translation task when paired with a tokenizer appropriate for the target language.

Photo by Etienne Girardet on Unsplash

I recently set out to test a new architecture on the translation task and needed to train a tokenizer on my custom dataset. I noticed that creating a new tokenizer with Hugging Face can be challenging. In this story, I will focus on the preprocessing step and only briefly mention the fine-tuning, since many resources are already available, including my how-to guide on training a seq2seq model.

New to NLP? Start by reading about “what tokenization is”.

It is easy to find pre-trained tokenizers for the English language by heading to the Hugging Face Hub and selecting any pre-trained model. However, training a new tokenizer for a new language can get tricky! Let's see how it can be done, starting with loading and preprocessing the dataset.
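For example, grabbing an existing English tokenizer takes a single call. Here is a minimal sketch; facebook/bart-base is just one checkpoint that ships with a pre-trained tokenizer:

```python
from transformers import AutoTokenizer

# Download BART's pre-trained (English) tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

# Tokenize a sample sentence and inspect the resulting token ids.
print(tokenizer("It is easy to tokenize English text.")["input_ids"])
```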

Loading the Dataset

I decided to use the WMT16 dataset and its Romanian-to-English subset. The load_dataset() function will download and load any available dataset from the Hugging Face Hub.
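A minimal sketch of that call, using the "wmt16" dataset with its "ro-en" configuration:

```python
from datasets import load_dataset

# Download the Romanian-English subset of WMT16 from the Hugging Face Hub.
dataset = load_dataset("wmt16", "ro-en")

print(dataset)
# A DatasetDict with 'train', 'validation', and 'test' splits, where each row
# holds a 'translation' dict such as {'en': '...', 'ro': '...'}.
```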

You can see the dataset content in Figure 1. The first step is to flatten the dataset, since the nested "translation" key makes it awkward to access the source and target texts (dataset['train']['translation']['en'] instead of dataset['train']['en']). This step is not strictly necessary, but I found the flattened dataset easier to work with. The code below splits the dataset into three variables, flattens each one, and saves them to disk.
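Since the original snippet is not reproduced here, the following is a sketch of what it could look like. flatten(), rename_columns(), and save_to_disk() are standard datasets methods; the output paths are placeholders:

```python
# Split the DatasetDict into its three splits and flatten the nested
# "translation" column into "translation.en" / "translation.ro" columns.
train = dataset["train"].flatten()
validation = dataset["validation"].flatten()
test = dataset["test"].flatten()

# Rename the flattened columns so that split['en'] works directly.
column_mapping = {"translation.en": "en", "translation.ro": "ro"}
train = train.rename_columns(column_mapping)
validation = validation.rename_columns(column_mapping)
test = test.rename_columns(column_mapping)

# Save each split to disk (placeholder paths).
train.save_to_disk("./wmt16_ro_en/train")
validation.save_to_disk("./wmt16_ro_en/validation")
test.save_to_disk("./wmt16_ro_en/test")
```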

As Figure 2 shows, the flattening removed the "translation" dimension from the dataset.
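A quick sanity check, assuming the renamed columns from the sketch above:

```python
print(train.column_names)  # ['en', 'ro'] instead of ['translation']
print(train[0]["en"])      # the first English sentence
```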
