
Fine-Tune BART for Translation on WMT16 Dataset (and Train new Tokenizer)

Ala Falaki, PhD · Published in Towards AI · 5 min read · Mar 13, 2023


BART is best known as a summarization model, but it can also handle the translation task when paired with a tokenizer appropriate for the target language.

Photo by Etienne Girardet on Unsplash

I recently set out to test a new architecture on the translation task and needed to train a tokenizer on my custom dataset. I noticed that creating a new tokenizer with Hugging Face can be challenging. In this story, I will focus on the preprocessing step and only briefly mention the fine-tuning, since many resources are already available, including my how-to guide on training a seq2seq model.

New to NLP? Start by reading about “what tokenization is”.

It is easy to find pre-trained tokenizers for the English language by heading to the Hugging Face Hub and selecting any pre-trained model. However, training a new tokenizer for a new language can get tricky! Let's see how it can be done, starting with loading and preprocessing the dataset.
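For example, grabbing an existing English tokenizer takes a single call. Here is a minimal sketch; facebook/bart-base is just one checkpoint that ships with a pre-trained tokenizer:

```python
from transformers import AutoTokenizer

# Download BART's pre-trained (English) tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

# Tokenize a sample sentence and inspect the resulting token ids.
print(tokenizer("It is easy to tokenize English text.")["input_ids"])
```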

Loading the Dataset

I decided to use the WMT16 dataset and its Romanian-to-English subset. The load_dataset() function will download and load any available dataset from the Hugging Face Hub.
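A minimal sketch of that call, using the "wmt16" dataset with its "ro-en" configuration:

```python
from datasets import load_dataset

# Download the Romanian-English subset of WMT16 from the Hugging Face Hub.
dataset = load_dataset("wmt16", "ro-en")

print(dataset)
# A DatasetDict with 'train', 'validation', and 'test' splits, where each row
# holds a 'translation' dict such as {'en': '...', 'ro': '...'}.
```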

You can see the dataset content in Figure 1. The first step is to flatten the dataset, since the nested "translation" key makes it awkward to access the source and target texts (dataset['train']['translation']['en'] instead of dataset['train']['en']). This step is not strictly necessary, but I found the flattened dataset easier to work with. The code below splits the dataset into three variables, flattens each one, and saves them to disk.
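Since the original snippet is not reproduced here, the following is a sketch of what it could look like. flatten(), rename_columns(), and save_to_disk() are standard datasets methods; the output paths are placeholders:

```python
# Split the DatasetDict into its three splits and flatten the nested
# "translation" column into "translation.en" / "translation.ro" columns.
train = dataset["train"].flatten()
validation = dataset["validation"].flatten()
test = dataset["test"].flatten()

# Rename the flattened columns so that split['en'] works directly.
column_mapping = {"translation.en": "en", "translation.ro": "ro"}
train = train.rename_columns(column_mapping)
validation = validation.rename_columns(column_mapping)
test = test.rename_columns(column_mapping)

# Save each split to disk (placeholder paths).
train.save_to_disk("./wmt16_ro_en/train")
validation.save_to_disk("./wmt16_ro_en/validation")
test.save_to_disk("./wmt16_ro_en/test")
```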

As Figure 2 shows, the flattening removed the "translation" dimension from the dataset.
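A quick sanity check, assuming the renamed columns from the sketch above:

```python
print(train.column_names)  # ['en', 'ro'] instead of ['translation']
print(train[0]["en"])      # the first English sentence
```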
