Hello and welcome to the diacritics restoration challenge! The goal is to write a program that automatically restores the missing diacritics in a given Romanian text. For example, given the sentence "Viorica Dancila a fost chemata si se va duce, miercuri dimineata, la Cotroceni, pentru o intalnire fata in fata cu presedintele Klaus Iohannis", we would like the program to output "Viorica Dancilă a fost chemată și se va duce, miercuri dimineață, la Cotroceni, pentru o întâlnire față în față cu președintele Klaus Iohannis".
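To make the task concrete, here is a minimal sketch of one possible baseline: for every diacritic-free word form, remember the most frequent diacritized form seen in training text. This is only an illustration and not a method prescribed by the organizers; the names `train_baseline` and `restore` are hypothetical, and words are assumed to be separated by whitespace.

```python
from collections import Counter, defaultdict

# Map Romanian diacritics (cedilla and comma-below forms) to base letters.
# Order: a-breve, a-circumflex, i-circumflex, s-cedilla, t-cedilla,
# s-comma, t-comma, followed by the same letters in upper case.
STRIP = str.maketrans(
    "\u0103\u00e2\u00ee\u015f\u0163\u0219\u021b"
    "\u0102\u00c2\u00ce\u015e\u0162\u0218\u021a",
    "aaiststAAISTST",
)


def train_baseline(words):
    """Map each stripped word to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for word in words:
        counts[word.translate(STRIP)][word] += 1
    return {base: forms.most_common(1)[0][0] for base, forms in counts.items()}


def restore(words, table):
    """Replace each word with its learned form; unseen words stay unchanged."""
    return [table.get(word, word) for word in words]


# Tiny toy training text: "fata" appears twice as fa(t-cedilla)(a-breve)
# and once as fat(a-breve), so the first form wins.
train = ["fa\u0163\u0103", "fa\u0163\u0103", "fat\u0103", "\u015fi", "in"]
table = train_baseline(train)
restored = restore(["fata", "si", "la"], table)
```

A word-level lookup like this ignores context, so it cannot distinguish forms that depend on the surrounding sentence; it is meant only as a starting point.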

You are provided with a training set totalling over 220 million words and you have to submit results on two evaluation sets, named dev and test, each containing about 2 million words. For more information regarding the data please check the next section. If you are eager to start right away, here's the link to the data (you can obtain the password from one of the instructors). We ask participants to train their systems only on the data provided.

To have your algorithm's predictions evaluated, use the submission page. You can upload the restored files for either of the two sets, dev and test. The results for the dev set will be shown immediately in the public and private leaderboards, while the results for the test set will be shown once the challenge is over, but you can track your submissions on the test leaderboard. The reason for having two separate evaluation sets is to avoid over-fitting – for example, see this Kaggle post-mortem which describes such a case of overfitting the leaderboard. You are allowed to make at most ten submissions per day for either of the evaluation sets. For the test set only the last two submissions are taken into account.

The submission file should have the same number of words as the evaluation file, but with the characters a, i, s and t replaced by their restored versions. For the purpose of this challenge we use the cedilla variants of the diacritics (ş and ţ), although we are well aware that the comma-below variants (ș and ț) are the proper ones in Romanian.
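The format rules above can be checked locally before uploading. The sketch below is not the official validator; `check_submission` and `to_cedilla` are hypothetical helper names, and it assumes that words are whitespace-separated and that the evaluation text contains no diacritics at all.

```python
# Map Romanian diacritics (cedilla and comma-below forms) to base letters.
STRIP = str.maketrans(
    "\u0103\u00e2\u00ee\u015f\u0163\u0219\u021b"
    "\u0102\u00c2\u00ce\u015e\u0162\u0218\u021a",
    "aaiststAAISTST",
)
# Map comma-below s/t (U+0219, U+021B, U+0218, U+021A) to the cedilla
# forms (U+015F, U+0163, U+015E, U+0162) used in this challenge.
TO_CEDILLA = str.maketrans(
    "\u0219\u021b\u0218\u021a", "\u015f\u0163\u015e\u0162"
)
COMMA_BELOW = set("\u0219\u021b\u0218\u021a")


def to_cedilla(text):
    """Convert comma-below s and t to their cedilla counterparts."""
    return text.translate(TO_CEDILLA)


def check_submission(input_text, restored_text):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    in_words = input_text.split()
    out_words = restored_text.split()
    if len(in_words) != len(out_words):
        problems.append(
            f"word count differs: {len(in_words)} vs {len(out_words)}"
        )
    for i, (src, out) in enumerate(zip(in_words, out_words)):
        # Removing the diacritics must give back the original word.
        if out.translate(STRIP) != src:
            problems.append(f"word {i}: {out!r} does not restore {src!r}")
        if COMMA_BELOW & set(out):
            problems.append(f"word {i}: {out!r} uses comma-below letters")
    return problems
```

If a system was built with comma-below letters, passing its output through `to_cedilla` before writing the submission file avoids the last kind of problem.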

You can track the performance of the algorithms using the leaderboards:

The metric used for evaluation is word accuracy – the fraction of words that correctly match the words in the ground-truth file. For convenience, here is a simple Python function that computes the word accuracy metric:


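def word_accuracy(true_words, pred_words):
    """Fraction of ground-truth words matched by the prediction at the same position."""
    # zip stops at the shorter list, so missing predictions count as errors.
    return sum(t == p for t, p in zip(true_words, pred_words)) / len(true_words)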
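As a quick usage sketch, the metric can be applied to two whitespace-split strings. The function is repeated so that the snippet runs on its own; the two sentences are a shortened version of the example from the introduction, written with cedilla letters, and splitting on whitespace is an assumption about how words are delimited.

```python
def word_accuracy(true_words, pred_words):
    return sum(t == p for t, p in zip(true_words, pred_words)) / len(true_words)


# Ground truth and a prediction that missed one word ("chemata").
truth = "a fost chemată şi se va duce".split()
pred = "a fost chemata şi se va duce".split()

accuracy = word_accuracy(truth, pred)
print(f"{accuracy:.4f}")  # 6 of the 7 words match
```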

Data

This section describes the data that's used for the competition. The data can be downloaded from this link. To get the password please contact one of the instructors.

The Diacritics restoration text corpus was created by the Speech and Dialogue Research Laboratory (SpeeD). The corpus comprises plain text with (mostly) correct Romanian diacritics. The texts were mainly collected from the Internet and comprise proceedings of the European Parliament (a subset of the Europarl corpus), theatre plays, electronic books, scientific papers, transcriptions of several Romanian talk-shows and online news. The raw text was preprocessed and normalized using TextCorpusCleaner, a tool created by SpeeD to make diacritics and hyphens uniform, replace URLs, emails and abbreviations with their spoken form, replace numbers (in different formats) with text, handle special characters, lowercase the text, and remove text in other languages.

The corpus contains several parts, as listed in the table below:

Part name     | Contents                                  | Diacritics | Nr. words  | Usage
--------------|-------------------------------------------|------------|------------|------
Europarl      | Proceedings of the European Parliament    | OK         | 5,275,434  | train
Antena3       | Online news                               | OK         | 24,598,321 | train
              |                                           | Restored   | 1,996,510  | train
Libertatea    | Online news                               | OK         | 36,524,189 | train
              |                                           | Restored   | 34,416,164 | train
Realitatea    | Online news                               | OK         | 59,479,674 | train
              |                                           | Restored   | 8,342,909  | train
Miscellaneous | Theatre plays, e-books, scientific papers | OK         | 9,773,664  | train
Talkshows     | Romanian talkshows                        | OK         | 39,724,182 | train
              |                                           | No         | 2,045,148  | dev
              |                                           | No         | 2,036,599  | test
This table enumerates the provided data sources.
Restored: these parts of the corpus initially had missing diacritics or lacked them completely, and were subject to a diacritics restoration process using the system of Petrică et al. (2014), which has a WER of 0.52%. The files with restored diacritics have names containing the term rdia.
No: these parts of the corpus are the evaluation sets and are distributed without diacritics.
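As a sanity check on the table, the word counts of the training parts can be summed. The snippet below simply copies the numbers from the table; the variable names are illustrative.

```python
# (part, diacritics status, number of words) for every row marked "train".
TRAIN_PARTS = [
    ("Europarl", "OK", 5_275_434),
    ("Antena3", "OK", 24_598_321),
    ("Antena3", "Restored", 1_996_510),
    ("Libertatea", "OK", 36_524_189),
    ("Libertatea", "Restored", 34_416_164),
    ("Realitatea", "OK", 59_479_674),
    ("Realitatea", "Restored", 8_342_909),
    ("Miscellaneous", "OK", 9_773_664),
    ("Talkshows", "OK", 39_724_182),
]

train_total = sum(n for _, _, n in TRAIN_PARTS)
restored_total = sum(n for _, status, n in TRAIN_PARTS if status == "Restored")

print(f"{train_total:,}")                     # 220,131,047
print(f"{restored_total:,}")                  # 44,755,583
print(f"{restored_total / train_total:.1%}")  # 20.3%
```

The total agrees with the "over 220 million words" quoted for the training set, and about a fifth of the training words come from parts whose diacritics were restored automatically rather than written by the original authors.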

Please cite the following papers if you use this corpus in your work: