Hello and welcome to the diacritics restoration challenge! The goal is to write a program that automatically adds the missing diacritics to a given Romanian text. For example, given the sentence "Viorica Dancila a fost chemata si se va duce, miercuri dimineata, la Cotroceni, pentru o intalnire fata in fata cu presedintele Klaus Iohannis" we would like the program to output "Viorica Dancilă a fost chemată și se va duce, miercuri dimineață, la Cotroceni, pentru o întâlnire față în față cu președintele Klaus Iohannis".
You are provided with a training set totalling over 220 million words, and you have to submit results on two evaluation sets, named dev and test, each containing about 2 million words. For more information regarding the data, please check the next section. If you are eager to start right away, here's the link to the data (you can obtain the password from one of the instructors). We ask participants to train their systems using only the data provided.
To have your algorithm's predictions evaluated, use the submission page. You can upload the restored files for either of the two sets, dev and test. The results for the dev set are shown immediately in the public and private leaderboards, while the results for the test set are shown only once the challenge is over; until then you can still track your submissions on the test leaderboard. The reason for having two separate evaluation sets is to avoid over-fitting – for example, see this Kaggle post-mortem, which describes such a case of overfitting the leaderboard. You are allowed to make at most ten submissions per day for either of the evaluation sets. For the test set, only the last two submissions are taken into account.
The submission file should have the same number of words as the evaluation file, with the characters a, i, s, t replaced by their restored versions wherever a diacritic is needed. For the purpose of this challenge we are using the cedilla version of the diacritics for the letters s and t (ş and ţ), although we are well aware that the comma-below diacritics (ș and ț) are the proper forms in the Romanian language.
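The character convention described above can be sketched in Python as follows. This is only an illustration, not part of the challenge tooling: the helper names (`strip_diacritics`, `to_cedilla`, `check_submission`) are hypothetical, and the sketch assumes that words are separated by whitespace and that the evaluation text contains no diacritics.

```python
# Hypothetical helpers; the names and the whitespace tokenization are assumptions.

# Restored letters mapped back to their plain base letter
# (both the cedilla and the comma-below forms are covered).
STRIP_TABLE = str.maketrans({
    "ă": "a", "â": "a", "î": "i", "ş": "s", "ţ": "t", "ș": "s", "ț": "t",
    "Ă": "A", "Â": "A", "Î": "I", "Ş": "S", "Ţ": "T", "Ș": "S", "Ț": "T",
})

# Comma-below letters mapped to the cedilla forms used by this challenge.
CEDILLA_TABLE = str.maketrans({"ș": "ş", "ț": "ţ", "Ș": "Ş", "Ț": "Ţ"})


def strip_diacritics(text):
    """Remove Romanian diacritics from the text."""
    return text.translate(STRIP_TABLE)


def to_cedilla(text):
    """Convert comma-below s and t to their cedilla counterparts."""
    return text.translate(CEDILLA_TABLE)


def check_submission(eval_text, restored_text):
    """Return True if both texts have the same number of words and
    each restored word differs from the evaluation word only by diacritics."""
    eval_words = eval_text.split()
    restored_words = restored_text.split()
    if len(eval_words) != len(restored_words):
        return False
    return all(strip_diacritics(r) == e for e, r in zip(eval_words, restored_words))
```

Running `to_cedilla` over your output before uploading is a cheap way to guard against a model that was trained on, or emits, comma-below characters; `check_submission` catches accidental changes to word count or to letters other than a, i, s, t.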
You can track the performance of the algorithms using the leaderboards:
- Dev Public – aggregates the performance on the dev set across all users. This leaderboard reports the best performance of each user.
- Dev: My submissions – is a private page for each user showing all the submissions and their performance.
- Test – shows the performance on the test set of the last two submissions for each user. The performance will be hidden until the deadline has passed (the 14th of August, 2018).
The metric used for evaluation is word accuracy – the fraction of words that exactly match the corresponding words in the ground-truth file. For convenience, here is a simple Python function that computes the word accuracy metric:
def word_accuracy(true_words, pred_words):
    return sum(t == p for t, p in zip(true_words, pred_words)) / len(true_words)
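To show the metric in use, here is a minimal sketch of a most-frequent-form baseline scored with word accuracy. It is not the organizers' method or a reference solution: the function names `train_unigram` and `restore` are hypothetical, the toy sentences stand in for the real corpus, and the sketch assumes lowercased, whitespace-separated words with cedilla diacritics. The `word_accuracy` variant below additionally rejects inputs of different lengths, which is an assumption about how a mismatch should be handled.

```python
from collections import Counter, defaultdict

# Assumption: lowercased text with the cedilla forms used by the challenge.
STRIP = str.maketrans({"ă": "a", "â": "a", "î": "i", "ş": "s", "ţ": "t"})


def train_unigram(train_words):
    """For every diacritic-free word, remember its most frequent restored form."""
    counts = defaultdict(Counter)
    for word in train_words:
        counts[word.translate(STRIP)][word] += 1
    return {base: forms.most_common(1)[0][0] for base, forms in counts.items()}


def restore(words, model):
    """Replace each word by its most frequent restored form; unseen words stay unchanged."""
    return [model.get(w, w) for w in words]


def word_accuracy(true_words, pred_words):
    """Fraction of positions where the predicted word equals the true word."""
    if len(true_words) != len(pred_words):
        raise ValueError("true and predicted texts must have the same number of words")
    return sum(t == p for t, p in zip(true_words, pred_words)) / len(true_words)


# Toy example: "fata" is ambiguous between "fata" and "faţă".
model = train_unigram("şi faţă şi fata şi faţă".split())
truth = "şi fata".split()
pred = restore("si fata".split(), model)
print(pred, word_accuracy(truth, pred))
```

A context-free baseline like this always picks the same form for an ambiguous word such as "fata", so it gets the toy example half right; resolving such cases requires looking at the surrounding words.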
This section describes the data that's used for the competition. The data can be downloaded from this link. To get the password please contact one of the instructors.
The Diacritics restoration text corpus was created by the Speech and Dialogue Research Laboratory (SpeeD). The corpus comprises plain text with (mostly) correct Romanian diacritics. The texts were mainly collected over the Internet and comprise proceedings of the European Parliament (a subset of the Europarl corpus), theatre plays, electronic books, scientific papers, transcriptions of several Romanian talk-shows and online news. The raw text was preprocessed and normalized using TextCorpusCleaner, a tool created by SpeeD to make diacritics and hyphens uniform, replace URLs, emails and abbreviations with their spoken form, replace numbers (in different formats) with text, handle special characters, lowercase text, and remove text in other languages.
The corpus contains several parts, as listed in the table below:
| Part name | Contents | Diacritics | Nr. words | Usage |
|---|---|---|---|---|
| Europarl | Proceedings of the European Parliament | OK | 5,275,434 | train |
| Miscellaneous | Theatre plays, e-books, scientific papers | OK | 9,773,664 | train |
Please cite the following papers if you use this corpus in your work:
- Philipp Koehn. Europarl: A Parallel Corpus for Statistical Machine Translation, MT Summit 2005.
- Lucian Petrică, Horia Cucu, Andi Buzo, and Corneliu Burileanu, A robust diacritics restoration system using unreliable raw text data, in the Proceedings of the 4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU), St. Petersburg, Russia, 2014, pp. 215-221.