Hello and welcome to the diacritics restoration challenge! The goal of the challenge is to write a program that automatically adds the missing diacritics for a given text in Romanian. For example, given the sentence "Viorica Dancila a fost chemata si se va duce, miercuri dimineata, la Cotroceni, pentru o intalnire fata in fata cu presedintele Klaus Iohannis" we would like the program to output "Viorica Dăncilă a fost chemată și se va duce, miercuri dimineață, la Cotroceni, pentru o întâlnire față în față cu președintele Klaus Iohannis."

You are provided with a training set totalling over 39 million words, and you have to submit results on an evaluation set, named test, containing about 2 million words. For more information regarding the data, please check the Data section below. If you are eager to start right away, here's the link to the data. We ask participants to train their systems only on the data provided, without relying on external sources.
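To make the task concrete, here is a minimal sketch of a simple word-level baseline: for every diacritic-free word form, it remembers the most frequent restored form seen in the training data. This is only an illustration, not the organizers' reference system; the names `train_baseline` and `restore` are hypothetical, and the sketch assumes lowercased, whitespace-tokenized text that uses the comma-below forms of ș and ț.

```python
from collections import Counter, defaultdict

# Maps each Romanian diacritic letter to its base letter (lowercase only,
# on the assumption that the corpus is lowercased).
STRIP = str.maketrans("ăâîșț", "aaist")


def train_baseline(train_words):
    """Learn, for each stripped word, its most frequent restored form."""
    counts = defaultdict(Counter)
    for word in train_words:
        counts[word.translate(STRIP)][word] += 1
    return {stripped: forms.most_common(1)[0][0]
            for stripped, forms in counts.items()}


def restore(words, model):
    """Replace each word with its learned form; unseen words stay unchanged."""
    return [model.get(word, word) for word in words]
```

A baseline of this kind ignores context, so it cannot distinguish between words such as "fata" and "față" when both are valid; it only picks whichever was more common in training.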

Once you register and log in, you can evaluate your algorithm's predictions through the submission page. There you can upload the restored files for the test set; the submission file should have the same number of characters as the evaluation file, but with the characters a, i, s, t replaced by their restored versions where appropriate. The results will shortly be available on the public and private leaderboards.
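Before uploading, it may help to check locally that a restored file satisfies the format described above. The following is a minimal sketch, assuming the evaluation file contains no diacritics and that restored text uses the comma-below forms of ș and ț; the helper names `strip_diacritics` and `is_valid_submission` are hypothetical and not part of the challenge tooling.

```python
# Maps each diacritic letter (lowercase and uppercase) to its base letter.
STRIP = str.maketrans("ăâîșțĂÂÎȘȚ", "aaistAAIST")


def strip_diacritics(text):
    """Remove Romanian diacritics, keeping the text length unchanged."""
    return text.translate(STRIP)


def is_valid_submission(evaluation_text, restored_text):
    """Check that the restored text differs from the evaluation text
    only by diacritics added to a, i, s, t."""
    return (len(evaluation_text) == len(restored_text)
            and strip_diacritics(restored_text) == evaluation_text)
```

For example, `is_valid_submission("fata in fata", "față în față")` holds, while a restored text with an extra or changed character would be rejected.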

You are allowed to make at most ten submissions per day.

The primary metric used for evaluation is word accuracy computed on words that accept diacritics. For convenience, here is a simple Python function that computes this metric:


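def word_accuracy(true_words, pred_words):
    # Both lists are expected to be aligned word by word.
    # A word is counted only if it contains a character that accepts a diacritic.
    is_valid = lambda word: any(c in word for c in 'aăâiîsștț')
    n_correct = sum(t == p for t, p in zip(true_words, pred_words) if is_valid(t))
    n_total = sum(is_valid(t) for t in true_words)
    return n_correct / n_total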

On the private leaderboard, in addition to the previously mentioned metric (word acc. dia.), we show three additional metrics: word accuracy computed on all words (word acc. all.), character accuracy computed on characters that accept diacritics (char acc. dia.), and character accuracy computed on all characters (char acc. all.).
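The three additional metrics can be sketched along the same lines as the word accuracy function. The code below is an approximation under stated assumptions, not the organizers' exact implementation: it assumes the true and predicted texts are aligned character by character and that every character (including whitespace) counts towards the all-characters variant. The function names are hypothetical.

```python
# Characters that accept diacritics, in both base and restored form.
DIACRITIC_CHARS = set("aăâiîsștț")


def char_accuracy(true_text, pred_text, only_diacritic_chars=True):
    """Character accuracy, optionally restricted to characters whose true
    value is one that accepts a diacritic."""
    pairs = [(t, p) for t, p in zip(true_text, pred_text)
             if not only_diacritic_chars or t in DIACRITIC_CHARS]
    return sum(t == p for t, p in pairs) / len(pairs)


def word_accuracy_all(true_words, pred_words):
    """Word accuracy computed over all words, not just those with a, i, s, t."""
    n_correct = sum(t == p for t, p in zip(true_words, pred_words))
    return n_correct / len(true_words)
```

As with the primary metric, these functions divide by the number of counted items, so they raise an error on empty input.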

Data

The data can be downloaded from this link. The corpus consists of transcriptions of talk shows from multiple Romanian television programs. The training set has about 39.7 million words, while the test set has about 2 million words. The raw text was preprocessed and normalized using TextCorpusCleaner, a tool created by the Speech and Dialogue Research Laboratory (SpeeD) to uniformize diacritics and hyphens; replace URLs, emails and abbreviations with their spoken form; replace numbers (in different formats) with text; handle special characters; lowercase the text; and remove text in other languages.

License. We release the data under the Creative Commons BY-NC-ND 3.0 license.

Citation

Please cite the following paper if you use this corpus in your work: