Europarl-direct
Description
In the Europarl corpus, Version 6, as released by P. Koehn (http://www.statmt.org/europarl/), language tags indicate the original source language in which a speaker in the European Parliament uttered a given statement, e.g.
<SPEAKER ID=6 LANGUAGE="IT" NAME="Segni">
Madam President, coinciding with this year's first part-session of the European Parliament, a date has been set, unfortunately for next Thursday, in Texas in America, for the execution of a young 34 year-old man who has been sentenced to death. We shall call him Mr Hicks.
In the corpus as provided, however, such language tags are only sparsely given. For all segments without such a tag, one cannot know what the original language was, nor whether the segment was translated directly into the target language or via a pivot language.
Our directional extractions therefore involve two steps:
- First, we 'disseminated' the correct language tags throughout the full corpus (i.e. a tag may be given in some language files and missing in others, so we first fill in the missing tags across all languages and for all tags).
- Based on the presence of these tags, we extract all the segments for a given language pair (i.e. for EN --> FR, we search for all segments containing LANGUAGE="EN" in the English files of the corpus, and we get all the corresponding segments from the French files of the corpus, so that we end up with segments for which we know that the original source language of a statement was English and that it was translated directly into French).
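The two steps above can be sketched in Python as follows. This is only an illustration, not the scripts used to build the released corpora. It assumes that a speaker turn can be matched across language versions of the same daily file by its chapter ID and SPEAKER ID, and the function names (parse_turns, disseminate_tags, extract_direction) are hypothetical.

```python
import re

ID_RE = re.compile(r'\bID="?(\d+)')
LANG_RE = re.compile(r'\bLANGUAGE="([A-Za-z]+)"')


def parse_turns(lines):
    """Split one daily file into speaker turns with a key, a language and text."""
    turns, chapter, current = [], None, None
    for raw in lines:
        line = raw.strip()
        if line.startswith("<CHAPTER"):
            m = ID_RE.search(line)
            chapter = m.group(1) if m else None
            current = None
        elif line.startswith("<SPEAKER"):
            m_id = ID_RE.search(line)
            m_lang = LANG_RE.search(line)
            current = {
                # Assumption: (chapter ID, speaker ID) identifies the same
                # turn in every language version of this day.
                "key": (chapter, m_id.group(1) if m_id else None),
                "lang": m_lang.group(1) if m_lang else None,
                "text": [],
            }
            turns.append(current)
        elif line and not line.startswith("<") and current is not None:
            current["text"].append(line)
    return turns


def disseminate_tags(versions):
    """Step 1: copy a LANGUAGE tag found in any version to all versions.

    `versions` maps a language code to the parsed turns of the same day.
    """
    known = {}
    for turns in versions.values():
        for turn in turns:
            if turn["lang"]:
                known.setdefault(turn["key"], turn["lang"])
    for turns in versions.values():
        for turn in turns:
            if turn["lang"] is None:
                turn["lang"] = known.get(turn["key"])


def extract_direction(src_turns, tgt_turns, src_lang):
    """Step 2: keep turns originally uttered in `src_lang`, paired with the target."""
    tgt_by_key = {turn["key"]: turn for turn in tgt_turns}
    pairs = []
    for turn in src_turns:
        if turn["lang"] == src_lang and turn["key"] in tgt_by_key:
            pairs.append((turn["text"], tgt_by_key[turn["key"]]["text"]))
    return pairs
```

For EN --> FR, one would parse the English and French versions of a day, call disseminate_tags on both, and then call extract_direction(english_turns, french_turns, "EN"). Turns whose original language remains unknown after dissemination are left out.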
Such directional corpora are a valuable resource for linguistic studies that want to account for translation variability and universals such as the explicitation hypothesis (see e.g. Cartoni et al., 2011). The datasets are also worthwhile for Machine Translation. Ozdowska (2009), for example, has shown that SMT systems trained on directional corpora might outperform ones trained on the original, parallel-only corpus.
Citation
If you use Europarl and our directional extractions in your research, please cite the following two papers:
@InProceedings{Koehn-Europarl-2005,
Author = {Koehn, Philipp},
Title = {Europarl: A Parallel Corpus for Statistical Machine Translation},
BookTitle = {Proceedings of MT Summit X},
address = {Phuket, Thailand},
Pages = {79--86},
year = {2005}
}
@InProceedings{Cartoni-Directional-2012,
Author = {Cartoni, Bruno and Meyer, Thomas},
Title = {Extracting Directional and Comparable Corpora from a Multilingual Corpus for Translation Studies},
BookTitle = {Proceedings 8th International Conference on Language Resources and Evaluation (LREC)},
address = {Istanbul, Turkey},
year = {2012}
}
Structure
In this archive, we provide the following directional corpus extractions based on Europarl Version 6.
EN_to_FR
FR_to_EN
EN_to_IT
IT_to_EN
FR_to_IT
IT_to_FR
EN_to_ES
ES_to_EN
FR_to_ES
ES_to_FR
EN_to_DE
DE_to_EN
EN_to_CZ
CZ_to_EN
Each language direction has its own directory, containing two sub-directories:
XX_source
YY_target
Each sub-directory contains the daily text files, with the file names preserved from the original Europarl distribution. The file ending was changed from .txt to .ctags.out (corrected tags). The files, and hence the corpora overall, are smaller than the originals because they contain only the directly translated statements and their corresponding source statements.
For statistics, please see the LREC paper cited above.
Usage
To use the directional corpora for training Statistical Machine Translation systems, for example, you may want to:
- concatenate (cat) the daily text files into one large text file each for the source and the target language
- delete the tags (all lines starting with <)
- use the script 'clean-corpus-n.perl' distributed with the Moses SMT toolkit (Koehn et al., 2007) to filter out special characters, empty lines and sentences that are too long (cf. http://www.statmt.org/moses_steps.html)
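The first two preprocessing steps listed under Usage (concatenating the daily files and deleting the tag lines) could be sketched in Python as below. The function name concat_without_tags and the paths in the usage note are hypothetical, and UTF-8 encoding of the files is an assumption. The cleaning step with 'clean-corpus-n.perl' is not covered here.

```python
from pathlib import Path


def concat_without_tags(directory, out_path, suffix=".ctags.out"):
    """Concatenate all daily files in `directory`, dropping lines that start with '<'.

    Files are read in sorted file-name order. Because the file names are the
    same on the source and the target side, using the same order for both
    keeps the two output files in the same day-by-day sequence.
    Returns the number of lines written.
    """
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(directory).glob(f"*{suffix}")):
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    if line.startswith("<"):
                        continue  # tag line, e.g. <SPEAKER ...> or <P>
                    # Make sure a missing final newline does not merge two files.
                    out.write(line if line.endswith("\n") else line + "\n")
                    kept += 1
    return kept
```

A possible use is to call the function once per side, e.g. concat_without_tags("EN_to_FR/EN_source", "corpus.en") and concat_without_tags("EN_to_FR/FR_target", "corpus.fr"), and to compare the two returned line counts before running any further cleaning. A mismatch would suggest that the two sides are not line-aligned.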