Europarl Release v2 -- Dec 4, 2003
==================================

This is a parallel corpus that was extracted from the European
Parliament web site by Philipp Koehn (USC/ISI). It is fairly big,
25-30 million words per language pair, and its main intended use
is to aid statistical machine translation research.

More information can be found at http://www.isi.edu/~koehn/europarl/

The main differences between this release and the first release in
2002 are that it is larger and that it comes with a sentence aligner,
which allows the creation of parallel corpora between any two of the
11 languages.

Sentence aligner
----------------

You can create any parallel corpus with the command

  ./sentence-align-corpus.perl L1 L2

where L1 and L2 can be any of the 11 languages

  da de el en es fi fr it nl pt sv

The output is stored in the aligned/ directory.

NOTE: To use this corpus with tools like Giza++, you want to
- lowercase the text (recommended)
- strip empty lines and their correspondences (recommended)
- remove lines with XML tags (starting with "<") (required)

The sentence aligner uses the preprocess.perl script, which does
tokenization and sentence splitting. You may want to use your own
preprocessor. This requires changing an obvious line in the sentence
aligner code.

Creating a parallel corpus takes about half an hour on a 2GHz Linux
machine.

Source
------

http://www3.europarl.eu.int/omk/omnsapir.so/calendar?APP=CRE&LANGUE=EN

Copyright in the Europarl service (c) European Communities
Except where otherwise indicated, reproduction is authorised,
provided that the source is acknowledged.

Change Log
----------

Preprocessing is improved.
This release also covers 1/2002 - 9/2003.
Includes sentence aligner.
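
Example: preparing aligned text for Giza++
------------------------------------------

The three preparation steps listed in the NOTE of the sentence aligner
section (lowercasing, dropping empty lines together with their
counterparts, and removing lines that start with "<") could be sketched
as follows. This is only an illustration and not part of the release:
the function name clean_pairs is hypothetical, and the sketch assumes
that the two sides of the aligned output have already been read into
two sequences with one line per entry, where line i of one side
corresponds to line i of the other.

```python
from typing import Iterable, Iterator, Tuple


def clean_pairs(src: Iterable[str], tgt: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield cleaned (source, target) line pairs.

    Hypothetical helper, not shipped with the corpus. Assumes src and
    tgt are line-aligned: the i-th line of src matches the i-th of tgt.
    """
    for s, t in zip(src, tgt):
        s, t = s.strip(), t.strip()
        # Required: remove lines with XML tags (starting with "<").
        # The pair is dropped if either side is a tag line, so the
        # two sides stay in step.
        if s.startswith("<") or t.startswith("<"):
            continue
        # Recommended: strip empty lines and their correspondences.
        if not s or not t:
            continue
        # Recommended: lowercase the text.
        yield s.lower(), t.lower()


if __name__ == "__main__":
    english = ["<CHAPTER ID=1>", "Resumption of the session", "", "Thank you"]
    german = ["<CHAPTER ID=1>", "Wiederaufnahme der Sitzung", "Leer", "Danke"]
    for en, de in clean_pairs(english, german):
        print(en, "|||", de)
```

Filtering the two sides as pairs, rather than each file on its own, is
what keeps the line correspondence intact after lines are removed.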