Assignment I: Ling 522
(due 9/11/08)
Introduction
The data for the assignment consists of the
following files on bulba.
- /opt/corpora/mt/wmt07/devtest/devtest2006.de
- /opt/corpora/mt/wmt07/devtest/devtest2006.en
- /opt/corpora/mt/wmt07/devtest/devtest2006.es
- /opt/corpora/mt/wmt07/devtest/devtest2006.fr
- /opt/corpora/mt/wmt07/devtest/devtest2006-ref.de.sgm
- /opt/corpora/mt/wmt07/devtest/devtest2006-ref.en.sgm
- /opt/corpora/mt/wmt07/devtest/devtest2006-ref.es.sgm
- /opt/corpora/mt/wmt07/devtest/devtest2006-ref.fr.sgm
- /opt/corpora/mt/wmt07/devtest/devtest2006-src.de.sgm
- /opt/corpora/mt/wmt07/devtest/devtest2006-src.en.sgm
- /opt/corpora/mt/wmt07/devtest/devtest2006-src.es.sgm
- /opt/corpora/mt/wmt07/devtest/devtest2006-src.fr.sgm
The file endings/extensions have the following meanings:
- ".de,.fr,.es": German, French, Spanish source
- "-src.{de,fr,es}.sgm": SGML version of the German, French, and Spanish
source.
- "-ref.{de,fr,es}.sgm": SGML-annotated reference (gold-standard) translations into English for the German, French, and Spanish source.
Line n of each ".de" file should roughly correspond to line n of
each "ref.de" file and to the line
annotated "seg id=n" in each "src.de" file. However, there are
some errors, for technical reasons.
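The "seg id=n" annotation mentioned above can be looked up directly. The following is a minimal sketch, not part of the assignment. It assumes each segment is written as <seg id=n> ... </seg> (with or without quotes around n) and that ids are unique within a file. The names SEG_PATTERN and segments_by_id are hypothetical.

```python
import re

# Assumed segment markup: <seg id=7> text </seg> or <seg id="7"> text </seg>.
SEG_PATTERN = re.compile(r'<seg\s+id="?(\d+)"?\s*>(.*?)</seg>',
                         re.IGNORECASE | re.DOTALL)


def segments_by_id(sgml_text):
    """Map each segment id (as an int) to its text, stripped of outer whitespace."""
    return {int(num): body.strip() for num, body in SEG_PATTERN.findall(sgml_text)}
```

If the ids turn out to repeat within a file, later segments would overwrite earlier ones in this mapping, so check that first.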
You will probably find the following Unix command
helpful:
more +235 /opt/corpora/mt/wmt07/devtest/devtest2006.fr
This displays the file one screenful at a time, starting with line 235 at the very top of your terminal
screen.
So for example, the following commands
- more +7 /opt/corpora/mt/wmt07/devtest/devtest2006.fr
- more +7 /opt/corpora/mt/wmt07/devtest/devtest2006.de
- more +7 /opt/corpora/mt/wmt07/devtest/devtest2006.es
display 3 translations of the same utterance as the top line of
the window:
- Nous sommes beaucoup à vouloir une fédération d'États nations.
- Viele von uns streben eine Föderation von Nationalstaaten an. Dies impliziert auch, dass jeder seinen richtigen Platz findet.
- Somos muchos los que queremos una federación de Estados-nación.
And the corresponding English is given by:
more +7 devtest2006.en
which shows:
There are many of us who want a federation of nation states,
which means that each state must find the position that best suits it.
Notice that the German and English contain extra material that the French and Spanish lack. This happens.
It means there is a sentence alignment problem.
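To compare the same line number across several files without running "more" once per file, something like the following could be used. This is a minimal sketch under stated assumptions: the helper names line_at and show_aligned are hypothetical, and the default encoding is a guess that may need changing (for example to "latin-1") if accented characters look wrong.

```python
from itertools import islice


def line_at(path, n, encoding="utf-8"):
    """Return line n (1-indexed) of the file, or None if the file has fewer lines."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return next(islice(handle, n - 1, n), None)


def show_aligned(paths, n, encoding="utf-8"):
    """Return (path, text) pairs holding line n of each file, newline removed."""
    rows = []
    for path in paths:
        text = line_at(path, n, encoding)
        rows.append((path, None if text is None else text.rstrip("\n")))
    return rows
```

For example, calling show_aligned on the four devtest2006 files (.fr, .de, .es, .en) with n set to 7 would collect the four sentences discussed above, so that the extra clause in the German and English can be seen side by side.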
Your tasks
- Revise the sentence alignments to be correct for example #7
shown above.
Notice, you may NOT change any of the sentences, and
you may not place something
less than a sentence on a single line.
- Pick a language (not English)
and 5 sentences that are NOT consecutive, at least
3 of which are not in the first 500 lines of the files,
at least 3 of which are over 20 words long. And don't
use example #7. Evaluate the output of an MT system on
those 5 sentences. The system output has been posted on
the web:
MT system output
When you are logged in to bulba, the file can be found at:
/var/www/html/ling582/devtest2006.output
To see the MT system output for the example discussed above do:
more +7 /var/www/html/ling582/devtest2006.output
which gives:
we are very much to want a federation of nation states .
Evaluate as follows:
- Separate fidelity and fluency scores, scale 1 to 5.
5 should be reserved for something that is indistinguishable
in fluency or fidelity from the output of
a skilled translator.
- Pick 2 phrases in the source language from each of 2 separate examples
(4 phrases altogether).
Evaluate the fidelity and fluency of the translation of just those
phrases.
- Write a few paragraphs justifying your scores.
- By examining the MT system output, try to guess what the input
language was for the MT system (French, Spanish, or German). Give your
reasons.
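The constraints on choosing the 5 sentences in the second task can be checked mechanically before you start scoring. The sketch below is only an illustration, with two assumptions that are not stated in the assignment: "not consecutive" is read as "no two chosen line numbers are adjacent", and words are counted by splitting on whitespace. The function name check_selection is hypothetical.

```python
def check_selection(line_numbers, sentences):
    """Return the list of violated constraints (empty if the selection is fine).

    line_numbers: the chosen 1-indexed line numbers.
    sentences: the source sentences at those lines.
    """
    problems = []
    ordered = sorted(line_numbers)
    if len(set(ordered)) != 5:
        problems.append("need exactly 5 distinct sentences")
    if 7 in ordered:
        problems.append("example #7 may not be used")
    # Assumption: "not consecutive" means no two line numbers differ by 1.
    if any(b - a == 1 for a, b in zip(ordered, ordered[1:])):
        problems.append("sentences must not be consecutive")
    if sum(n > 500 for n in ordered) < 3:
        problems.append("at least 3 sentences must come after line 500")
    # Assumption: a word is any whitespace-separated token.
    if sum(len(s.split()) > 20 for s in sentences) < 3:
        problems.append("at least 3 sentences must be over 20 words long")
    return problems
```

An empty result means the selection satisfies the constraints as interpreted here. It says nothing about the quality of the evaluation itself.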