Assignment One

Assignment I: Ling 522

(due 9/11/08)

Introduction

The data for the assignment consists of the following files on bulba.

  1. /opt/corpora/mt/wmt07/devtest/devtest2006.de
  2. /opt/corpora/mt/wmt07/devtest/devtest2006.en
  3. /opt/corpora/mt/wmt07/devtest/devtest2006.es
  4. /opt/corpora/mt/wmt07/devtest/devtest2006.fr
  5. /opt/corpora/mt/wmt07/devtest/devtest2006-ref.de.sgm
  6. /opt/corpora/mt/wmt07/devtest/devtest2006-ref.en.sgm
  7. /opt/corpora/mt/wmt07/devtest/devtest2006-ref.es.sgm
  8. /opt/corpora/mt/wmt07/devtest/devtest2006-ref.fr.sgm
  9. /opt/corpora/mt/wmt07/devtest/devtest2006-src.de.sgm
  10. /opt/corpora/mt/wmt07/devtest/devtest2006-src.en.sgm
  11. /opt/corpora/mt/wmt07/devtest/devtest2006-src.es.sgm
  12. /opt/corpora/mt/wmt07/devtest/devtest2006-src.fr.sgm

The file endings/extensions have the following meanings:

  1. ".de", ".fr", ".es": German, French, Spanish source
  2. "-src.{de,fr,es}.sgm": SGML version of German, French, Spanish source.
  3. "-ref.{de,fr,es}.sgm": SGML annotated reference (gold-standard) translations into English for German, French, Spanish source.

Line n of each ".de" file should roughly correspond to line n of each "ref.de" file and to the line annotated "seg id=n" in each "src.de" file. For technical reasons, however, there are some errors.
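If you want to look up a segment by its id rather than by scrolling, a minimal Python sketch follows. It assumes each segment sits on a single line in the form <seg id=n> text </seg>, with or without quotes around n; that layout is an assumption, so check a real file first. The names SEG_RE and segments_by_id are made up for illustration.

```python
import re

# Assumption: one segment per line, written as <seg id=N> text </seg>,
# where N may or may not be quoted. Verify against the actual .sgm files.
SEG_RE = re.compile(r'<seg\s+id="?(\d+)"?\s*>(.*?)</seg>', re.IGNORECASE)


def segments_by_id(sgm_lines):
    """Map each segment id to its text, skipping lines without a <seg> element."""
    segments = {}
    for line in sgm_lines:
        match = SEG_RE.search(line)
        if match:
            segments[int(match.group(1))] = match.group(2).strip()
    return segments
```

With the result in hand, segments_by_id(lines).get(7) returns the text annotated "seg id=7", or None if no such segment was found.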

You will probably find the following Unix command helpful:

    more +235 /opt/corpora/mt/wmt07/devtest/devtest2006.fr
This displays the file one window at a time, starting with line 235 at the very top of your terminal screen.
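If you would rather pull out a single line from a script, the same lookup can be sketched in Python. This is only an illustration: line_at is a hypothetical helper name, and the file encoding in the commented usage is an assumption you should check.

```python
from itertools import islice


def line_at(lines, n):
    """Return line n (1-based, n >= 1) without its newline, or None if there is no such line."""
    line = next(islice(lines, n - 1, n), None)
    return None if line is None else line.rstrip("\n")


# Hypothetical usage on bulba (the encoding is an assumption):
# with open("/opt/corpora/mt/wmt07/devtest/devtest2006.fr", encoding="utf-8") as f:
#     print(line_at(f, 235))
```

Because islice stops as soon as line n has been read, the rest of the file is never loaded.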

So for example, the following commands

  1. more +7 /opt/corpora/mt/wmt07/devtest/devtest2006.fr
  2. more +7 /opt/corpora/mt/wmt07/devtest/devtest2006.de
  3. more +7 /opt/corpora/mt/wmt07/devtest/devtest2006.es
display three translations of the same utterance as the top line of the window:
  1. Nous sommes beaucoup à vouloir une fédération d'États nations.
  2. Viele von uns streben eine Föderation von Nationalstaaten an. Dies impliziert auch, dass jeder seinen richtigen Platz findet.
  3. Somos muchos los que queremos una federación de Estados-nación.
And the corresponding English is given by:
    more +7 devtest2006.en
which shows:
    There are many of us who want a federation of nation states, which means that each state must find the position that best suits it.
Notice that the German and the English contain extra material that the French and Spanish lack. This happens. It means there is a sentence alignment problem.
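One rough way to spot lines like this is to compare word counts across the versions of the same line. The sketch below is a crude heuristic, not a sentence aligner: the function names and the 1.5 threshold are arbitrary assumptions, and languages legitimately differ in length, so treat a flag only as a hint to look at the line yourself.

```python
def word_count(sentence):
    """Count whitespace-separated tokens."""
    return len(sentence.split())


def length_outliers(versions, max_ratio=1.5):
    """Return the labels whose word count exceeds max_ratio times the shortest version's."""
    counts = {label: word_count(text) for label, text in versions.items()}
    shortest = min(counts.values())
    return [label for label, count in counts.items() if count > max_ratio * shortest]


# The four versions of line 7 quoted above.
line7 = {
    "fr": "Nous sommes beaucoup à vouloir une fédération d'États nations.",
    "de": "Viele von uns streben eine Föderation von Nationalstaaten an. "
          "Dies impliziert auch, dass jeder seinen richtigen Platz findet.",
    "es": "Somos muchos los que queremos una federación de Estados-nación.",
    "en": "There are many of us who want a federation of nation states, "
          "which means that each state must find the position that best suits it.",
}
print(length_outliers(line7))  # ['de', 'en']
```

Here the French and Spanish lines have 9 words each, while the German has 18 and the English 25, so both are flagged.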

Your tasks

  1. Revise the sentence alignments to be correct for example #7 shown above. Note that you may NOT change any of the sentences, and you may not place anything less than a sentence on a single line.
  2. Pick a language (not English) and 5 sentences that are NOT consecutive, at least 3 of which are not in the first 500 lines of the files and at least 3 of which are over 20 words long. Do not use example #7. Evaluate the output of an MT system on those 5 sentences. The system output has been posted on the web:
      MT system output
    When you are logged in to bulba, the file can be found at:
      /var/www/html/ling582/devtest2006.output
    To see the MT system output for the example discussed above do:
      more +7 /var/www/html/ling582/devtest2006.output
    which gives:
      we are very much to want a federation of nation states .
    Evaluate as follows:
    1. Give separate fidelity and fluency scores on a scale of 1 to 5. A 5 should be reserved for output that is indistinguishable in fluency or fidelity from the output of a skilled translator.
    2. Pick 2 phrases in the source language in each of 2 separate examples (4 phrases altogether). Evaluate the fidelity and fluency of translating just those phrases.
    3. Write a few paragraphs justifying your scores.
  3. By examining the MT system output, try to guess what the input language was for the MT system (French, Spanish, or German). Give your reasons.
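
Before you start scoring, you may want to check that your 5 sentences satisfy the constraints in task 2. The sketch below is one possible check, under two assumptions that the task wording leaves open: "not consecutive" is read as "no two chosen lines are adjacent", and the word counts are whitespace-separated tokens in the source language you picked. The function name selection_problems is made up for illustration.

```python
def selection_problems(line_numbers, word_counts, excluded=(7,)):
    """Return a list of problems with a proposed choice of sentences (empty if none).

    line_numbers: the 1-based line numbers you picked.
    word_counts: dict mapping each picked line number to its word count.
    """
    problems = []
    chosen = sorted(line_numbers)
    if len(set(chosen)) != 5:
        problems.append("need exactly 5 distinct sentences")
    # Assumption: "not consecutive" means no two picked lines are adjacent.
    if any(b - a == 1 for a, b in zip(chosen, chosen[1:])):
        problems.append("some sentences are consecutive")
    if sum(n > 500 for n in chosen) < 3:
        problems.append("fewer than 3 sentences beyond line 500")
    if sum(word_counts[n] > 20 for n in chosen) < 3:
        problems.append("fewer than 3 sentences over 20 words")
    if any(n in excluded for n in chosen):
        problems.append("example #7 is not allowed")
    return problems
```

For example, selection_problems([12, 510, 730, 901, 44], {12: 25, 510: 30, 730: 21, 901: 8, 44: 10}) returns an empty list, since every constraint is met.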