Linguistics 581

Ngram Exercises

    Exercise: Preliminaries

    ssh to bulba

    Or just sit down at some machine in the lab.

    Log on.

    Next cd to:

      /home/ling581/hmm_tagger/data

    Henceforth this is called the data directory.

    General picture:

    1. The data directory has the data files you will be using for this assignment, in particular:
      1. train.txt
      2. saint_mark
    2. Make sure you can find these. You are welcome to copy these files to your local machine. They are public domain.
    3. Make sure you can access the commands ngram-count and ngram which have been installed on bulba. They are critical to the assignment.
      [gawron@bulba data]$ ngram
      -bash: ngram: command not found
      
      means you are not accessing the command ngram, whereas
      [gawron@ginger big_data]$ ngram
      need at least an -lm file specified
      
      means you are. Test ngram-count as well. You should also be able to read man (for manual) entries for these commands.

      This is done as follows:

      [gawron@ginger big_data]$ man ngram
      [Lots of information appears]
      
      This is good.
      [gawron@ginger big_data]$ man ngram
      No manual entry for ngram
      
      means there is something wrong.
    4. You have read permission in the data directory but you do NOT have write permission. When you execute code in this assignment you may find it convenient to work from within the data directory, but in that case you will have to make sure any files you create are created in your home directory by specifying full pathnames like '/home/gawron/compling/ngram_assignment'.
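
    The checks in items 3 and 4 can also be scripted. The following is a minimal Python sketch, not part of the assignment: it looks for the two commands on your PATH and builds a full pathname under your home directory. The function names and the folder name ngram_assignment are illustrative assumptions.

```python
import os
import shutil


def missing_commands(commands=("ngram-count", "ngram")):
    """Return the commands from `commands` that are not found on PATH."""
    return [c for c in commands if shutil.which(c) is None]


def home_output_path(filename, subdir="ngram_assignment"):
    """Build a full pathname under the home directory for an output file.

    `subdir` is a hypothetical folder name; use whatever you like.
    """
    return os.path.join(os.path.expanduser("~"), subdir, filename)


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("ngram-count and ngram are both accessible.")
    print("Example output path:", home_output_path("train.lm"))
```

    If a command is reported as missing, the shell would give the "command not found" message shown in item 3.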
    Exercise: Main

    Part One

    You create ngram models with the command ngram-count. You use and evaluate them with the command ngram.

    First create an ngram model

    1. Read about the command ngram-count using man as described in the preliminaries.
    2. Learn how to
      1. Create an ngram model of up to order n (we will be going up to 3, that is, up to trigrams).
      2. Map all words to their lower case forms so as not to distinguish "He" from "he" in our model.
      3. Write an ngram counts output file (this is a text file with ngram counts, distinct from the actual language model ("lm") discussed next).
      4. Write a language model file.
      5. Specify a training corpus.

    Then create a language model (in your home directory) using

    /home/ling581/hmm_tagger/data/train.txt
    
    as your training corpus. Map all words to their lower case forms and write an ngram counts output file (to your home directory).
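
    One way to keep the training step reproducible is to assemble the command line in a small Python script. This is a sketch under assumptions: the option names below (-order, -tolower, -text, -write, -lm) are the ones the ngram-count man page is expected to list for the five tasks above, so confirm each against man before relying on it. The output file names train.counts and train.lm are hypothetical.

```python
import os
import subprocess

TRAIN = "/home/ling581/hmm_tagger/data/train.txt"


def ngram_count_args(text, counts_out, lm_out, order=3):
    """Assemble an ngram-count command line as an argument list."""
    return [
        "ngram-count",
        "-order", str(order),   # highest ngram order to count (3 = trigrams)
        "-tolower",             # map all words to lower case
        "-text", text,          # training corpus
        "-write", counts_out,   # ngram counts output file
        "-lm", lm_out,          # language model file
    ]


if __name__ == "__main__":
    home = os.path.expanduser("~")
    args = ngram_count_args(
        TRAIN,
        os.path.join(home, "train.counts"),  # hypothetical file name
        os.path.join(home, "train.lm"),      # hypothetical file name
    )
    print(" ".join(args))
    # Uncomment on a machine where ngram-count is installed:
    # subprocess.run(args, check=True)
```

    Both output files are written under the home directory, since the data directory is read-only.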

    You're done with Part One and there is nothing to hand in! But you can't complete Part Two (which does have something to hand in) without having done Part One.

    Part Two

    Now you will TEST your language model using the ngram command. Read about the ngram command using man and learn how to

    1. Specify an lm file (you will of course be using the one you created in Part One).
    2. Specify what order of ngram (1, 2, or 3) to test on. If you trained a model of order 3, you can test it as a unigram, bigram, or trigram model just by specifying the order to ngram.
    3. Map words to lower case.
    4. Specify a test file.
    5. Skip Out of Vocabulary items (OOVs) in the test.
    6. Generate random sentences from your ngram model.
    7. Run at higher debug levels and get more information from your test.

    Next you will test the language model you created in Part One. Test it three ways: run it as a trigram model, as a bigram model, and as a unigram model. The trigram model should get the most information from the training corpus, and should therefore have the lowest perplexity score on the test corpus. See if this is true by running the three tests using

    /home/ling581/hmm_tagger/data/test.txt
    
    as your test corpus.
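
    The three test runs differ only in the order passed to ngram, so they can be generated in a loop. The sketch below is an illustration under assumptions: the option names (-lm, -order, -tolower, -ppl, -debug, -skipoovs, -gen) should be checked against the ngram man page, and the lm path train.lm is hypothetical.

```python
TEST = "/home/ling581/hmm_tagger/data/test.txt"


def ngram_test_args(lm, test_file, order=3, debug=0, skip_oovs=False):
    """Assemble an ngram command line that scores `test_file` with `lm`."""
    args = ["ngram", "-lm", lm, "-order", str(order),
            "-tolower", "-ppl", test_file]
    if debug:
        args += ["-debug", str(debug)]   # higher levels print more detail
    if skip_oovs:
        args.append("-skipoovs")         # assumption: verify meaning in man
    return args


def ngram_gen_args(lm, n_sentences=3, order=3):
    """Assemble an ngram command line that generates random sentences."""
    return ["ngram", "-lm", lm, "-order", str(order), "-gen", str(n_sentences)]


if __name__ == "__main__":
    lm = "train.lm"  # hypothetical path to the model from Part One
    for order in (3, 2, 1):
        print(" ".join(ngram_test_args(lm, TEST, order=order)))
    print(" ".join(ngram_gen_args(lm)))
```

    The script only prints the command lines; run them in the shell, or pass each list to subprocess.run on a machine where ngram is installed.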

    Hand in

    1. Your three perplexity scores (ppl, not ppl1, in the output you get from ngram). Be sure to tell me which is which.
    2. Three sentences randomly generated from your language model using the ngram command.
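
    To see how ppl and ppl1 differ, it may help to recompute them from the other numbers ngram reports. The sketch below follows the definitions given in the SRILM documentation as understood here: both are 10 raised to the negative average log probability, but ppl also counts one end-of-sentence token per sentence in the denominator. Treat the formulas as an assumption to verify against your own output.

```python
def perplexities(logprob, words, oovs, sentences, zeroprobs=0):
    """Return (ppl, ppl1) from the counts and base-10 logprob ngram prints."""
    scored = words - oovs - zeroprobs            # tokens that received a probability
    ppl = 10 ** (-logprob / (scored + sentences))  # counts end-of-sentence tokens
    ppl1 = 10 ** (-logprob / scored)               # words only
    return ppl, ppl1


def oov_percentage(words, oovs):
    """Percentage of test-corpus words that are out of vocabulary."""
    return 100.0 * oovs / words


if __name__ == "__main__":
    # Made-up numbers purely for illustration, not real results.
    ppl, ppl1 = perplexities(logprob=-6.0, words=2, oovs=0, sentences=1)
    print(ppl, ppl1)
    print(oov_percentage(words=200, oovs=5))
```

    Because OOVs are removed from the denominator along with their probability mass, the OOV count affects how the score should be read.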

    Answer these questions:

    1. What percentage of the test corpus is OOVs? Repeat your trigram perplexity test ignoring OOVs. Is the perplexity lowered or raised? Why?
    2. Run at debug level 1 and redirect the output to a file using Unix ">". This outputs sentence-by-sentence perplexity measures. Find the highest perplexity sentence in the corpus. [It would not be hard to write a Python script to do this. The "re" module, which provides regular expression matching, might be helpful.] Check to see if length is a factor in determining the perplexity of a sentence. If so, why; if not, why not?
    3. Run ngram using the training corpus from Part One as your test file:
      /home/ling581/hmm_tagger/data/train.txt
      
      What happens to the perplexity score? What happens to the OOVs? Why?
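
    A script of the kind suggested in question 2 might look like the sketch below. It assumes that at debug level 1 each sentence is printed on its own line, followed by a line of the form "1 sentences, N words, M OOVs" and then a line containing "ppl= VALUE", with a final summary that begins with "file". Check a few records of your own output file before trusting it; the function name is illustrative.

```python
import re
import sys

COUNTS = re.compile(r"^(\d+) sentences, (\d+) words, (\d+) OOVs")
PPL = re.compile(r"ppl= (\S+)")


def sentence_perplexities(lines):
    """Yield (ppl, n_words, sentence) for each per-sentence record."""
    prev, pending = "", None
    for raw in lines:
        line = raw.strip()
        counts = COUNTS.match(line)
        if counts:
            # The line just before the counts line is the sentence itself.
            pending = (prev, int(counts.group(2)))
        elif pending is not None:
            found = PPL.search(line)
            if found:
                try:
                    yield (float(found.group(1)), pending[1], pending[0])
                except ValueError:
                    pass  # skip non-numeric scores such as "undefined"
            pending = None
        prev = line


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        records = list(sentence_perplexities(handle))
    ppl, n_words, sentence = max(records)
    print(f"highest ppl {ppl} ({n_words} words): {sentence}")
```

    Each record also carries the word count, so sorting the records by length and comparing perplexities gives a quick look at whether length is a factor.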

    Part Three

    Now run the language model you built in Part One on a new test file:

    /home/ling581/hmm_tagger/data/saint_mark
    
    This is the gospel of Saint Mark. What happens to the log prob and the perplexity scores as compared to your first test in Part Two? Explain why.

    Train a NEW trigram language model using saint_mark as your training file. This is very little data so we will not test this model on any other corpora. But we will run the model on saint_mark itself. Run at both the default debug level and at debug level 1.

    Hand in

    1. The perplexity score obtained when a model trained on the Gospel of Saint Mark is tested on the Gospel of Saint Mark itself.
    2. A list of the 10 highest perplexity sentences from the gospel. Arguably these sentences are the most unusual in the gospel, at least in terms of the word sequences that appear elsewhere in it.
    3. There is a puzzle about the number one highest perplexity line. It should be highly unusual but in fact there are a number of other very similar lines. What are the similar lines? What distinguishes this one?
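
    Once you have a perplexity for each sentence, items 2 and 3 reduce to ranking and comparing lines. The sketch below is one possible approach using only the Python standard library; it assumes you already have (perplexity, sentence) pairs, and the helper names and the similarity cutoff are illustrative choices.

```python
import difflib
import heapq


def top_k(records, k=10):
    """Return the k (ppl, sentence) records with the highest perplexity."""
    return heapq.nlargest(k, records, key=lambda r: r[0])


def similar_lines(target, lines, n=5, cutoff=0.6):
    """Return up to n lines that closely resemble `target`, best match first."""
    candidates = [line for line in lines if line != target]
    return difflib.get_close_matches(target, candidates, n=n, cutoff=cutoff)


if __name__ == "__main__":
    # Made-up records purely for illustration, not real results.
    records = [(5.0, "first line"), (9.0, "second line"), (1.0, "third")]
    for ppl, sentence in top_k(records, k=2):
        print(ppl, sentence)
    print(similar_lines("second line", [s for _, s in records]))
```

    The similarity measure here is character based, so it only points you at candidate lines; deciding what distinguishes the top line is still up to you.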
    Wall Street Journal

    A smoothing exercise with a real corpus.