Exercise:
Part One
You create ngram models with the command ngram-count.
You use and evaluate them with the command ngram.
First create an ngram model
- Read about the command ngram-count using man as described
in the preliminaries.
- Learn how to
  - Create an ngram model of up to order n (we will be going up to 3,
    that is, up to trigrams).
  - Map all words to their lower case forms so as not to distinguish "He"
    from "he" in our model.
  - Write an ngram counts output file (this is a text file with ngram counts,
    distinct from the actual language model ("lm") discussed next).
  - Write a language model file.
  - Specify a training corpus.
Then create a language model (in your home directory) using
/home/ling581/hmm_tagger/data/train.txt
as your training corpus. Map all words to their lower case forms
and write an ngram counts output file (to your home directory).
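If it helps to see what an ngram counts file records, here is a minimal Python sketch of the idea. It is not what ngram-count actually does internally. The function name count_ngrams, the toy corpus, and the use of <s> and </s> as sentence boundary markers are all assumptions made for illustration.

```python
from collections import Counter

def count_ngrams(lines, max_order=3):
    """Count every n-gram of order 1..max_order, lowercasing each token."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # "He" and "he" become the same word
        if not tokens:
            continue
        # Assumed convention: mark sentence boundaries with <s> and </s>.
        tokens = ["<s>"] + tokens + ["</s>"]
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical two-sentence corpus, not taken from train.txt.
toy_corpus = ["He saw the dog", "he saw the cat"]
counts = count_ngrams(toy_corpus)
print(counts[("he",)])               # 2, because "He" was lowercased
print(counts[("he", "saw", "the")])  # 2
```

The counts are only raw frequencies. The language model file is a separate thing: it stores smoothed probabilities estimated from counts like these.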
You're done with Part One and there is nothing to hand in!
But you can't complete Part Two (which does have something to
hand in) without having done Part One.
Part Two
Now you will TEST your language model using the ngram command.
Read about the ngram command using man and learn how to
- Specify an lm file (you will of course be using the one you created in Part One).
- Specify what order of ngram (1, 2, or 3) you should test on. If
you trained a model of order 3 you can test it as a unigram,
bigram, or trigram model just by specifying the order to ngram.
- Map words to lower case.
- Specify a test file.
- Skip Out of Vocabulary items (OOVs) in the test.
- Generate random sentences from your ngram model.
- Run at higher debug levels and get more information
from your test.
Next you will test the language model you created in Part One. Test it three ways:
as a trigram model, as a bigram model, and as a unigram model. The trigram model
should get the most information from the training corpus, and should therefore
have the lowest perplexity score on the test corpus. See if this is true by running
the three tests using
/home/ling581/hmm_tagger/data/test.txt
as your test corpus.
Hand in
- Your three perplexity scores (ppl, not ppl1, in the
output you get from ngram). Be sure to tell me which is which.
- 3 sentences randomly generated from your language model
using the ngram command.
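To see how ppl and ppl1 differ, here is a short sketch of how the two scores relate to the logprob and the counts in the output. It assumes ngram is the SRILM tool and that the formulas below match its documentation as I understand it: logprobs are base 10, OOVs and zeroprobs are left out of the word count, and ppl counts one end-of-sentence token per sentence while ppl1 does not. The function names and the numbers in the example are made up.

```python
def perplexities(logprob, words, sentences, oovs, zeroprobs=0):
    """Return (ppl, ppl1) from a base-10 logprob and the reported counts."""
    scored = words - oovs - zeroprobs  # words that actually received a probability
    # ppl also counts one end-of-sentence token per sentence (assumed convention).
    ppl = 10 ** (-logprob / (scored + sentences))
    # ppl1 leaves the end-of-sentence tokens out of the normalization.
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1

def oov_percentage(words, oovs):
    """Percentage of the word tokens in the test corpus that are OOVs."""
    return 100.0 * oovs / words

# Invented numbers, not real output from test.txt.
ppl, ppl1 = perplexities(logprob=-2000.0, words=1000, sentences=100, oovs=20)
print(round(ppl, 1), round(ppl1, 1))
print(oov_percentage(words=1000, oovs=20))
```

Checking your own output against these formulas is a quick way to confirm you are reading the right column.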
Answer these questions:
- What percentage of the test corpus is OOVs? Repeat
your trigram perplexity test ignoring OOVs. Is the perplexity
lowered or raised? Why?
- Run at debug level 1 and redirect the output to a file using
Unix ">". This outputs sentence by sentence perplexity measures.
Find the highest perplexity sentence in the corpus.
[it would not be hard to write a Python script to do this. The "re"
module, which provides for regular expression matching, might be
helpful.] Check to see if length is a factor in
determining the perplexity of a sentence. If so, why;
if not, why not?
- Run ngram using the training corpus from Part One as your test file:
/home/ling581/hmm_tagger/data/train.txt
What happens to the perplexity score? What happens to the OOVs? Why?
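For the debug level 1 question, a script along the following lines can pull the per-sentence perplexities out of the redirected output. It is only a sketch. It assumes each sentence is printed on its own line, followed by a line of counts and a line containing "logprob= ... ppl= ... ppl1= ...", and that the file ends with a summary starting with "file". Check this against your actual output before relying on it. The function name and the sample text, including every number in it, are invented.

```python
import re

# Assumed shapes of the lines in the debug level 1 output.
COUNTS_RE = re.compile(r"^(\d+) sentences, (\d+) words, (\d+) OOVs")
STATS_RE = re.compile(r"logprob=\s*(\S+)\s+ppl=\s*(\S+)\s+ppl1=\s*(\S+)")
SUMMARY_RE = re.compile(r"^file .+: \d+ sentences, ")

def parse_sentence_stats(text):
    """Return one dict per sentence with its text, word count, and ppl."""
    records = []
    sentence, words = None, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if SUMMARY_RE.match(line):
            break  # corpus-level totals follow; stop here
        counts = COUNTS_RE.match(line)
        if counts:
            words = int(counts.group(2))
            continue
        stats = STATS_RE.search(line)
        if stats and sentence is not None and words is not None:
            try:
                ppl = float(stats.group(2))
            except ValueError:
                ppl = None  # skip anything that is not a number
            if ppl is not None:
                records.append({"sentence": sentence, "words": words, "ppl": ppl})
            sentence, words = None, None
            continue
        sentence = line  # anything else is taken to be the sentence itself
    return records

# Invented sample in the assumed format, not real output.
sample = """\
the cat sat
1 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -6.0 ppl= 31.6228 ppl1= 100

a very strange zebra appeared
1 sentences, 5 words, 1 OOVs
0 zeroprobs, logprob= -15.0 ppl= 1000 ppl1= 5623.41

file test.txt: 2 sentences, 8 words, 1 OOVs
0 zeroprobs, logprob= -21.0 ppl= 215.443 ppl1= 1000
"""

records = parse_sentence_stats(sample)
worst = max(records, key=lambda r: r["ppl"])
print(worst["sentence"], worst["ppl"])

# To look at length, list sentences from shortest to longest with their ppl.
for r in sorted(records, key=lambda r: r["words"]):
    print(r["words"], r["ppl"])
```

To run it on your own results, read the redirected file with open(...).read() and pass the text to parse_sentence_stats in place of the sample.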
Part Three
Now run the language model you built in Part One
on a new test file:
/home/ling581/hmm_tagger/data/saint_mark
This is the gospel of Saint Mark. What happens to the log
prob and the perplexity scores as compared to your
first test in Part Two? Explain why.
Train a NEW trigram language model using saint_mark as
your training file. This is very little data so we will
not test this model on any other corpora. But we
will run the model on saint_mark itself. Run at both the
default debug level and at debug level 1.
Hand in
- The perplexity score you get when a model trained on the Gospel of
Saint Mark is tested on the Gospel of Saint Mark itself.
- A list of the 10 highest perplexity sentences from the gospel.
Arguably these sentences are the most unusual in the gospel,
at least in terms of the word sequences that appear elsewhere
in it.
- There is a puzzle about the single highest perplexity
line. It should be highly unusual, but in fact there are a number of other
very similar lines. What are the similar lines?
What distinguishes this one?