Implementing your own HMM Tagger

Assignment  

Implement your own HMM tagger.

Data  

The data and code you need can be found on bulba under:

/home/ling581/hmm_tagger

Here's a description of the DATA:

    File                         Type          Description
    data/train.tag               training      tagged training data; train on this!
    data/really_tiny_train.tag                 very small subset of tagged training data;
                                               use this only for debugging the training phase!
    data/test.txt                test          untagged test data; run your baby on this!
    data/test.tag                              tagged test data; gold standard for test.txt;
                                               evaluate your baby's performance with this!
    data/train.txt                             the data of train.tag, untagged; run your tagger
                                               on this and do real well! Also for running your
                                               tagger without unknown words
    data/tiny_train.txt                        tiny untagged subset of training data, the size of
                                               test data files; faster max performance test
    data/tiny_train.tag                        tiny tagged subset of training data; gold standard
                                               for tiny_train.txt; faster max performance test
    data/tiny_test.txt                         subset of test.txt, but a tiny amount, for debugging
    data/tiny_test.tag                         gold standard for tiny_test.txt
    data/valid.tag               development   tagged development test data
    data/valid.txt                             untagged development test data
The corpora are all line-by-line corpora. This means that, as far as possible, each line contains a complete sentence or a complete fragment. It also means adjacent lines are not guaranteed to be meaningfully related.

This is good. It means that in training and testing you can process these corpora on a line by line basis, which is easier in many programming languages, including Perl and Python.
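As a rough illustration of line-by-line processing, here is a minimal sketch that turns each line of a tagged corpus into a list of (word, tag) pairs. The word/TAG token format, the separator, and the function names are all assumptions made for this example; look at data/really_tiny_train.tag to see what the real files contain.

```python
def read_tagged_line(line, sep="/"):
    """Split one corpus line into (word, tag) pairs.

    ASSUMPTION: tokens are whitespace-separated and look like word/TAG.
    The real corpus format may differ.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains "/" keeps everything before the final one.
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def read_tagged_lines(lines):
    """Yield one list of (word, tag) pairs per non-empty line."""
    for line in lines:
        pairs = read_tagged_line(line)
        if pairs:
            yield pairs


# Two made-up lines stand in for an open file object here.
sample = ["the/DT dog/NN barks/VBZ\n", "1/2/CD cup/NN\n"]
sentences = list(read_tagged_lines(sample))
```

Since an open file is itself an iterable of lines, the same generator should work when handed the result of open() on a corpus file, and nothing is carried over from one line to the next.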

Code models

Also in:

/home/ling581/hmm_tagger

You get Perl and Python versions of three programs, which should give you examples of all the programming idioms and concepts you need to do the assignment, plus an evaluator!

    tagger (.prl, .py)     a baseline tagger
    viterbi (.prl, .py)    an implementation of viterbi
    viterbi_verbose        a more verbose version of the same implementation
    evaluate (.prl, .py)   evaluates a tagger
Note that the viterbi_verbose program is only available in Perl. It uses exactly the same algorithm as the viterbi.prl/viterbi.py program and just outputs a little more information.

It will probably be useful to learn a little about the example HMM that the viterbi implementation uses, especially if you need to think about HMMs a little. Here (ps, pdf) is a simplified picture. This is a coin-tossing HMM that tries to predict the value of the next coin toss. H is the state that predicts the coin toss will be a head; T is the state that predicts it will be a tail. [In the actual viterbi code, these are numbered: H is numbered 1, T is numbered 2].

Theory  

The linked review of the theory this assignment requires (ps, pdf) includes:

  1. A review of the basic probability model used for HMM tagging
  2. A discussion of the Viterbi algorithm as applied to the simple HMM in viterbi.py/viterbi.prl.
  3. A discussion of add .5 smoothing.
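To make item 3 concrete, here is a minimal sketch of add .5 smoothing in its usual add-k form: every possible outcome gets an extra half count, and the denominator grows to match. The function name, the tag names, and the counts are made up for illustration; the linked review is the authority on the exact formula expected in the assignment.

```python
def add_k_probability(count, total, num_outcomes, k=0.5):
    """Smoothed estimate: every outcome gets k extra (fractional) counts."""
    return (count + k) / (total + k * num_outcomes)


# Made-up counts of the tags seen after the tag "N".
counts_after_n = {"N": 3, "V": 1, "ADJ": 0}
total = sum(counts_after_n.values())

smoothed = {
    tag: add_k_probability(c, total, len(counts_after_n))
    for tag, c in counts_after_n.items()
}

for tag, p in smoothed.items():
    print(tag, round(p, 4))
```

The unseen tag ADJ ends up with a small nonzero probability, and the three estimates still sum to one.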
Using the code models

Here I am training and testing the baseline tagger, and then evaluating it:

[tagger]$ tagger data/train.tag data/test.txt > tr_test1.tag
Reading data/train.tag
Finding most common tags
Reading data/test.txt
[tagger]$ evaluate data/test.tag tr_test1.tag 
378598 out of 419872 tags correct
90.17% word accuracy
21.25% sentence accuracy
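The "Finding most common tags" message suggests the baseline gives each word the tag it carried most often in training. Here is a minimal sketch of that idea, not the provided tagger's actual code. The function names are hypothetical, and falling back to the overall most common tag for unknown words is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Return (most common tag per word, most common tag overall)."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    best = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return best, default_tag


def tag_baseline(words, best, default_tag):
    """Tag each word independently; unknown words get the default tag
    (an assumption of this sketch)."""
    return [best.get(w, default_tag) for w in words]


# Made-up training data, already split into (word, tag) pairs.
training = [
    [("the", "DT"), ("ground", "NN")],
    [("ground", "VB"), ("the", "DT"), ("ground", "NN"), ("station", "NN")],
]
best, default_tag = train_baseline(training)
print(tag_baseline(["the", "ground", "rocket"], best, default_tag))
```

Because each word is tagged without looking at its neighbors, a tagger like this can never use context, which is exactly what the HMM adds.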

Here I am testing the model answer you don't yet have:

[gawron@localhost hmm_tagger]$ the_answer.py data/train.tag data/test.txt > hmm_test1.tag
Reading data/train.tag
Reading data/test.txt
[gawron@localhost hmm_tagger]$ evaluate  data/test.tag hmm_test1.tag 
392667 out of 419872 tags correct
93.52% word accuracy
36.09% sentence accuracy
On my home machine, which is dedicated and pretty snazzy, this took 45 minutes to run. Expect one of our lab machines to take well over an hour, possibly two.
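The two figures the evaluator prints can be computed along these lines. This is a sketch rather than the provided evaluator: the function name is hypothetical, and it assumes that sentence accuracy means the fraction of lines in which every tag is correct.

```python
def accuracy(gold_sentences, predicted_sentences):
    """Return (correct tags, total tags, word accuracy, sentence accuracy).

    Each argument is a list of lines, and each line is a list of tags.
    """
    correct = total = perfect = 0
    for gold, pred in zip(gold_sentences, predicted_sentences):
        matches = sum(g == p for g, p in zip(gold, pred))
        correct += matches
        total += len(gold)
        # A line counts as perfect only if every tag matches.
        if matches == len(gold) and len(gold) == len(pred):
            perfect += 1
    return correct, total, correct / total, perfect / len(gold_sentences)


# Made-up gold and predicted tags for two lines.
gold = [["DT", "NN", "VBZ"], ["NN", "NN"]]
pred = [["DT", "NN", "VBZ"], ["VB", "NN"]]
result = accuracy(gold, pred)

# The baseline's word accuracy from the transcript above, recomputed.
baseline_word_accuracy = 378598 / 419872
print(f"{100 * baseline_word_accuracy:.2f}% word accuracy")
```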

Here I am evaluating it on its own training data, to check out its peak performance:

[gawron@localhost hmm_tagger]$ bigram data/train.tag data/train.txt > autotrain.tag
Reading data/train.tag
Reading data/train.txt
[gawron@localhost hmm_tagger]$ evaluate data/train.tag autotrain.tag 
2188285 out of 2308885 tags correct
94.78% word accuracy
42.51% sentence accuracy
By the way, this took many hours to run. I tested on ALL the training data.

Here I am using the viterbi implementation:

[gawron@localhost hmm_tagger]$ ./viterbi.prl
Output:
I typed "./viterbi.prl" here rather than "viterbi.prl" because some of our lab machines have a program called viterbi installed on them. Make sure you use the one installed in this directory by typing "./viterbi.prl". [There is also a Python version of the same program called viterbi.py. Try both. Study the one in the language you plan to write your assignment in.]

I now need to type in some output for the tagger to tag. After each output string, it returns the best path and gives me a chance to type in another output string:

Output: hht
States: 0121

Output: tth
States: 0211

Output: hththth
States: 02121211

Output: hhhhh
States: 011111

Output: ttttt
States: 022221

Output: hoohah
Illegal symbol: o!
States: 
Hit Ctrl-D to exit. Note that this example HMM is defined only for the language [ht]*. The code is here to give you a model for how to implement Viterbi.
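For orientation before reading viterbi.py, here is a compact sketch of Viterbi in log space on a two-state coin HMM. It is not the provided implementation. The transition and emission numbers are invented, and this toy model lets each state emit the current toss rather than predict the next one, so its paths should not be expected to match the transcript above. Only the state numbering (0 for start, 1 for H, 2 for T) follows the description of the provided code.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Natural log, with log(0) represented as -infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(observations, states, start, trans, emit):
    """Return the most probable state path, beginning with the start state.

    trans[prev][state] and emit[state][symbol] are ordinary probabilities;
    all scoring happens in log space, so products become sums.
    """
    # best[s] = log probability of the best path so far that ends in state s
    best = {start: 0.0}
    backpointers = []
    for symbol in observations:
        new_best, pointers = {}, {}
        for s in states:
            e = log(emit[s].get(symbol, 0.0))
            score, prev = max(
                (best[p] + log(trans[p].get(s, 0.0)) + e, p) for p in best
            )
            new_best[s], pointers[s] = score, prev
        best = new_best
        backpointers.append(pointers)
    # Pick the best final state and follow the backpointers to the start.
    last = max(best, key=best.get)
    if best[last] == NEG_INF:
        return None  # no path can produce these observations
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up parameters: state 0 is the start state, 1 is H, 2 is T.
TRANS = {0: {1: 0.5, 2: 0.5}, 1: {1: 0.7, 2: 0.3}, 2: {1: 0.3, 2: 0.7}}
EMIT = {1: {"h": 0.8, "t": 0.2}, 2: {"h": 0.2, "t": 0.8}}

path = viterbi("hht", [1, 2], 0, TRANS, EMIT)
print("States:", "".join(str(s) for s in path))
```

A symbol with no emission probability, such as the "o" in "hoohah", drives every score to -infinity, and this sketch then returns None instead of a path.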
What to produce

Hand in two things: first, a Python program that works like "tagger" in terms of arguments and output, but replaces the naive tagging algorithm of "tagger" with an HMM tagging algorithm.

So during the training phase it should use the training file argument to build a bigram HMM tagging model of the type we applied in the previous assignment.

The probability model we use looks at the previous tag but not at the previous word. So for the first word of each line we need to hallucinate a previous tag; we will call that tag 'START'. You should therefore hallucinate a start tag as the previous tag at the start of every line in the training data. And you should do the same when you are running your trained tagger on test data.

The probability model being used is reviewed here (ps, pdf).

Here is an example tagging for the sequence "ground control station" with each word aligned above its emission and transition probability.
               ground        control       station
    START      V             N             N
               (1.0 * .5)    (.3 * .9)     (.4 * .5)
The numbers are made up, so don't try to link them with the model in the last assignment. This corresponds to the path in which ground is labeled a verb, control a noun, and station a noun. Note that this assumes there is a "starting" part of speech, which we call START, modeled by the start state. Conceptually it's the "previous tag" every sentence starts with. (But there is no starting word [start].)
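The score of this path is just the product of all six numbers in the example. A small worked sketch, using the made-up numbers exactly as given (the variable names are mine):

```python
import math

# The two factors listed under each word in the example above.
factors = [(1.0, 0.5), (0.3, 0.9), (0.4, 0.5)]

# Path probability: multiply every factor together.
path_probability = math.prod(a * b for a, b in factors)

# The same quantity in log space: a sum of logs instead of a product.
path_log_probability = sum(math.log(a) + math.log(b) for a, b in factors)

print(path_probability)
print(path_log_probability)
```

The product works out to 0.5 * 0.27 * 0.2 = 0.027, up to floating-point rounding, and exponentiating the log-space sum gives the same value.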

So what you have to do is to hallucinate a START tag as the $previous_tag that occurs before the first word of each line of the training data (remember you're processing line by line). Remember to hallucinate consistently. You must also hallucinate START as the $previous_tag before the first word of each line of your test data.
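In Python, the bookkeeping might look like the sketch below. The counter layout and the function name are illustrative choices, not requirements; the one essential point is that previous_tag is reset to START at the top of every line.

```python
from collections import Counter

START = "START"  # the hallucinated tag before the first word of a line


def count_line(pairs, transition_counts, emission_counts):
    """Add one line's (word, tag) pairs to the running counts."""
    previous_tag = START  # reset on every line, never carried over
    for word, tag in pairs:
        transition_counts[(previous_tag, tag)] += 1
        emission_counts[(tag, word)] += 1
        previous_tag = tag


transition_counts = Counter()
emission_counts = Counter()

# Two made-up training lines, already split into (word, tag) pairs.
lines = [
    [("ground", "V"), ("control", "N"), ("station", "N")],
    [("station", "N")],
]
for pairs in lines:
    count_line(pairs, transition_counts, emission_counts)

print(transition_counts[(START, "V")], transition_counts[(START, "N")])
```

The tagging phase should begin each line from the same START state, so that the transition counts collected here are the ones actually used.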

During the tagging phase your tagger should use the Viterbi algorithm to tag the test file.

Second, hand in the results of evaluating your HMM tagger. It should do better than 90% word accuracy, hey.

Implementation tips

  1. Remember to run your code first on the tiny and really tiny training and test files provided. You will uncover lots of simple bugs this way.

  2. Eventually the time will come when you need to take your maiden voyage on a big file and run a program that may well take hours. At that point you may want to leave the program running and log off bulba. This is called a batch job. This document tells how to get one running, and even addresses how to send STDOUT and STDERR to different files.
  3. Note: If you have Linux installed on your home machine, then all of this assignment can be done there. Almost every Linux distribution (Red Hat and Fedora included) includes a full Python package, more than enough to do this assignment. The only Python modules you need are all imported in the code models you have.

  4. The Viterbi implementation you've been given uses log probabilities. You should do the same. Pay attention to how probabilities are turned into log probabilities in the Viterbi implementation. Do this in your implementation. Pay special attention to the use of -$Inf to represent the log of 0. Pay attention to how multiplication turns into addition and division turns into subtraction.

  5. You will have to do smoothing to get this to do anything useful. Use add .5 smoothing. Review this concept here (ps, pdf). Remember: for purposes of this assignment, smoothing just means adjusting your counts so there are no zeroes.

  6. These are line-based corpora. No line is guaranteed to have anything to do with the previous line. Thus, always use line by line processing. Train line by line. Test line by line. DON'T try to remember anything from the previous line when you're tagging the test data.

  7. Python: Dictionaries
    1. Introduction to dictionaries
    2. A discussion of Python dictionaries directly targeting their use for this assignment. This discussion assumes you know the basics of Python dictionaries already, so read the above for that.
    Less directly related to our current assignment but still interesting:
    1. Python Cookbook recipe for sorting a dictionary by keys (a dictionary does not keep its key-value pairs sorted by key, so if you want a list ordered by key, here's how)
    2. Python Knowledge Base: a lot of example code using dictionaries.
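Tips 4 and 7 meet in practice: the model ends up as dictionaries of log probabilities. The sketch below uses made-up counts and hypothetical names. It shows division turning into subtraction, multiplication turning into addition, float("-inf") standing in for the log of 0, and a walk through the dictionary in key order.

```python
import math

NEG_INF = float("-inf")  # stands in for log(0)


def log_or_neg_inf(x):
    """math.log raises an error on 0, so map 0 to -infinity by hand."""
    return math.log(x) if x > 0 else NEG_INF


# Made-up, unsmoothed counts of which tag follows the tag "N".
following_n = {"V": 6, "N": 2, "ADJ": 0}
total = sum(following_n.values())

# Division becomes subtraction: log(count / total) = log(count) - log(total).
log_prob = {
    tag: log_or_neg_inf(count) - math.log(total)
    for tag, count in following_n.items()
}

# Multiplication becomes addition: log(p * q) = log(p) + log(q).
n_then_v_twice = log_prob["V"] + log_prob["V"]

# Why bother: a long product of small probabilities underflows to 0.0,
# while the equivalent sum of logs stays an ordinary number.
underflowed = math.prod([0.01] * 200)
safe = 200 * math.log(0.01)

# Walk the dictionary in key order.
for tag in sorted(log_prob):
    print(tag, log_prob[tag])
```

With add .5 smoothing applied to the counts first, the ADJ entry would be a finite negative number rather than -infinity.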