| Assignment |
Implement your own HMM tagger |
|||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data |
The data and code you need can be found on bulba under /home/ling581/hmm_tagger. Here's a description of the data:
This is good. It means that in training and testing you can process these corpora on a line by line basis, which is easier in many programming languages, including Perl and Python. |
|||||||||||||||||||||||||||||
|
Code Models |
Also in /home/ling581/hmm_tagger. You get Perl and Python versions of three programs, which should give you examples of all the programming idioms and concepts you need to do the assignment, plus an evaluator!
It will probably be useful to learn a little about the example HMM that the viterbi implementation uses, especially if you need to think about HMMs a little. Here (ps, pdf) is a simplified picture. This is a coin-tossing HMM that tries to predict the value of the next coin toss. H is the state that predicts the coin toss will be a head; T is the state that predicts it will be a tail. [In the actual viterbi code, these are numbered: H is numbered 1, T is numbered 2]. |
|||||||||||||||||||||||||||||
| Theory |   |
The linked review of the theory this assignment requires (ps, pdf) includes:
|
||||||||||||||||||||||||||||
|
Using the Code models |
Here I am training and testing the baseline tagger, and then evaluating it:

    [tagger]$ tagger data/train.tag data/test.txt > tr_test1.tag
    Reading data/train.tag
    Finding most common tags
    Reading data/test.txt
    [tagger]$ evaluate data/test.tag tr_test1.tag
    378598 out of 419872 tags correct
    90.17% word accuracy
    21.25% sentence accuracy

Here I am testing the model answer you don't yet have:

    [gawron@localhost hmm_tagger]$ the_answer.py data/train.tag data/test.txt > hmm_test1.tag
    Reading data/train.tag
    Reading data/test.txt
    [gawron@localhost hmm_tagger]$ evaluate data/test.tag hmm_test1.tag
    392667 out of 419872 tags correct
    93.52% word accuracy
    36.09% sentence accuracy

On my home machine, which is dedicated and pretty snazzy, this took 45 minutes to run. Expect one of our lab machines to take well over an hour, possibly two.

Here I am evaluating it on its own training data, to check out its peak performance:

    [gawron@localhost hmm_tagger]$ bigram data/train.tag data/train.txt > autotrain.tag
    Reading data/train.tag
    Reading data/train.txt
    [gawron@localhost hmm_tagger]$ evaluate data/train.tag autotrain.tag
    2188285 out of 2308885 tags correct
    94.78% word accuracy
    42.51% sentence accuracy

By the way, this took many hours to run. I tested on ALL the training data.

Here I am using the viterbi implementation:

    [gawron@localhost hmm_tagger]$ ./viterbi.prl
    Output:

I typed "./viterbi.prl" here rather than "viterbi.prl" because some of our lab machines have a procedure called viterbi installed on them. Make sure you use the one installed in this directory by typing "./viterbi.prl". [There is also a Python version of the same program called viterbi.py. Try both. Study the one in the language you plan to write your assignment in.]

I now need to type in some output for the tagger to tag. After each output string, it returns the best path and gives me a chance to type in another output string:

    Output: hht
    States: 0121
    Output: tth
    States: 0211
    Output: hththth
    States: 02121211
    Output: hhhhh
    States: 011111
    Output: ttttt
    States: 022221
    Output: hoohah
    Illegal symbol: o!
    States:

Hit Ctrl-D to exit. Note that this example HMM is defined only for the language [ht]*. The code is here to give you a model for how to implement Viterbi. |
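The two figures the evaluator reports can be approximated with a few lines of Python. This is only a sketch of what "word accuracy" and "sentence accuracy" presumably mean (fraction of tags that match, and fraction of lines in which every tag matches); the function name `accuracy` and the list-of-tag-lists input are assumptions for illustration, not the interface of the provided `evaluate` program.

```python
def accuracy(gold, predicted):
    """Return (word_accuracy, sentence_accuracy) as fractions.

    gold, predicted: aligned, non-empty lists of sentences, each sentence
    a list of tags. (Hypothetical input shape; the real evaluator reads files.)
    """
    words = correct_words = correct_sents = 0
    for g, p in zip(gold, predicted):
        matches = sum(1 for a, b in zip(g, p) if a == b)
        words += len(g)
        correct_words += matches
        # A sentence counts as correct only if every tag matches.
        correct_sents += (matches == len(g) == len(p))
    return correct_words / words, correct_sents / len(gold)


# Sanity check against the baseline run shown above.
print(f"{378598 / 419872:.2%}")  # 90.17%
```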
|||||||||||||||||||||||||||||
|
What to produce |
Hand in two things. First, a Python program that works like "tagger" in terms of arguments and output, but replaces the naive tagging algorithm of "tagger" with an HMM tagging algorithm. During the training phase it should use the training file argument to build a bigram HMM tagging model of the type we applied in the previous assignment. The probability model we use looks at the previous tag but not at the previous word, so for the first word of each line we need to hallucinate a previous tag; we will call that tag 'START'. You should therefore hallucinate a START tag as the previous tag at the start of every line in the training data, and do the same when you run your trained tagger on test data.
The probability model being used is reviewed here (ps, pdf). Here is an example tagging for the sequence "ground control station" with each word aligned above its emission and transition probability.
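Written out for that three-word sequence, and assuming the standard bigram HMM factorization (one transition and one emission probability per word, with no separate end-of-line transition), the score of a candidate tag sequence $t_1, t_2, t_3$ would be:

```latex
\begin{aligned}
P(w_{1..3}, t_{1..3}) ={}& P(t_1 \mid \text{START})\, P(\text{ground} \mid t_1) \\
\times{}& P(t_2 \mid t_1)\, P(\text{control} \mid t_2) \\
\times{}& P(t_3 \mid t_2)\, P(\text{station} \mid t_3)
\end{aligned}
```

The symbols $t_1, t_2, t_3$ are placeholders for whichever tags are being scored; the tagger's job is to find the assignment that maximizes this product.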
So what you have to do is hallucinate a START tag as the $previous_tag that occurs before the first word of each line of the training data (remember you're processing line by line). Remember to hallucinate consistently: you must also hallucinate START as the $previous_tag before the first word of each line of your test data. During the tagging phase your tagger should use the Viterbi algorithm to tag the test file.
Second, hand in the results of evaluating your HMM tagger. It should do better than 90%, hey. |
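To make the two phases concrete, here is a minimal sketch of START-aware training and Viterbi decoding. It is not the model answer. It assumes sentences are already parsed into (word, tag) pairs (the file format is not restated here), and it handles unseen words and transitions with a small floor probability, which is an illustrative choice rather than anything the assignment prescribes. The names `train`, `log_prob`, `viterbi`, and `floor` are all hypothetical.

```python
import math
from collections import defaultdict

START = "START"  # hallucinated previous tag at the start of every line


def train(tagged_lines):
    """Count tag bigrams and (tag, word) emissions.

    tagged_lines: iterable of sentences, each a list of (word, tag) pairs.
    """
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    for sentence in tagged_lines:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counts, given, outcome, floor=1e-6):
    """Log relative frequency of outcome given `given`.

    Unseen events get log(floor): an assumed stand-in for real smoothing.
    """
    total = sum(counts[given].values()) if given in counts else 0
    count = counts[given].get(outcome, 0) if total else 0
    return math.log(count / total) if count else math.log(floor)


def viterbi(words, trans, emit):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = list(emit)  # every tag seen in training
    # Each column maps tag -> (best log score ending in tag, backpointer).
    best = [{t: (log_prob(trans, START, t) + log_prob(emit, t, words[0]), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans, p, t), p) for p in tags
            )
            column[t] = (score + log_prob(emit, t, word), prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return path
```

Working in log space avoids underflow on long lines, and using the same START constant in both `train` and `viterbi` is what "hallucinate consistently" amounts to in code.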
|||||||||||||||||||||||||||||
|
Implementation tips |
|