Complexity issues have reared their ugly heads again and with the IPO date on your new comp ling startup fast approaching, you have discovered that if your hot new system is going to parse sentences as long as 4 words, you had better limit yourself to a 3-word vocabulary.
Consider the following HMM tagger for tagging texts constructed with a 3 word vocabulary.
Here is a partial state table for the HMM. The states are Start, V = Verb, and N = Noun.

| State | Observation | Transition | Transition Probability |
|---|---|---|---|
| Start | ground | Start => N | Pr(N \| Start) * Pr(ground \| N) = .5 * .4 = .20 |
| | | Start => V | Pr(V \| Start) * Pr(ground \| V) = .5 * .3 = .15 |
| | control | Start => N | to be filled in |
| | | Start => V | to be filled in |
| | station | Start => N | to be filled in |
| | | Start => V | to be filled in |
| V | ground | V => N | Pr(N \| V) * Pr(ground \| N) = .9 * .4 = .36 |
| | | V => V | Pr(V \| V) * Pr(ground \| V) = .1 * .3 = .03 |
| | control | V => N | to be filled in |
| | | V => V | to be filled in |
| | station | V => N | to be filled in |
| | | V => V | to be filled in |
| N | ground | N => N | Pr(N \| N) * Pr(ground \| N) = .5 * .4 = .20 |
| | | N => V | Pr(V \| N) * Pr(ground \| V) = .5 * .3 = .15 |
| | control | N => N | to be filled in |
| | | N => V | to be filled in |
| | station | N => N | to be filled in |
| | | N => V | to be filled in |
Our HMM tagger always starts in the Start state. Since states always correspond to tags in our model, this corresponds to the assumption that the previous tag at the start of input is always Start.
You can use this tagger to assign the most likely tag sequence to any sequence of words taken from our small vocabulary. Consider the following input to our tagger:
| Input | | ground | ground |
|---|---|---|---|
| State | start | V | N |
| Probability | | (.3 * .5) | (.4 * .9) |
The probability model being used is the following.
$$P(w_i \mid t_i) \cdot P(t_i \mid t_{i-1})$$
For our example, according to this probability model we calculate the joint probability to be:
$$\Pr(\text{ground} \mid V) \cdot \Pr(V \mid \text{Start}) \cdot \Pr(\text{ground} \mid N) \cdot \Pr(N \mid V) = .3 \cdot .5 \cdot .4 \cdot .9 = .054$$
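The same calculation can be sketched in a few lines of Python. This is only an illustration: the nested-dictionary layout and the names `EMIT`, `TRANS`, and `joint_probability` are assumptions, not part of the assignment code. The numbers are the ones shown in the partial state table.

```python
# Values copied from the partial state table; the dictionary layout is an assumption.
EMIT = {"N": {"ground": 0.4}, "V": {"ground": 0.3}}          # EMIT[tag][word] = Pr(word | tag)
TRANS = {"Start": {"N": 0.5, "V": 0.5},                      # TRANS[prev][tag] = Pr(tag | prev)
         "V": {"N": 0.9, "V": 0.1},
         "N": {"N": 0.5, "V": 0.5}}


def joint_probability(words, tags, start="Start"):
    """Multiply Pr(w_i | t_i) * Pr(t_i | t_{i-1}) over the whole sequence."""
    prob = 1.0
    prev = start
    for word, tag in zip(words, tags):
        prob *= EMIT[tag][word] * TRANS[prev][tag]
        prev = tag
    return prob


# The tagging "ground/V ground/N" from the example; the result is approximately .054.
print(joint_probability(["ground", "ground"], ["V", "N"]))
```

Calling the function with a different tag sequence, for example `["N", "N"]`, gives the joint probability of that alternative tagging of the same words.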
To find the most likely assignment of tags we need to find the most probable path through the HMM. This is what the Viterbi algorithm is for.
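As a rough sketch of the idea, here is one way Viterbi decoding could be written for the two-word example input. The function name `viterbi`, its parameters, and the nested-dictionary layout are assumptions made for illustration; the model contains only the `ground` probabilities already shown in the partial state table.

```python
# Only the values shown in the partial state table (an illustrative subset of the model).
EMIT = {"N": {"ground": 0.4}, "V": {"ground": 0.3}}          # EMIT[tag][word] = Pr(word | tag)
TRANS = {"Start": {"N": 0.5, "V": 0.5},                      # TRANS[prev][tag] = Pr(tag | prev)
         "V": {"N": 0.9, "V": 0.1},
         "N": {"N": 0.5, "V": 0.5}}


def viterbi(words, tags, trans, emit, start="Start"):
    """Return (most probable tag sequence, its joint probability)."""
    # Each column maps a tag to (probability of the best path ending there, previous tag).
    columns = [{start: (1.0, None)}]
    for word in words:
        column = {}
        for tag in tags:
            # Keep only the best way of arriving at this tag.
            column[tag] = max(
                (p * trans[prev][tag] * emit[tag][word], prev)
                for prev, (p, _) in columns[-1].items()
            )
        columns.append(column)
    # Pick the best final state, then follow the back pointers.
    tag = max(columns[-1], key=lambda t: columns[-1][t][0])
    prob = columns[-1][tag][0]
    path = []
    for column in reversed(columns[1:]):
        path.append(tag)
        tag = column[tag][1]
    return list(reversed(path)), prob


path, prob = viterbi(["ground", "ground"], ["V", "N"], TRANS, EMIT)
print(path, round(prob, 3))
```

For this input the sketch recovers the tagging V N with probability of about .054, the same path and value as in the worked example above. The point of the algorithm is that each column keeps only the best path into each state, so the work grows linearly with the length of the input rather than with the number of possible tag sequences.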
Part A
The above HMM was given with only a partial probability model. Here is the entire probability model:
Pr(wi | ti):

| w | Pr(w \| N) | Pr(w \| V) | Pr(w \| Start) |
|---|---|---|---|
| ground | .4 | .3 | 0 |
| control | .3 | .3 | 0 |
| station | .3 | .4 | 0 |

Pr(ti | ti-1), with one column per previous tag ti-1:

| ti | ti-1 = N | ti-1 = V | ti-1 = Start |
|---|---|---|---|
| V | Pr(V \| N) = .5 | Pr(V \| V) = .1 | Pr(V \| Start) = .5 |
| N | Pr(N \| N) = .5 | Pr(N \| V) = .9 | Pr(N \| Start) = .5 |
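One convenient way to hold this model in a program is as nested dictionaries. The sketch below is an assumption about representation, not the assignment's required format: `EMIT`, `TRANS`, `is_distribution`, and `cell` are hypothetical names. It checks that each conditional distribution sums to 1 and computes one state-table entry from the two tables.

```python
# EMIT[tag][word] = Pr(word | tag); values from the word-tag table.
EMIT = {
    "N": {"ground": 0.4, "control": 0.3, "station": 0.3},
    "V": {"ground": 0.3, "control": 0.3, "station": 0.4},
}
# TRANS[prev][tag] = Pr(tag | prev); values from the tag-tag table.
TRANS = {
    "Start": {"N": 0.5, "V": 0.5},
    "N": {"N": 0.5, "V": 0.5},
    "V": {"N": 0.9, "V": 0.1},
}


def is_distribution(dist, tol=1e-9):
    """True if the probabilities in dist sum to 1 (within a small tolerance)."""
    return abs(sum(dist.values()) - 1.0) < tol


def cell(prev, tag, word):
    """One state-table entry: Pr(tag | prev) * Pr(word | tag)."""
    return TRANS[prev][tag] * EMIT[tag][word]


assert all(is_distribution(d) for d in EMIT.values())
assert all(is_distribution(d) for d in TRANS.values())
print(round(cell("V", "N", "ground"), 2))  # the V => N entry for "ground": 0.36
```

The printed value agrees with the V => N entry for `ground` in the partial state table.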
Part B
The second part of the tagging problem is to tag the following input:

ground control station

The assigned problem is to choose the most probable path by using the Viterbi algorithm.
To help you get started, here is the partial Viterbi matrix for our HMM and the given input:
| | t=0 | t=1 | t=2 | t=3 |
|---|---|---|---|---|
| V | | | | |
| N | | | | |
| Start | 1.0 | | | |
| Input | | ground | control | station |
Using the results of your Viterbi calculation, give the most probable state sequence through the HMM:
Write a program that produces a probability model for an HMM bigram tagger using a tagged corpus. For a review of what the probability model is, look here. NOTE: You are NOT being asked to write a tagger, just a program that produces the probability model such a tagger uses.
To help you out here are some models to modify:
Here I am training and testing the baseline tagger:
    [tagger]$ tagger data/train.tag data/test.txt > tr_test1.tag
    Reading data/train.tag
    Finding most common tags
    Reading data/test.txt

train.tag is a file with tagged data in it. The first few lines look like this:

    FACTSHEET_NN1 WHAT_DTQ IS_VBZ AIDS_NN1 ?_? AIDS_NN1 (_( Acquired_NP0 Immune_AJ0 Deficiency_NN1 Syndrome_NP0 )_) is_VBZ a_AT0 condition_NN1 caused_VVN by_PRP a_AT0 virus_NN1 called_VVD HIV_NP0 (_( Human_AJ0 Immuno_NN1 Deficiency_NP0 Virus_NP0 )_) ._. How_AVQ is_VBZ infection_NN1 transmitted_VVD ?_? through_PRP unprotected_AJ0 sexual_AJ0 intercourse_NN1 with_PRP an_AT0 infected_AJ0 partner_NN1 ._. through_PRP infected_AJ0 blood_NN1 or_CJC blood_NN1 products_NN2 ._. from_PRP an_AT0 infected_AJ0 mother_NN1 to_PRP her_DPS baby_NN1 ._.

Each word is connected to its tag by an underscore ("_"), so you need to separate these two, keep count of how many times each word and tag co-occur, and keep track of tag "bigrams" as well.
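The counting step just described might look roughly like the sketch below. It is not the course's example code: the function name `count_tagged_lines` is hypothetical, and two choices in it are assumptions, namely splitting each token at its last underscore and restarting from a `Start` pseudo-tag at the beginning of every line (borrowed from the toy HMM above).

```python
from collections import Counter


def count_tagged_lines(lines, start="Start"):
    """Count (word, tag) co-occurrences and (previous tag, tag) bigrams."""
    word_tag = Counter()     # (word, tag) -> count
    tag_bigram = Counter()   # (previous tag, tag) -> count
    for line in lines:
        prev = start         # assumption: every line starts from the Start pseudo-tag
        for token in line.split():
            # Split at the LAST underscore so a word containing "_" keeps its full form.
            word, tag = token.rsplit("_", 1)
            word_tag[(word, tag)] += 1
            tag_bigram[(prev, tag)] += 1
            prev = tag
    return word_tag, tag_bigram


sample = ["How_AVQ is_VBZ infection_NN1 transmitted_VVD ?_?"]
word_tag, tag_bigram = count_tagged_lines(sample)
print(word_tag[("is", "VBZ")], tag_bigram[("Start", "AVQ")], tag_bigram[("AVQ", "VBZ")])
```

In a real run, `lines` would be the open training file, since iterating over a file object yields one line at a time.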
The data and code you need can be found on bulba under:

    /home/ling581/hmm_tagger

Here's a description of the DATA:

The corpora are all line-by-line corpora. This means that, as much as possible, each line contains a complete sentence or a complete fragment. It also means adjacent lines are not guaranteed to be meaningfully related.
| File | Type | Description |
|---|---|---|
| data/train.tag | training | tagged training data. Train on this! |
| data/really_tiny_train.tag | | very small subset of tagged training data. Use this only for debugging the training phase! |
| data/test.txt | test | untagged test data. Run your baby on this! |
| data/test.tag | | tagged test data; gold standard for test.txt. Evaluate your baby's performance with this! |
| data/train.txt | | the data of train.tag untagged: run your tagger on this and do real well! Also: for running your tagger without unknown words |
| data/tiny_train.txt | | tiny untagged subset of training data, the size of the test data files; faster max performance test |
| data/tiny_train.tag | | tiny tagged subset of training data; faster max performance test; gold standard for tiny_train.txt |
| data/tiny_test.txt | | subset of test.txt, but a tiny amount for debugging |
| data/tiny_test.tag | | gold standard for tiny_test.txt |
| data/valid.tag | development | tagged development test data |
| data/valid.txt | | untagged development test data |

This is good. It means that in training and testing you can process these corpora on a line-by-line basis, which is easier in many programming languages, including Perl and Python.
The large training file is also here.

Hint about example code: For this assignment, all you need to pay attention to is the first part of the code, the training step. That part of the code ends here in the Python code:

    fsock_train.close()

You need to hand in a proper HMM probability model for the corpus train.tag. That will consist of the following:
- Word-tag model: For each word-tag pair, the probability of the word given the tag.
- Tag-tag model: For each tag-tag pair (t1,t2), the probability of t2 given t1.
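Both models can be estimated from counts by relative frequency. The sketch below assumes plain, unsmoothed estimates, which the assignment text does not spell out, and the helper name `relative_frequencies` is hypothetical. Each pair count is divided by the total count of the item being conditioned on.

```python
from collections import Counter


def relative_frequencies(pair_counts, given_index):
    """Convert pair counts to conditional probabilities.

    given_index says which member of the pair is the conditioning item:
    1 for (word, tag) pairs, giving Pr(word | tag);
    0 for (t1, t2) pairs, giving Pr(t2 | t1).
    """
    totals = Counter()
    for pair, n in pair_counts.items():
        totals[pair[given_index]] += n
    return {pair: n / totals[pair[given_index]] for pair, n in pair_counts.items()}


# Made-up toy counts, for illustration only.
word_tag_counts = Counter({("ground", "N"): 2, ("station", "N"): 2, ("ground", "V"): 1})
tag_tag_counts = Counter({("Start", "N"): 3, ("Start", "V"): 1, ("N", "V"): 1})

print(relative_frequencies(word_tag_counts, given_index=1))
print(relative_frequencies(tag_tag_counts, given_index=0))
```

Note that for the tag-tag model this divides by the number of bigrams that begin with t1. That can differ slightly from the raw count of t1, because a tag at the end of a line starts no bigram; which denominator to use is a design decision.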
Your probability model should be output to a file which you will hand in. The format is the following. For the word-tag model:

    word tag prob

Each line contains just the word, the tag, and the probability, in that order, separated by nothing but white space. For the tag-tag model the format is:

    tag tag prob

The two models should come in the order given above, word-tag model followed by tag-tag model, and they should be separated by a line containing the following:

    ***END WORD TAG MODEL***

Here is some Python code illustrating how to output stuff to a file:
    import sys

    try:
        fsock_out = open(out_file_s, 'w')  ## Open file for writing
        print('Writing to %s' % out_file_s, file=sys.stderr)  ## Message to STDERR
        word_freq_list = word_count.items()  ## Make a list of pairs from a dictionary
        for item in word_freq_list:
            # Print each pair to the file, separated by tabs.
            print('%s\t%s' % (item[0], item[1]), file=fsock_out)
        # Close the file handle (Good citizenship!)
        fsock_out.close()
    except IOError:
        print('Could not write to %s' % out_file_s, file=sys.stderr)
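Putting the output format together, a writer for the whole model file could be sketched as follows. The function name `write_model` and the dictionary shapes (pairs mapped to probabilities) are assumptions made for illustration; only the line formats and the separator line come from the description above.

```python
import sys

SEPARATOR = "***END WORD TAG MODEL***"


def write_model(word_tag_probs, tag_tag_probs, out):
    """Write the word-tag model, the separator line, then the tag-tag model."""
    for (word, tag), prob in sorted(word_tag_probs.items()):
        print(word, tag, prob, file=out)   # word tag prob
    print(SEPARATOR, file=out)
    for (t1, t2), prob in sorted(tag_tag_probs.items()):
        print(t1, t2, prob, file=out)      # tag tag prob, meaning Pr(t2 | t1)


# Toy values taken from the small HMM at the top of this page.
write_model({("ground", "N"): 0.4}, {("Start", "N"): 0.5}, sys.stdout)
```

Passing an open file handle instead of `sys.stdout` writes the same lines to the file you hand in. Sorting is not required by the format; it just makes two runs easy to compare.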