HMM Tagging Problem: Part I

Complexity issues have reared their ugly heads again and with the IPO date on your new comp ling startup fast approaching, you have discovered that if your hot new system is going to parse sentences as long as 4 words, you had better limit yourself to a 3-word vocabulary.

Consider the following HMM tagger for tagging texts constructed with a 3-word vocabulary:

  1. station
  2. ground
  3. control
and a tagset of 2 tags:
  1. V [Verb]
  2. N [Noun]

Here is a partial state table for the HMM:

The table is incomplete. The states are all present, and all the transitions with non-zero probabilities are present, but the transition probabilities for control and station have been left out.

Our HMM tagger always starts in the Start state. Since states always correspond to tags in our model, this corresponds to the assumption that the previous tag at the start of input is always Start.

You can use this tagger to assign the most likely tag sequence to any sequence of words taken from our small vocabulary. Consider the following input to our tagger:

  ground ground

There are 4 different state sequences that will accept this input:
  1. Start V V
  2. Start V N
  3. Start N N
  4. Start N V
These correspond to 4 different assignments of part-of-speech tags to the two input words (ground ground). For sequence 2, here is each word aligned above its transition and transition probability:
          ground       ground
start        V            N
         (.3 * .5)    (.4 * .9)
So this corresponds to the path in which the first occurrence of ground is labeled a verb, and the second a noun. Let's review where these transition probabilities come from.

The probability model being used is the following.

That is, the joint probability of a word sequence n words long and a tag sequence n + 1 tags long is equal to the product of the probabilities of each word given its tag times the probability of each tag given the previous tag. There is one extra tag because we take the tag at t=0 to be start. For each wi, then, we get a factor:
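The formula itself is not reproduced in this text. The following is a reconstruction from the verbal description above, so the notation (w for words, t for tags) is an assumption rather than the original typesetting:

```latex
P(w_1 \ldots w_n,\; t_0 \ldots t_n)
  \;=\; \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}),
  \qquad t_0 = \mathrm{Start}

% The factor contributed by each word w_i:
P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```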

For our example, according to this probability model we calculate the joint probability to be:

This is the product of the transition probabilities for this path through the HMM. There are three others.
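As a check on the arithmetic, here is a minimal Python sketch that multiplies the four factors shown in the aligned example for the path Start V N. The text does not say which number in each pair is the word-given-tag probability and which is the tag-given-previous-tag probability, so the sketch only multiplies them; the variable names are illustrative.

```python
import math

# Factors for the path Start -> V -> N, copied from the aligned example.
# Each pair holds the two numbers shown under one occurrence of "ground".
factors = [(0.3, 0.5), (0.4, 0.9)]

# The path probability is the product of all the factors along the path.
path_probability = math.prod(a * b for a, b in factors)
print(round(path_probability, 6))
```

The product comes out at about 0.054 for this path.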

To find the most likely assignment of tags we need to find the most probable path through the HMM. This is what the Viterbi algorithm is for.
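The idea can be sketched in a few lines of Python. This is only an illustration, not the course's reference solution: the function name `viterbi`, the dictionary layout, and every probability below are made-up placeholders, NOT the values from the assignment's model.

```python
def viterbi(words, tags, trans, emit, start="Start"):
    """Return (best_probability, best_tag_path) for words under a bigram HMM.

    trans[(prev_tag, tag)] is P(tag | prev_tag).
    emit[(word, tag)] is P(word | tag).
    Missing entries are treated as probability 0.
    """
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {start: (1.0, [start])}
    for word in words:
        new_best = {}
        for tag in tags:
            e = emit.get((word, tag), 0.0)
            # Extend every surviving path by this tag; keep only the best one.
            candidates = [
                (p * trans.get((prev, tag), 0.0) * e, path + [tag])
                for prev, (p, path) in best.items()
            ]
            new_best[tag] = max(candidates, key=lambda c: c[0])
        best = new_best
    # The overall winner is the best of the paths ending in each tag.
    return max(best.values(), key=lambda c: c[0])


# Hypothetical placeholder model (NOT the assignment's numbers).
TAGS = ("V", "N")
TRANS = {("Start", "V"): 0.4, ("Start", "N"): 0.6,
         ("V", "V"): 0.2, ("V", "N"): 0.8,
         ("N", "V"): 0.5, ("N", "N"): 0.5}
EMIT = {("ground", "V"): 0.5, ("ground", "N"): 0.25}

prob, path = viterbi(["ground", "ground"], TAGS, TRANS, EMIT)
print(path, prob)
```

The point of the design is that only one path per tag survives each step, so the work grows linearly with the length of the input instead of doubling with every word.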


Problem 3 Proper

Part A

The above HMM was given with only a partial probability model. Here is the entire probability model:

The first part of the problem is to use this probability model to complete the transition table for the above HMM tagger by filling in the transition probabilities for control and station.

Part B

The second part of the tagging problem is to tag the following input:

This can be done by computing the products of the transition probabilities (called the path probabilities) for all 16 paths through the HMM and choosing the most probable path.
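The count of 16 is 2 candidate tags for each of 4 words (2 * 2 * 2 * 2). A small Python sketch of the enumeration, assuming a four-word input as that count implies; the helper name `all_paths` is hypothetical:

```python
from itertools import product

TAGS = ("V", "N")


def all_paths(n_words, tags=TAGS, start="Start"):
    """List every state sequence for an input of n_words words."""
    return [(start,) + seq for seq in product(tags, repeat=n_words)]


paths = all_paths(4)
print(len(paths))
for p in paths[:3]:
    print(" ".join(p))
```

For a two-word input the same helper yields the four sequences listed earlier for ground ground, though not necessarily in the same order.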

But the assigned problem is to choose the most probable path by using the Viterbi algorithm.

To help you get started, here is the partial Viterbi matrix for our HMM and the given input:

Note that the Viterbi values for t=0 have already been filled in. Continue the matrix and fill in the values for t=1, t=2, and t=3. Show your calculations. Use the Viterbi homework assignment and Viterbi lecture as your model of what to show.

Part C

Using the results of your Viterbi calculation, give the most probable state sequence through the HMM:


HMM Tagging Problem: Part II

Write a program that produces a probability model for an HMM bigram tagger using a tagged corpus. For a review of what the probability model is, look here. NOTE: You are NOT being asked to write a tagger, just a program that produces the probability model such a tagger uses.

To help you out here are some models to modify:

  1. Baseline tagger (Perl, Python)
  2. New! New! New! Interactive Python session executing relevant bits of Python code!

The baseline tagger executes the "baseline" strategy: for each word, it assigns the most frequent tag for that word.
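The baseline strategy can be sketched as follows. This is not the course's baseline tagger code, just a minimal illustration; the function name `train_baseline` and the toy lines are assumptions, and the sketch assumes the word_TAG token format used in the training data.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_lines):
    """Map each word to the tag it occurs with most often."""
    counts = defaultdict(Counter)
    for line in tagged_lines:
        for token in line.split():
            # Split on the LAST underscore so tokens like "?_?" work too.
            word, tag = token.rsplit("_", 1)
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Toy data for illustration only (not taken from train.tag).
toy_lines = ["ground_N control_N", "ground_V", "ground_N"]
most_common_tag = train_baseline(toy_lines)
print(most_common_tag)
```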

Here I am training and testing the baseline tagger:

[tagger]$ tagger data/train.tag data/test.txt > tr_test1.tag
Reading data/train.tag
Finding most common tags
Reading data/test.txt

train.tag is a file with tagged data in it. The first few lines look like this:
FACTSHEET_NN1 WHAT_DTQ IS_VBZ AIDS_NN1 ?_? 
AIDS_NN1 (_( Acquired_NP0 Immune_AJ0 Deficiency_NN1 Syndrome_NP0 )_) is_VBZ a_AT0 condition_NN1 caused_VVN by_PRP a_AT0 virus_NN1 called_VVD HIV_NP0 (_( Human_AJ0 Immuno_NN1 Deficiency_NP0 Virus_NP0 )_) ._. 
How_AVQ is_VBZ infection_NN1 transmitted_VVD ?_? 
through_PRP unprotected_AJ0 sexual_AJ0 intercourse_NN1 with_PRP an_AT0 infected_AJ0 partner_NN1 ._. 
through_PRP infected_AJ0 blood_NN1 or_CJC blood_NN1 products_NN2 ._. 
from_PRP an_AT0 infected_AJ0 mother_NN1 to_PRP her_DPS baby_NN1 ._. 

Each word is connected to its tag by an underscore ("_"), so you need to separate these two, keep count of how many times each word and tag co-occur, and keep track of tag "bigrams" as well.
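A minimal sketch of that counting step in Python. The function name `count_line` is hypothetical, and treating the tag before the first word of every line as Start is an assumption carried over from Part I rather than something the example code is known to do.

```python
from collections import Counter


def count_line(line, word_tag_counts, tag_bigram_counts, start="Start"):
    """Update both count tables from one line of word_TAG tokens."""
    prev_tag = start
    for token in line.split():
        # Split on the LAST underscore: "?_?" and "(_(" then split cleanly.
        word, tag = token.rsplit("_", 1)
        word_tag_counts[(word, tag)] += 1
        tag_bigram_counts[(prev_tag, tag)] += 1
        prev_tag = tag


word_tag_counts = Counter()
tag_bigram_counts = Counter()
count_line("How_AVQ is_VBZ infection_NN1 transmitted_VVD ?_?",
           word_tag_counts, tag_bigram_counts)
print(word_tag_counts[("is", "VBZ")], tag_bigram_counts[("AVQ", "VBZ")])
```

Because the corpora are line-by-line, calling this once per line and accumulating into the same two counters is enough for the whole training file.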

The data and code you need can be found on bulba under:

/home/ling581/hmm_tagger

Here's a description of the DATA:

    File                        Type         Description
    data/train.tag              training     tagged training data;
                                             train on this!
    data/really_tiny_train.tag               very small subset of tagged training data;
                                             use this only for debugging the training phase!
    data/test.txt               test         untagged test data;
                                             run your baby on this!
    data/test.tag                            tagged test data;
                                             gold standard for test.txt;
                                             evaluate your baby's performance with this!
    data/train.txt                           the data of train.tag, untagged:
                                             run your tagger on this and do real well!
                                             also: for running your tagger without unknown words
    data/tiny_train.txt                      tiny subset of training data, the size of the test data files;
                                             faster max performance test
    data/tiny_train.tag                      tiny tagged subset of training data;
                                             faster max performance test;
                                             gold standard for tiny_train.txt
    data/tiny_test.txt                       subset of test.txt, but a tiny amount, for debugging
    data/tiny_test.tag                       gold standard for tiny_test.txt
    data/valid.tag              development  tagged development test data
    data/valid.txt                           untagged development test data
The corpora are all line-by-line corpora. This means that, as far as possible, each line contains a complete sentence or a complete fragment. It also means adjacent lines are not guaranteed to be meaningfully related.

This is good. It means that in training and testing you can process these corpora on a line by line basis, which is easier in many programming languages, including Perl and Python.

The large training file is also here.

Hint about example code: For this assignment, all you need to pay attention to is the first part of the code, the training step. In the Python code, that part ends here:

        fsock_train.close()

You need to hand in a proper HMM probability model for the corpus train.tag. That will consist of the following:

  1. Word-tag model: For each word-tag pair, the probability of the word given the tag.
  2. Tag-tag model: For each tag-tag pair (t1,t2), the probability of t2 given t1.
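One common way to turn counts into these two models is plain relative frequency (count of the pair divided by count of the conditioning tag). The assignment text does not say whether smoothing is expected, so the unsmoothed estimate below is an assumption, as are the function name `relative_frequencies` and the toy counts.

```python
from collections import Counter


def relative_frequencies(pair_counts, given_index):
    """Turn pair counts into conditional probabilities.

    given_index says which element of each pair is the conditioning tag:
    1 for (word, tag) pairs, 0 for (t1, t2) pairs.
    """
    totals = Counter()
    for pair, n in pair_counts.items():
        totals[pair[given_index]] += n
    return {pair: n / totals[pair[given_index]]
            for pair, n in pair_counts.items()}


# Toy counts for illustration only (not taken from train.tag).
word_tag_counts = Counter({("ground", "N"): 3, ("control", "N"): 1,
                           ("ground", "V"): 2})
tag_tag_counts = Counter({("Start", "N"): 1, ("Start", "V"): 1,
                          ("N", "N"): 2})

word_tag_probs = relative_frequencies(word_tag_counts, given_index=1)
tag_tag_probs = relative_frequencies(tag_tag_counts, given_index=0)
print(word_tag_probs[("ground", "N")], tag_tag_probs[("Start", "V")])
```

With this estimate the probabilities of all words given one tag sum to 1, and so do the probabilities of all next tags given one previous tag.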

Your probability model should be output to a file, which you will hand in. The format is the following. For the word-tag model:

word    tag    prob

Each line contains just the word, the tag, and the probability, in that order, separated by nothing but white space. For the tag-tag model the format is:

tag  tag  prob

The two models should come in the order given above, word-tag model followed by tag-tag model, and they should be separated by a line containing the following:

***END WORD TAG MODEL***
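A minimal sketch of a writer for this layout, assuming the two models are held in dictionaries keyed by pairs. The function name `write_model`, the dictionary layout, and the toy probabilities are hypothetical; tabs are used as the white space, and sorting is only there to make the output order predictable.

```python
import io

SEPARATOR = "***END WORD TAG MODEL***"


def write_model(out, word_tag_probs, tag_tag_probs):
    """Write the word-tag model, the separator line, then the tag-tag model."""
    for (word, tag), prob in sorted(word_tag_probs.items()):
        print(word, tag, prob, sep="\t", file=out)
    print(SEPARATOR, file=out)
    for (t1, t2), prob in sorted(tag_tag_probs.items()):
        print(t1, t2, prob, sep="\t", file=out)


# Toy values for illustration only; an in-memory buffer stands in for a file.
buf = io.StringIO()
write_model(buf, {("ground", "N"): 0.75}, {("Start", "N"): 0.5})
print(buf.getvalue(), end="")
```

Passing an open file handle instead of the `io.StringIO` buffer writes the same lines to disk.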

Here is some Python code illustrating how to output stuff to a file:

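    import sys

    # out_file (a path) and word_count (a dictionary) are assumed to be
    # defined earlier in the script.
    try:
        ## Open file for writing; the with block closes the file handle
        ## for us when we are done (Good citizenship!)
        with open(out_file, 'w') as fsock_out:
            print('Writing to %s' % out_file, file=sys.stderr)  ## Message to STDERR
            ## Make a list of pairs from a dictionary
            word_freq_list = word_count.items()
            for item in word_freq_list:
                # Print each pair to the file, separated by a tab.
                print('%s\t%s' % (item[0], item[1]), file=fsock_out)
    except IOError as err:
        print('Could not write %s: %s' % (out_file, err), file=sys.stderr)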