Many problems in CL and NLP involve

*disambiguation*, i.e., deciding among several possible choices.This can be characterized as making predictions about 'what word is likely to come next'.

Context helps us do this:

which....

for which...

republic for which...

the republic for which...

and to the republic for which ...

There's an old joke about a kid when taught to say the Pledge of Allegiance at a young age applied his own probability assessments, and thought it somehow conscerned a guy named Richard Stands. There are a couple of sites dedicated to

*Mondegreens*like this as applied to song lyrics. Check out www.kissthisguy.com and The Ants are My Friends for some amusing examples. (Thanks to the KPBS show 'A Way with Words' for the reference. Send in your pledge today!)

## A commonly applied formula for making predictions in context is 'Bayes' Law'

## An example: How does the appearence of 'run' effect our predictions that the next word will be 'amok'?

## What puts the 'N' in 'N-gram'.

## We can use N-grams build a probabilistic finite state machine.

## What happens when we increase

*N*

where:

P(x|y) is the probability of x given y.

P(x) is the overall probability of x.

P(y|x) is the probability of y given x.

P(y) is the overall probability of y.

QUIZ...

Say we have a corpus with a million words, 'run' appears a thousand times, and the phrase 'run amok' appears once.

Let's say further that we're using this corpus as the basis for making judgements about some new data, and we know that the last word encountered was 'run'.

We can apply Bayes' law to estimate the probability that the next word will be 'amok'.

P(amok|run)= P(amok) X P(run|amok) / P(run)

Amok occured once in our million-word corpus, so P(amok) is 1/1,000,000)

Run occured 1000 times, so P(run) = 1,000/1,000,000 = 1/1000.

The only time 'amok' occured, it was preceded by 'run', so P(y|x) = 1/1 = 1.

So we can estimate the probability of 'amok' to be 0.000001 X 1 / 0.001 = 0.001, or one in a thousand.

Note that without consulting this one word of context, our best estimate would have been one in a million.

Our 'run amok' example was a

*bigram*, since it plugged the co-occurrance of two subsequent word forms in our corpus.We could do the same thing with

*trigrams*, counting the number of times sequences of three occur.P(it | for which)=P(it) x P(for which | it) / P(for which).

We can carry this to as many positions as we like. But after 'tri', we tend to just use the numeral (4-grams, 5-grams,...). Hence 'N-grams'.

First we go through our corpus and count every configuration of (say) three words that occurs in it.

I pledge allegence

pledge allegence to

allegence to the

to the flag

...

Then we build a finite state machine where the first n-1word forms serve as the label for the state, and the n'th word form labels the link (see illustration).

Each link is

*weighted*in proportion to its probability as indicated in the training corpus.

QUIZ...

In general, the higher the 'n' the more accurate your model.

The higher your 'n' the larger the number of 'N-grams' you have to count. Here's where

*perplexity*, or the average number of choices you have to make at each point in the sequence becomes an issue.If on average you have to chose among 250 different possibilites at each juncture, then an N-gram requires you to keep count of 250-to-the-nth-power different configurations (this can really add up).

There is also the

*sparse data problem*. As your*n*increases, the higher the chance that new data will present you with configurations never encountered in your training corpus.We an address the sparse data problem with N-grams by also counting the n-1-, n-2-, ..., 1-grams, and using some weighted sum of each of them to derive our probability estimate.