Linguistics 581

Ngram language models


Terminology

The important terms introduced at the beginning of the chapter:
  1. lemma: a set of lexical forms having the same stem, the same major part of speech, and the same word sense. Lemmatization is often used a s a synonym of stemming, but properly it includes the problem of word sense recognition:
    1. a flight of steps
    2. their flight from reality
    3. the flight to Rio
    Properly speaking the three italicized forms belong to three different lemmas, which all include the same plural form flights.
  2. type: The word as opposed to a particular occurrence of the word. The number of types in a corpus is the size of the vocabulary.
  3. token: A particular occurence of a word in speech or in text. The number of tokens in a corpus is the size of the corpus. The Brown corpus had one million tokens and many, many fewer types.

Simple N-grams

For a variety of reason we want to be able to predict the next word given a sequence of words.

Applications where this might matter:
  1. augmentation communication systems (systems allowing disabled speakers to communicate with minimal effort)
  2. speech recognition
  3. Context-sensitive spelling correction
    1. They are leaving in about fifteen minuets to go home.
    2. I need to notified the bank of this problem.
  4. Discovering collocations, technical terminology, complex names (for a variety of applications, including MT)
    1. linear function
    2. regression coefficient
    3. Gaussian random variable
    4. New York
    5. powerful computers (? strong computers)
    6. strong support (?powerful support)

We start with the problem: How do we formulate the probability of a string of words?
Chain rule  
  1. Prob(A,B) = Prob(A|B) * Prob (B)
    Prob(A,B) = Prob(B|A) * Prob (A)
  2. Prob(A,B,C) = Prob(C|A,B) * Prob (A,B)
    Prob(A,B,C) = Prob(C|A,B) * Prob(B|A) * Prob (A)
    Prob(A,B,C) = Prob(A) * Prob(B|A) * Prob(C|A,B)
  3. Prob(A,B,C,D) = Prob(D|A,B,C) * Prob (A,B,C)
    Prob(A,B,C,D) = Prob(D|A,B,C) * Prob(A) * Prob(B|A) * Prob(C|A,B)
    Prob(A,B,C,D) = Prob(A) * Prob(B|A) * Prob(C|A,B) * Prob(D|A,B,C)
Prob(A1,..., An) = Prob(A1) * Prob(A2|A1) * ... Prob(An|A1,A2, ... , An-1)
Word
Sequences
  We abbreviate word sequences w1, w2, w3,..., wn as:
  • w1n

We now look at an n-word word sequence w1n as a joint event consisting of word 1 occuring in the first position, word 2 occurring in the second position, and so on.

This lets us use the chain rule:

  • Prob(w1,..., wn) = Prob(w1) * Prob(w2|w1) * ... Prob(wn|w1,w2, ... , wn-1)

Using our abbreviation:

  • Prob(w1n) = Prob(w1) * Prob(w2|w1) * ... Prob(wn|w1n-1)
Sparseness   The problem with this formulation of the probability can be seen by looking at the last term:
  • Prob(wn|w1n-1)
This is the probability of the last word given the entire sequence of words before it.

How would we compute such a thing?

It's easy!

  • |w1n|÷ |w1n-1|
To do this right we need some corpus large enough to give us a representative sampling of of the n-word strings beginning with w1n-1, of which there is hopefully a representative sample of w1n.

The problem is that there isn't enough data in the world to get such representative samples in most cases for even moderately small n.

Consider a very small n. Consider Shakespeare.

  1. Word token Count: 884,647
  2. Word form Types: 29,066 (including lots of proper nouns)
  3. Number of bigram types = 29,0662 = 844 million
  4. Number of bigram tokens: 884,647
In any corpus of this size, we're very unlikely to see most of the rarer bigrams.

Now consider Shakerspeare and trigrams:

  1. Word token Count: 884,647
  2. Word form Types: 29,066 (including lots of proper nouns)
  3. Number of trigram types = 29,0663 = 25,636,000,000,000
  4. Number of trigram tokens: 884,647
We get only a vanishing small sample of the entire space of trigrams. We're likely to encounter only the most common ones. Evidence: After training on 1.5 million words from IBM Laser Patent Text Corpus:
    Bahl, et al (1983) reported that 23% of the trigrams in unseen data were new!
Uggh!

In fact trigrams ina text arent indpenedent:

  1. Each partially overlaps (and partly helps determine) the next one
  2. Content words tend to clump (a word's appearnce in a document is one of the best predictors of its later appearance).
  3. Doesn't take into account auctorial tendencies (the Shakespeare corpus).
Approximation  

Since we can't compute the probability we want directly, we approximate:

  • Prob(wn|w1n-1)=app Prob(wn|wn-1)

Why is this a reasonable approximation?

  • |wn-1n|÷ |wn-1| =app |w1n|÷ |w1n-1|

Take a special case:

      Predicting the next word after distance
     
    • long distance ____
      What we're saying for this special case is that:
    • |long distance call| ÷ |long distance| =app |distance call| ÷ |distance|
    This is is exactly right when adding long in front of distance neither makes call more nor less likely. It has no effect. The occurrence of that that word call is completely independent of of the occurrence of long two words back.

    So what our approximation claims is that the strongest conditioning effect on what comes next is from the last word. To the extent that the word two words back has a conditioning effect we're wrong. In our example, it seems plausible that long is a factor in making the word call more likely to come next.

    What would be a better approximation?

    • |wn-1n|÷ |wn-1| =app |w1n|÷ |w1n-1|
    Look TWO words back. Now we see long. We still have other possible continuations:
    • long distance runner
    But we make much less likely continuations like:
    • distance himself from the opposition

So we see that lloking two words back always (theoretically) gives us better model for predicting the next word than lokinmg just one word back.

But wait if we have:

  • collect long distance ___
we're even more likely to have the word call occur next (running and runner are made much less likely for example).

So in general:
The longer the history we look at, the more precise our approximation will be, until in the limit, it is no longer an approximation, because we look at the entire history.

Bigram
Model
  A probability model that looks one word back to try to predict the next word is called a bigram model.
Trigram
Model
  A probability model that looks two words back to try to predict the next word is called a trigram model.
Berkeley
restaurant
project
  A query system for finding Berkeley restaurants, providing a small corpus of queries about food and restaurants in Berkeley.
  1. I'm looking for cantonese food.
  2. I'd like to eat dinner someplace nearby.
  3. Tell me about Chez Panisse.
the basic
idea
 

We count. For each word, we count up occurrences of following words and array them in a bigram table.

We also total up the total number of occurrences of each word,
I 3437
want 1215
to 3256
eat 938
Chinese 213
food 1506
lunch 459
We compute probabilities.

  • P(wn | wn-1) = |wn-1wn| ÷ |wn |
  • P(to | want) = |want to| ÷ |want| = 786 ÷ 1215 = .65
Now we can build up tables with probabilities instead of counts.


Smoothing

Note that most of the cells in our original count tables will be zero.

  1. We don't see many of the words in English.
  2. We don't see the huge majority of the bigrams of English.
  3. We see only a tiny sliver of all the possible trigramns

Most of the time our bigram model assigns probability zero to a potential following word:

Probability zero means it can't happen. But we aren't entitled to reach that conclusion.
Add-one
Smoothing
basic idea
 

We add one to every cell of this table

We get this table

We recompute our our total occurrences:

  • I 3437 +1616 =5053
  • want 1215 + 1616 = 2931
  • to 3256 + 1616 = 4872
  • eat 938 + 1616 = 2554
  • Chinese 213 + 1616 = 1829
  • food 1506 + 1616 = 3122
  • lunch 459 + 1616 = 2075

Now we recompute the probabilities:

  • P(wn | wn-1) = |wn-1wn| ÷ |wn |
  • P(food | want) = |want to| ÷ |want| = 1 ÷ 2931 = .0003
  • P(to | want) = |want to| ÷ |want| = 787 ÷ 2931 = .27

This gives us this bigram probability table.

Compare this one.

Some things to notice:

  1. The events that used to be zeroes don't all have the same probability.
  2. All the events in the same row that were zeros in the old model get the same probability in the new model.
  3. ALL the non-zero probabilities went down.
  4. Sometimes the change doesn't look very large
    1. P(eat | I)[.0038 -> .0028]
    2. P(I | to)[.00092 -> .00082]
  5. Some very predictable events became less predictable:
    1. P(to|want)[.65-> .22]
    2. P(food|Chinese) [.56 -> .066]
  6. Other probabilities changed by large factors.
    1. P(lunch|Chinese) [.0047 -> .0011]
    2. P(food|want) [.0066 -> .0032]
  7. Likelihood ratios changed
    1. old model: P(I|lunch) = 4 * P(food|lunch)
    2. new model: P(I|lunch) = 2.5 * P(food|lunch)

Conclusion: Increasing the zero probabilities from zero to a small number was good, but the effect on the non-zero probabilities was not always good. We're blurring our original model.

  1. We've assigned too much probability to the zeros, with the result that sharply predictable events [P(to|want)] became much less so, and some moderately rare events became very rare.
  2. We want a model that changes the existing model less, but still steals away some probability to assign to the zero events.
What went
Wrong
 

If we're going to assign the probability to zero-events, the probabilities of others has to go down.

Why? Because the probability of all the possible events we're looking at must add up to 1.

Take the case of want:

  • Count before smoothing: 1215
  • Count after smoothing: 1215 + 1616 = 2931
  • Number of word types not seen to follow want (estimating):
      Top 4 words (to, a, some, Thai) = .75 of the probability mass
      tokens not in top 4 = 304
            (.25 * 1215)
      a minimum of 1308 (1612 - 304) words never-before seen to follow want
  • This means that, in the model, following want, almost half of the probability mass is reserved for unseen events, 1308 events each of which has the probability 1/1308.
    • 1308 ÷ 2931 = .45
    • Which means the probability of all the previously seen words has to go down precipitously (1.0 -> .55)

    It's easy to see what the extreme case would be. Suppose the word to always followed the word want in our corpus but that want was a much rarer word, say, with count 100.

    Even in that case, we'd still have pretty good evidence that to was extremely likely after want. Our initial model would assign probability 1.

    What would happen with add-one smoothing?

    • Count before smoothing: 100
    • Count after smoothing: 100 + 1616 = 1716
    • Number of word types not seen to follow want: 1615.
    • probability for unseen events: 1615 / 1716 = .94
    • p(to|want) after smoothing = .06
Witten Bell
Discounting
The Idea
 

Key idea; Some words are promiscuous (they occur with a wide variety of words relative to their frequency).

Some are faithful: They occur with a very small number of words given their frequency.

Our fictional example of want was a maximally faithful word. 100 occurrences all followed by the the same word to.

Key Idea: Find a way of measuring word promiscuity. Relativize the amount of prpobability mass a worfd receives for zero's to how promiscuopus it is.

    The more promiscuous a word is, the word probability mass it receives for following zeroes (the more likley it is that we havent seen all the words that can follow it in any given corpus).
Probability of a new event  

Probability oif seeing a new type:

    T ÷ (N + T)
T is the number of observed types. N is the number of words in corpus:
    N + T = the number of words plus the number of types
Corpus viewed as a set of N + T events.
Unigram
Discounting
 

We will use

    T ÷ (N + T)
as our estimate of how much probability mass to reserve for zeros.

if we divided this equally, anmd there are Z zero ngrams: each 0 ngram would get this much

    T ÷ (Z*(N + T))
Bigram
Discounting
 

We relativize the probabilit of seeing a new type to each wor w.

This becomes our promiscuity measure.

    T(w) ÷ (N(w) + T(w))
The number of word types following w (T(w)) divided by the sum of the number of word tokens following w (= c(w), the count of w) and types following w (T(w). For an absolutely faithful word like out fictional want, what is this?
    Total prob mass reserved for want = 1 ÷ (100 + 1)
How about a maxially promiscuous word with the same frequency:
    Total prob mass reserved for want = 100 ÷ (100 + 100)

Lets use Z(w) for the count of the words NOT seen to follow w. Then our new conditional probability for an unseen word has to be divided among those Z words:

    prob(w'|w) = T(w) ÷ (Z(w)(N(w) + T(w)))
where w' has never been seen to follow w (c(ww')=0).

The tricky thing thing that each probability for a seen bigram has to get reduced by the right amount to make everything add up to 1:

    p(w'|w)= c(ww') ÷ (c(w) + T(w))
where w' has never been seen to follow w (c(ww') is bigger than 0).

The bigram counts will add up to c(w) (=N(w)). So the total probability mass for SEEN bigrams following w will be:

    N(w) ÷ (N(w) + T(w))
leaving us:
    T(w) ÷ (N(w) + T(w))
for the unseen bigrams. And this agrees with the amount we decided to reserve for them!

Berkeley Restaurant example revisited:

  1. Witten-Bell smoothed bigram counts.
  2. Unsmoothed bigram counts.
  3. Add-One smoothed bigram counts.
Smoothed counts: Multiply smoothed probabilities by corpus size N.