Note that most of the cells in our original count tables will be zero.
Most of the time our bigram model assigns probability zero
to a potential following word:
Add-one Smoothing: Basic Idea
|
 
|
We add one to every cell of this table.
We get this table.
We recompute our total occurrences:
- I: 3437 + 1616 = 5053
- want: 1215 + 1616 = 2831
- to: 3256 + 1616 = 4872
- eat: 938 + 1616 = 2554
- Chinese: 213 + 1616 = 1829
- food: 1506 + 1616 = 3122
- lunch: 459 + 1616 = 2075
Now we recompute the probabilities from the add-one counts:
- P(wn | wn-1) = |wn-1 wn| ÷ |wn-1|
- P(food | want) = |want food| ÷ |want| = 9 ÷ 2831 = .0032
- P(to | want) = |want to| ÷ |want| = 787 ÷ 2831 = .28
This gives us this bigram probability table.
Compare this one.
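The recomputation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `add_one_prob` is hypothetical, the vocabulary size 1616 is the amount added to every row total above, and the unsmoothed "want to" count of 786 is inferred from the smoothed count 787. The smoothed total for want is 1215 + 1616 = 2831.

```python
def add_one_prob(bigram_count, context_count, vocab_size):
    """Add-one estimate of P(w_n | w_{n-1})."""
    return (bigram_count + 1) / (context_count + vocab_size)

V = 1616        # vocabulary size: the amount added to every row total
C_WANT = 1215   # unsmoothed count of "want"

# "want to" was seen 786 times (assumed: smoothed count 787 minus 1).
p_to_given_want = add_one_prob(786, C_WANT, V)

# Any word never seen after "want" gets the same small probability.
p_unseen_given_want = add_one_prob(0, C_WANT, V)

print(f"{p_to_given_want:.2f}")       # 0.28
print(f"{p_unseen_given_want:.5f}")   # 0.00035
```

Every zero cell in the same row shares one denominator, which is why all former zeros in a row end up with the same probability.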
Some things to notice:
- The events that used to be zeroes don't all have the same probability.
- All the events in the same row that were zeros in the old model get
the same probability in the new model.
- ALL the non-zero probabilities went down.
- Sometimes the change doesn't look very large:
  - P(eat | I) [.0038 -> .0028]
  - P(I | to) [.00092 -> .00082]
- Some very predictable events became less predictable:
  - P(to | want) [.65 -> .28]
  - P(food | Chinese) [.56 -> .066]
- Other probabilities changed by large factors:
  - P(lunch | Chinese) [.0047 -> .0011]
  - P(food | want) [.0066 -> .0032]
- Likelihood ratios changed:
  - old model: P(I | lunch) = 4 * P(food | lunch)
  - new model: P(I | lunch) = 2.5 * P(food | lunch)
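The change in likelihood ratios follows directly from adding one to both counts. Here is a small sketch, assuming bigram counts of 4 for "lunch I" and 1 for "lunch food". These values are chosen to match the 4:1 ratio quoted above and are not stated in these notes.

```python
V = 1616        # vocabulary size
C_LUNCH = 459   # unsmoothed count of "lunch"

# Assumed bigram counts, consistent with the 4:1 ratio in the old model.
c_lunch_i, c_lunch_food = 4, 1

# Unsmoothed (maximum-likelihood) estimates.
old_ratio = (c_lunch_i / C_LUNCH) / (c_lunch_food / C_LUNCH)

# Add-one estimates: the shared denominator cancels, leaving (4 + 1) / (1 + 1).
new_i = (c_lunch_i + 1) / (C_LUNCH + V)
new_food = (c_lunch_food + 1) / (C_LUNCH + V)
new_ratio = new_i / new_food

print(old_ratio, new_ratio)
```

Because the same constant is added to counts of very different sizes, small counts grow proportionally more than large ones, which compresses the ratios between them.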
Conclusion: Increasing the zero
probabilities from zero to a small number was good,
but the effect on the non-zero probabilities
was not always good. We're blurring our original model.
- We've assigned too much probability to the zeros,
with the result that sharply predictable events [P(to|want)] became much
less so, and some moderately rare events became very rare.
- We want a model that changes the existing model less, but still steals
away some probability to assign to the zero events.
|
What Went Wrong
|
 
|
If we're going to assign probability to zero-events, the
probabilities of the other events have to go down.
Why? Because the probability of all the possible events we're looking
at must add up to 1.
Take the case of want:
- Count before smoothing: 1215
- Count after smoothing: 1215 + 1616 = 2831
- Number of word types not seen to follow want (estimating):
  - The top 4 words (to, a, some, Thai) account for .75 of the probability mass.
  - Tokens not in the top 4: 304 (.25 * 1215).
  - At most 304 of the other 1612 word types (1616 - 4) can have been seen, which leaves a minimum of 1308 (1612 - 304) word types never before seen to follow want.
- This means that, in the model, following want,
almost half of the probability mass is
reserved for unseen events: 1308 events, each
of which has probability 1/2831.
- 1308 ÷ 2831 = .46
- Which means the total probability of all the previously seen
words has to go down precipitously (1.0 -> .54)
It's easy to see what the extreme case would be.
Suppose the word to always followed the word want in our corpus, but that want was a much rarer word, say, with count 100.
Even in that case, we'd still have pretty good evidence that to was extremely likely after want. Our initial model would assign probability 1.
What would happen with add-one smoothing?
- Count before smoothing: 100
- Count after smoothing: 100 + 1616 = 1716
- Number of word types not seen to follow want: 1615.
- Probability for unseen events: 1615 ÷ 1716 = .94
- P(to|want) after smoothing: 101 ÷ 1716 = .06
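Both cases can be checked with a short calculation. This is a sketch rather than part of the original notes. The helper name `add_one_unseen_mass` is hypothetical, and the figure of 1308 unseen followers of want is the lower-bound estimate from the text.

```python
def add_one_unseen_mass(context_count, unseen_types, vocab_size):
    """Total add-one probability given to word types never seen after the context."""
    # Each unseen type gets (0 + 1) / (context_count + vocab_size).
    return unseen_types / (context_count + vocab_size)

V = 1616

# "want" in the real corpus: 1215 tokens, at least 1308 unseen followers.
mass_want = add_one_unseen_mass(1215, 1308, V)

# Hypothetical faithful "want": 100 tokens, always followed by "to",
# so every other word type (V - 1 of them) is unseen.
mass_faithful = add_one_unseen_mass(100, V - 1, V)
p_to = (100 + 1) / (100 + V)

print(f"{mass_want:.2f} {mass_faithful:.2f} {p_to:.2f}")   # 0.46 0.94 0.06
```

In the faithful case the unseen mass and P(to|want) together account for the whole distribution, so a word that was followed by to 100 times out of 100 keeps only about 6% of the probability.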
|
Witten-Bell Discounting: The Idea
|
 
|
Key idea: Some words are promiscuous (they occur with a wide variety of words relative to their frequency).
Some are faithful: they occur with a very small number of words given their frequency.
Our fictional example of want was a maximally faithful word: 100 occurrences, all followed by the same word, to.
Key idea: Find a way of measuring word promiscuity. Relativize the amount of probability mass a word receives for zeros to how promiscuous it is.
The more promiscuous a word is, the more probability mass it receives for following zeros (the more likely it is that we haven't seen all the words that can follow it in any given corpus).
|
|
Probability of a New Event
|
 
|
Probability of seeing a new type:
T ÷ (N + T)
T is the number of observed types.
N is the number of word tokens in the corpus.
N + T = the number of tokens plus the number of types.
The corpus is viewed as a set of N + T events: N tokens, plus T occasions on which a type was seen for the first time.
|
Unigram Discounting
|
 
|
We will use
T ÷ (N + T)
as our estimate of how much probability mass to reserve for zeros.
If we divide this equally, and there are Z zero n-grams, each zero n-gram gets this much:
T ÷ (Z * (N + T))
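A minimal sketch of the unigram case on a toy corpus follows. The function name `witten_bell_unigram`, the toy tokens, and the vocabulary are all made up for illustration. The seen-word estimate c ÷ (N + T) is an assumption here: it is the discount that makes the total come to 1, and it mirrors the bigram formula. The sketch also assumes the vocabulary contains every observed token and at least one unseen type.

```python
from collections import Counter

def witten_bell_unigram(tokens, vocab):
    """Witten-Bell unigram probabilities for every word in vocab."""
    counts = Counter(tokens)
    n = len(tokens)                      # N: number of tokens
    t = len(counts)                      # T: number of observed types
    z = len(set(vocab) - set(counts))    # Z: types with zero count
    probs = {}
    for w in vocab:
        if counts[w] > 0:
            probs[w] = counts[w] / (n + t)    # discounted seen word (assumed form)
        else:
            probs[w] = t / (z * (n + t))      # equal share of the reserved mass
    return probs

tokens = "a b a c a b".split()           # toy corpus: N = 6, T = 3
vocab = ["a", "b", "c", "d", "e"]        # "d" and "e" are unseen, so Z = 2
probs = witten_bell_unigram(tokens, vocab)

print(probs["d"])                        # T / (Z * (N + T)) = 3 / (2 * 9)
print(sum(probs.values()))
```

The two unseen words split the reserved mass T ÷ (N + T) = 3 ÷ 9 evenly, and the seen words share the remaining 6 ÷ 9 in proportion to their counts.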
|
Bigram Discounting
|
 
|
We relativize the probability of seeing a new type to each word w.
This becomes our promiscuity measure.
T(w) ÷ (N(w) + T(w))
The number of word types following w (T(w)) divided by the sum of the
number of word tokens following w (N(w) = c(w), the count of w) and the number of types
following w (T(w)). For an absolutely faithful word like our fictional
want, what is this?
Total prob mass reserved for want = 1 ÷ (100 + 1)
How about a maximally promiscuous word with the same frequency, one followed by a different word every time?
Total prob mass reserved for that word = 100 ÷ (100 + 100)
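The two extremes can be computed directly from a table of follower counts. This is an illustrative sketch: the function name `reserved_mass` and the placeholder follower words are hypothetical.

```python
def reserved_mass(followers):
    """T(w) / (N(w) + T(w)) for one context word w.

    `followers` maps each word seen after w to its bigram count.
    """
    t = len(followers)             # T(w): distinct types seen after w
    n = sum(followers.values())    # N(w): tokens seen after w
    return t / (n + t)

# Maximally faithful: 100 occurrences, always followed by "to".
faithful = {"to": 100}

# Maximally promiscuous: 100 occurrences, each followed by a different word.
promiscuous = {f"word{i}": 1 for i in range(100)}

print(reserved_mass(faithful))      # 1 / (100 + 1)
print(reserved_mass(promiscuous))   # 100 / (100 + 100)
```

With the same frequency, the promiscuous word reserves half of its probability mass for unseen followers, while the faithful word reserves about 1%.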
Let's use Z(w) for the number of word types NOT seen to follow w. Then the probability mass reserved for unseen words has to be divided among those Z(w) words:
prob(w'|w) = T(w) ÷ (Z(w) * (N(w) + T(w)))
where w' has never been seen to follow w
(c(ww') = 0).
The tricky thing is that each probability
for a seen bigram has to get reduced by
the right amount to make everything add up to 1:
p(w'|w) = c(ww') ÷ (c(w) + T(w))
where w' has been seen to follow w
(c(ww') is bigger than 0).
The bigram counts will add up to c(w) (=N(w)).
So the total probability mass for SEEN
bigrams following w will be:
N(w) ÷ (N(w) + T(w))
leaving us:
T(w) ÷ (N(w) + T(w))
for the unseen bigrams. And this agrees
with the amount we decided to reserve for
them!
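The two formulas can be combined into one small function to confirm that the seen and unseen masses add up to 1. This is a sketch under stated assumptions, not a reference implementation. The name `witten_bell_bigram`, the five-word vocabulary, and the follower counts are invented for illustration. It assumes every seen follower is in the vocabulary and that at least one vocabulary word is unseen (Z(w) > 0).

```python
def witten_bell_bigram(followers, vocab):
    """Witten-Bell estimate of P(w' | w) for every w' in vocab.

    `followers` maps each word seen after w to the bigram count c(w w').
    """
    t = len(followers)             # T(w): types seen after w
    n = sum(followers.values())    # N(w) = c(w): tokens seen after w
    z = len(vocab) - t             # Z(w): types never seen after w
    probs = {}
    for w_next in vocab:
        c = followers.get(w_next, 0)
        if c > 0:
            probs[w_next] = c / (n + t)          # discounted seen bigram
        else:
            probs[w_next] = t / (z * (n + t))    # equal share of reserved mass
    return probs

vocab = ["to", "a", "some", "food", "lunch"]
followers_of_want = {"to": 6, "a": 3, "some": 1}   # toy counts: N = 10, T = 3

probs = witten_bell_bigram(followers_of_want, vocab)
seen_mass = sum(p for w, p in probs.items() if w in followers_of_want)
unseen_mass = sum(p for w, p in probs.items() if w not in followers_of_want)

print(seen_mass, unseen_mass)   # N/(N+T) = 10/13 and T/(N+T) = 3/13
```

Unlike add-one, the size of the discount depends on how many distinct followers a word has, not on the size of the vocabulary.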
Berkeley Restaurant example revisited:
- Witten-Bell smoothed bigram counts.
- Unsmoothed bigram counts.
- Add-One smoothed bigram counts.
Smoothed counts: Multiply smoothed probabilities
by corpus size N.
|