Final Bigram Problem:
Unsmoothed probabilities
- P(the|in) = c(in,the)/c(in) = 324 / 1476 = 0.219
- P(in|the) = c(the,in)/c(the) = 0 / 4071 = 0
- P(in the) = c(in,the) / N = 324 / 89141 = 0.0036
- P(S-T-A-R-T The) = c(S-T-A-R-T,The) / N = 414 / 89141 = 0.0046
- P(to|the) = c(the,to) / c(the) = 0 / 4071 = 0
- P(the government) = c(the,government) / N = 25 / 89141 = 0.00028
- P(billion in) = c(billion,in) / N = 25 / 89141 = 0.00028
- P(in|billion) = c(billion,in) / c(billion) = 25 / 108 = 0.231
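The two kinds of estimate above differ only in the denominator: a conditional probability divides by the count of the first word, a joint probability divides by N. A minimal Python 3 sketch using the counts quoted above; the function names conditional_mle and joint_mle are illustrative and not part of the assignment:

```python
# Counts quoted in the answers above (from the WSJ sample).
N = 89141  # total word tokens in the corpus


def conditional_mle(bigram_count, first_word_count):
    """Unsmoothed P(w2 | w1) = c(w1, w2) / c(w1)."""
    return bigram_count / first_word_count


def joint_mle(bigram_count, n_tokens=N):
    """Unsmoothed P(w1 w2) = c(w1, w2) / N."""
    return bigram_count / n_tokens


print("P(the|in)     = %.3f" % conditional_mle(324, 1476))
print("P(in the)     = %.4f" % joint_mle(324))
print("P(in|billion) = %.3f" % conditional_mle(25, 108))
print("P(in|the)     = %.3f" % conditional_mle(0, 4071))
```

Any bigram with a zero count gets probability 0 under these estimates, which is what the smoothing methods below are meant to address.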
Smoothed probabilities:
Add one smoothing:
The size of the corpus is given by:
[wsj]$ wc ws950228.1line
5155 89141 481254 ws950228.1line
Here wc is the "word count" utility in Unix. It
reports the line, word, and character counts, in that order. So
N = 89141
The following Python snippet gets the same answer:
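>>> import os
>>> os.chdir('/var/www/html/wsj/')
>>> fh = open('ws950228.wfreq', 'r')
>>> ct = 0
>>> for line in fh:
...     words = line.strip().split()
...     if len(words) == 2:            # only lines with exactly two fields
...         try:
...             ct += int(words[1])    # second field is the frequency
...         except ValueError:         # skip lines whose second field is not a number
...             pass
...
>>> fh.close()
>>> ct
89141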
The size of the vocabulary is the number of types in the corpus.
One way to compute this is to count
the number of lines in ws950228.wfreq.sorted.
[gawron@bulba wsj]$ wc ws950228.wfreq.sorted
11340 22680 116124 ws950228.wfreq.sorted
Here 11340 is the number of lines, 22680 the number
of words and 116124 the number of characters.
So
V = 11340
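Add-one smoothing adds 1 to every count. A conditional probability ranges over V possible next words, so V is added to its denominator; a joint bigram probability ranges over V^2 possible bigrams, so V^2 = 11340^2 = 128,595,600 is added to its denominator:
- P(the|in) = (c(in,the) + 1) / (c(in) + V) = (324 + 1) / (1476 + 11340) = 0.025
- P(in|the) = (c(the,in) + 1) / (c(the) + V) = (0 + 1) / (4071 + 11340) = 6.489e-05
- P(in the) = (c(in,the) + 1) / (N + V^2) = (324 + 1) / (89141 + 11340^2) = 2.526e-06
- P(S-T-A-R-T The) = (c(S-T-A-R-T,The) + 1) / (N + V^2) = (414 + 1) / (89141 + 11340^2) = 3.225e-06
- P(to|the) = (c(the,to) + 1) / (c(the) + V) = (0 + 1) / (4071 + 11340) = 6.489e-05
- P(the government) = (c(the,government) + 1) / (N + V^2) = (25 + 1) / (89141 + 11340^2) = 2.02e-07
- P(billion in) = (c(billion,in) + 1) / (N + V^2) = (25 + 1) / (89141 + 11340^2) = 2.02e-07
- P(in|billion) = (c(billion,in) + 1) / (c(billion) + V) = (25 + 1) / (108 + 11340) = 0.00227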
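The add-one figures above can be checked with a short Python 3 sketch. It assumes the values of N and V derived earlier; the function names add_one_conditional and add_one_joint are illustrative only:

```python
N = 89141   # word tokens in the corpus
V = 11340   # word types (vocabulary size)


def add_one_conditional(bigram_count, first_word_count, vocab_size=V):
    """Add-one estimate of P(w2 | w1): (c(w1,w2) + 1) / (c(w1) + V)."""
    return (bigram_count + 1) / (first_word_count + vocab_size)


def add_one_joint(bigram_count, n_tokens=N, vocab_size=V):
    """Add-one estimate of P(w1 w2): (c(w1,w2) + 1) / (N + V**2)."""
    return (bigram_count + 1) / (n_tokens + vocab_size ** 2)


print("P(the|in)     = %.3g" % add_one_conditional(324, 1476))
print("P(in|the)     = %.4g" % add_one_conditional(0, 4071))
print("P(in the)     = %.4g" % add_one_joint(324))
print("P(in|billion) = %.3g" % add_one_conditional(25, 108))
```

Because V is large relative to the individual word counts, the smoothed conditional probabilities of seen bigrams drop by roughly an order of magnitude or more compared with the unsmoothed ones.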
Good-Turing questions: The first few questions
were answered by inspecting the file ws950228.bigram.sorted.
- What is the frequency of frequency 1 for bigrams (the number
of bigrams that occur 1 time)? 41,162
| Line no. of first freq 1 bigram | = | 1 |
| Line no. of last freq 1 bigram | = | 41,162 |
| Count | = | 41,162 |
- What is the frequency of frequency 2?
| Line no. of last freq 2 bigram | = | 46,780 |
| Line no. of last freq 1 bigram | = | 41,162 |
| Difference | = | 5,618 |
- What is the discounted count (c*)
of the bigram same problem?
The bigram same problem occurs on
line 29981 and has freq 1. Using Eq 4.26 from J&M,
section 4.5.2:
c* = (c + 1) * N_{c+1} / N_c
Plugging in the numbers (c = 1, N_1 = 41162, N_2 = 5618):
c*(same problem) = 2 * 5618/41162 = 0.273
- What is the re-estimated Good-Turing
probability of the bigram same bank?
(I am asking for P(same bank),
not P(bank | same)).
This bigram has count 0. We estimate the prob using Eqn
4.27 for the missing mass (the amount of prob reserved for
things that occurred 0 times):
P*GT = N_1 / N
Intuitively this is the proportion of times we saw something
never seen before, so it is the probability that the (N+1)st bigram
will be something unseen.
So:
P*GT = 41162/89141 = .4618
We divide this among the unseen bigrams, whose number is the vocabulary size
squared minus the seen bigrams:
| 11340 * 11340 | = | 128,595,600 |
| 128,595,600 - 89,141 | = | 128,506,459 |
So the GT estimate for the probability of any unseen
bigram is
.4618/128,506,459 = 3.594 * 10^-9
So in particular this is the GT estimate for our given
bigram:
P*GT(same bank) = 3.594 * 10^-9
- What is the number of unseen bigrams?
(assuming that V is just the words seen in the corpus).
We computed this above: 128,506,459.
- What percentage of the bigram types are hapax
legomena (have just one occurrence)?
| Num lines sorted bigrams | = | 51,506 |
| Num hapax legomena | = | 41,162 |
| Percentage hapax legomena | = | 80% |
- What percentage of the
bigram types occur 5 times or fewer?
The first bigram with 6 occurrences in the sorted bigram file is
'000 or' on line 49942, so lines 1 through 49941 hold the bigrams
with 5 or fewer occurrences:
49941/51506 = .9696 = 96.96 %
of the bigram types occur 5 times or fewer in this 89,141-word sample.
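The Good-Turing arithmetic in the answers above can be reproduced with the following Python 3 sketch. All constants are the figures quoted in this section; the names (gt_discounted_count, missing_mass, and so on) are illustrative. As in the worked answer, it assumes that N = 89141 serves both as the number of observed bigrams and as the count subtracted from V^2:

```python
N = 89141            # tokens; used above as the number of observed bigrams
V = 11340            # vocabulary size (word types)
N1 = 41162           # bigram types seen exactly once
N2 = 5618            # bigram types seen exactly twice
BIGRAM_TYPES = 51506 # lines in the sorted bigram file


def gt_discounted_count(c, n_c, n_c_plus_1):
    """Good-Turing discounted count: c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * n_c_plus_1 / n_c


# Discounted count for a bigram seen once (e.g. "same problem").
c_star = gt_discounted_count(1, N1, N2)

# Probability mass reserved for all unseen bigrams: N_1 / N.
missing_mass = N1 / N

# Number of unseen bigrams, following the subtraction used in the text
# (assumption: N stands in for the number of seen bigrams).
unseen = V ** 2 - N

# Share of the missing mass given to any single unseen bigram.
p_unseen = missing_mass / unseen

# Fraction of bigram types that are hapax legomena.
hapax_share = N1 / BIGRAM_TYPES

print("c*(c=1)      = %.3f" % c_star)
print("missing mass = %.4f" % missing_mass)
print("unseen       = %d" % unseen)
print("P(unseen)    = %.3e" % p_unseen)
print("hapax share  = %.3f%%" % (100 * hapax_share))
```

Keeping the missing mass and the unseen-bigram count as separate variables makes it easy to see how the per-bigram estimate changes if a different count of seen bigrams is substituted.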
Witten-Bell smoothing:
- The probability of encountering a new type is:
V / (N + V) = 11340 / 100481 = 0.113
The probability of encountering a new type given a specific previous word w0 is:
prob(w'|w0) = T(w0) / (N(w0) + T(w0))
To get the probability of encountering a new type following
the, we first need T(the), the number of distinct word types
that follow the. This is the number of bigram types
whose first word is the:
[wsj]$ grep '^the ' ws950228.bigram | wc
1650 4950 23470
So in our corpus there are 1650 distinct words
following the, so T(the) = 1650.
T(the) = 1650
N(the) = c(the) = 4071
Then the probability of a new word following
the is:
prob(w'|the) = 1650 / (4071 + 1650) = 0.288
- Percentage of hapax legomena bigram types: 79.917%
- Percentage of bigram types that occur 5 times or fewer: 96.96%
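Both Witten-Bell quantities above are instances of the same ratio T / (N + T), once over the whole corpus and once restricted to the contexts of a single word. A minimal Python 3 sketch with the counts from this section; the function name new_type_prob is illustrative:

```python
def new_type_prob(n_types, n_tokens):
    """Witten-Bell probability of seeing a new type: T / (N + T)."""
    return n_types / (n_tokens + n_types)


# Whole corpus: T = V = 11340 types, N = 89141 tokens.
p_new_overall = new_type_prob(11340, 89141)

# After "the": T(the) = 1650 distinct following words, N(the) = c(the) = 4071.
p_new_after_the = new_type_prob(1650, 4071)

print("P(new type)       = %.3f" % p_new_overall)
print("P(new type | the) = %.3f" % p_new_after_the)
```

The conditional estimate is larger than the corpus-wide one because a frequent function word such as the is followed by many different words relative to its own count.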