Final Bigram Problem:

Unsmoothed probabilities

  1. P(the|in) = c(in,the)/c(in) = 324 / 1476 = 0.219
  2. P(in|the) = c(the,in)/c(the) = 0 / 4071 = 0
  3. P(in the) = c(in,the) / N = 324 / 89141 = 0.0036
  4. P(S-T-A-R-T The) = c(S-T-A-R-T,The) / N = 414 / 89141 = 0.0046
  5. P(to|the) = c(the,to) / c(the) = 0 / 4071 = 0
  6. P(the government) = c(the,government) / N = 25 / 89141 = 0.00028
  7. P(billion in) = c(billion,in) / N = 25 / 89141 = 0.00028
  8. P(in|billion) = c(billion,in) / c(billion) = 25 / 108 = 0.231
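
The eight values above follow two patterns: a conditional probability divides the bigram count by the count of the first word, and a joint probability divides it by the corpus size N. Here is a minimal Python sketch of both, using the counts quoted above. The function names are illustrative and not part of the original solution.

```python
N = 89141  # corpus size in tokens, as reported by wc


def conditional_mle(bigram_count, first_word_count):
    """Unsmoothed P(w2|w1) = c(w1,w2) / c(w1)."""
    return bigram_count / first_word_count


def joint_mle(bigram_count, n=N):
    """Unsmoothed P(w1 w2) = c(w1,w2) / N."""
    return bigram_count / n


if __name__ == "__main__":
    print("P(the|in)     =", conditional_mle(324, 1476))
    print("P(in the)     =", joint_mle(324))
    print("P(in|billion) =", conditional_mle(25, 108))
```

A bigram with count 0, such as (the, in), gets probability exactly 0 under either formula, which is the problem smoothing addresses.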

Smoothed probabilities:

Add-one smoothing:

The size of the corpus is given by:

[wsj]$ wc ws950228.1line
  5155  89141 481254 ws950228.1line
Here wc is the "word count" utility in Unix. It gives line, word, and character count. So
N = 89141

The following Python snippet gets the same answer:

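>>> import os
>>> os.chdir('/var/www/html/wsj/')
>>> ct = 0
>>> with open('ws950228.wfreq', 'r') as fh:
...     for line in fh:
...         words = line.strip().split()
...         # Only lines with exactly two fields (word, frequency) are counted.
...         if len(words) == 2:
...             try:
...                 ct += int(words[1])
...             except ValueError:
...                 pass  # skip lines whose second field is not a number
...
>>> ct
89141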

The size of the vocabulary is the number of types in the corpus. One way to compute this is to count the number of lines in ws950228.wfreq.sorted.

[gawron@bulba wsj]$ wc ws950228.wfreq.sorted 
 11340  22680 116124 ws950228.wfreq.sorted
Here 11340 is the number of lines, 22680 the number of words and 116124 the number of characters. So
V = 11340
  1. P(the|in) = (c(in,the) + 1) / (c(in) + V) = (324 + 1) / (1476 + 11340) = 0.025
  2. P(in|the) = (c(the,in) + 1) / (c(the) + V) = (0 + 1) / (4071 + 11340) = 6.489e-05
  3. P(in the) = (c(in,the) + 1) / (N + V^2) = (324 + 1) / (89141 + 11340^2) = 2.526e-06
  4. P(S-T-A-R-T The) = (c(S-T-A-R-T,The) + 1) / (N + V^2) = (414 + 1) / (89141 + 11340^2) = 3.225e-06
  5. P(to|the) = (c(the,to) + 1) / (c(the) + V) = (0 + 1) / (4071 + 11340) = 6.489e-05
  6. P(the government) = (c(the,government) + 1) / (N + V^2) = (25 + 1) / (89141 + 11340^2) = 2.02e-07
  7. P(billion in) = (c(billion,in) + 1) / (N + V^2) = (25 + 1) / (89141 + 11340^2) = 2.02e-07
  8. P(in|billion) = (c(billion,in) + 1) / (c(billion) + V) = (25 + 1) / (108 + 11340) = 0.00227
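
The add-one figures can be checked with a short Python sketch. It assumes, as the calculations above do, that a conditional estimate adds V to the denominator (one extra count per possible next word) and a joint estimate adds V squared (one extra count per possible bigram). The function names are illustrative.

```python
N = 89141  # tokens in the corpus
V = 11340  # word types in the corpus


def add_one_conditional(bigram_count, first_word_count, v=V):
    """Add-one P(w2|w1) = (c(w1,w2) + 1) / (c(w1) + V)."""
    return (bigram_count + 1) / (first_word_count + v)


def add_one_joint(bigram_count, n=N, v=V):
    """Add-one P(w1 w2) = (c(w1,w2) + 1) / (N + V^2)."""
    return (bigram_count + 1) / (n + v ** 2)


if __name__ == "__main__":
    print(f"P(the|in)     = {add_one_conditional(324, 1476):.3f}")
    print(f"P(in|the)     = {add_one_conditional(0, 4071):.3e}")
    print(f"P(in the)     = {add_one_joint(324):.3e}")
    print(f"P(in|billion) = {add_one_conditional(25, 108):.5f}")
```

Comparing with the unsmoothed values shows how much mass add-one moves away from seen bigrams: P(the|in) drops from about 0.22 to about 0.025.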

Good-Turing questions: The first few questions were answered by inspecting the file ws950228.bigram.sorted.

  1. What is the frequency of frequency 1 for bigrams (the number of bigrams that occur 1 time)? 41,162
      line no first freq 1 bigram 1
      line no last freq 1 bigram 41,162
      Count 41,162
    1. What is the frequency of frequency 2?
        line no last freq 2 bigram 46,780
        line no last freq 1 bigram 41,162
        Difference 5618
    2. What is the discounted count (c*) of the bigram same problem?
        The bigram same problem occurs on line 29981 and has freq 1. Using Eq 4.26 from J&M, section 4.5.2:
          c* = (c + 1) * N_{c+1} / N_c
        Plugging in the numbers:
          c*(same problem) = 2 * 5618/41162 = 0.273
    3. What is the re-estimated Good-Turing probability of the bigram same bank? (I am asking for P(same bank), not P(bank | same)).
        This bigram has count 0. We estimate the prob using Eqn 4.27 for the missing mass (the amount of prob reserved for things that occurred 0 times):
          P*GT = N1/N
        Intuitively this is proportion of times we saw something never seen before, so it is the probability that the N+1st bigram will be something unseen. So:
          P*GT = 41162/89141 = .4618
        We divide this among the number of unseen bigrams, vocabulary size squared - seen bigrams
          So the GT estimate for the probability of any unseen bigram is
            .4618/128,506,459 = 3.594 * 10^-9
          So in particular this is the GT estimate for our given bigram:
            P*GT(same bank) = 3.594 * 10^-9
        1. What is the number of unseen bigrams? (assuming that V is just the words seen in the corpus). This is the figure used above: 128,506,459.
          11340 * 11340 = 128,595,600
          128,595,600 - 89141 = 128,506,459
        2. What percentage of the bigram types are hapax legomena (have just one occurrence)?
          Num lines sorted bigrams = 51,506
          Num hapax legomena = 41,162
          Percentage hapax legomena = 41,162 / 51,506 = 79.9%, or about 80%
      1. What percentage of the bigram types occur 5 times or fewer?
          First 6 occurrence bigram in sorted bigram file is '000 or' on line 49942, so:
            49942/51506 = .9696 = 96.96 %
          of the bigram types occur 5 times or fewer in this 89,141-word sample.
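
The Good-Turing quantities above can be reproduced with a small Python sketch. The answers were read off line numbers in the sorted bigram file. The helper freq_of_freqs shows how the same table could be tallied directly, under the assumption (not confirmed by the text) that each line of the bigram file has the form "w1 w2 count". All function and constant names are illustrative, and the constants are the figures quoted in the answers above.

```python
from collections import Counter


def freq_of_freqs(lines):
    """Map each count c to N_c, the number of bigram types seen c times.

    Assumes each line looks like 'w1 w2 count' (a hypothetical format).
    """
    table = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            table[int(parts[2])] += 1
    return table


def gt_discounted_count(c, n_c, n_c_plus_1):
    """Good-Turing c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * n_c_plus_1 / n_c


# Figures taken from the worked answers above.
N = 89141              # corpus size used as the denominator
N1, N2 = 41162, 5618   # frequency of frequency 1 and 2
UNSEEN = 128506459     # number of unseen bigrams, as computed in the text
BIGRAM_TYPES = 51506   # lines in the sorted bigram file

c_star = gt_discounted_count(1, N1, N2)  # discounted count of a freq-1 bigram
missing_mass = N1 / N                    # probability reserved for unseen bigrams
p_unseen = missing_mass / UNSEEN         # share for any single unseen bigram
hapax_share = N1 / BIGRAM_TYPES          # fraction of types seen exactly once

if __name__ == "__main__":
    print(f"c*(same problem) = {c_star:.3f}")
    print(f"missing mass     = {missing_mass:.4f}")
    print(f"P*GT(same bank)  = {p_unseen:.3e}")
    print(f"hapax share      = {hapax_share:.3%}")
```

For example, freq_of_freqs(["a b 1", "b c 1", "c d 2"]) yields N_1 = 2 and N_2 = 1.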


      Witten-Bell smoothing:

      1. The probability of encountering a new type is:
          V / (N + V) = 11340 / 100481 = 0.113
        The probability of encountering a new type given a specific previous word w0 is:
          prob(w'|w0) = T(w0) / (N(w0) + T(w0))

        To get the probability of encountering a new type following "the", we first need to count the number of distinct words that follow "the", which is the number of bigram types whose first word is "the":

          [wsj]$ grep '^the ' ws950228.bigram | wc
             1650    4950   23470
        So in our corpus there are 1650 distinct words following "the":
        T(the) = 1650
        N(the) = c(the) = 4071
        
        Then the probability of a new word following "the" is:
        prob(w'|the) = 1650 / (4071 + 1650) = 0.288
        
      2. Percentage of hapax legomena bigram types: 79.917%
      3. Percentage of bigram types that occur 5 times or fewer: 96.96%
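
The Witten-Bell new-type probabilities above reduce to one formula, T / (N + T), applied either to the whole corpus or to the contexts of a single word. A minimal Python sketch follows. count_followers mirrors the grep command and assumes (hypothetically) that each bigram line has the form "w1 w2 count". Both function names are illustrative.

```python
def new_type_prob(n_types, n_tokens):
    """Witten-Bell probability of a new type: T / (N + T)."""
    return n_types / (n_tokens + n_types)


def count_followers(lines, word):
    """Count distinct words following `word`.

    Assumes each line looks like 'w1 w2 count' (a hypothetical format).
    """
    followers = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == word:
            followers.add(parts[1])
    return len(followers)


if __name__ == "__main__":
    # Corpus-wide: V = 11340 types, N = 89141 tokens.
    print(f"new type anywhere    = {new_type_prob(11340, 89141):.3f}")
    # After "the": T(the) = 1650, N(the) = 4071.
    print(f"new type after 'the' = {new_type_prob(1650, 4071):.3f}")
```

A word like "the" is followed by many different words relative to its own count, so its new-type probability comes out well above the corpus-wide figure.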