The noisy channel model

  1. Problem 1: Spelling correction. We find the form acress. This could be a misspelling of any of a number of different words:
    1. actress
    2. cress
    3. caress
    4. access
    5. across
    6. acres
    Which one do we choose? Or, if we want to offer a list of candidate corrections, how do we rank the elements of that list?
  2. Problem 2: Speech recognition. We have an acoustic signal, suitably digitized and analyzed into a set of features for each [suitably short] span of signal. This acoustic signal could correspond to any word in the dictionary. Which one do we choose? Or, if we want a list of possibilities, how do we rank the elements of the list?

    Components of the noisy channel model

    1. An observation O
    2. A word w to which it corresponds
    3. A noisy channel C which may distort w.
    4. A decoder responsible for finding the most likely w given O. Note: because of noise, the same O may potentially correspond to many different w.

    The situations:

    1. Speech recognition. O is an acoustic signal. The word w is the word the speaker actually uttered.
    2. Orthography. O is a sequence of letters (possibly misspelled). The word w is the word the writer actually intended.

    We use a probabilistic model.
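    Concretely, the decoder picks the word w that maximizes the posterior probability of w given the observation O. By Bayes' rule, and because P(O) is the same for every candidate word:

    ```latex
    \hat{w} = \arg\max_{w} P(w \mid O)
            = \arg\max_{w} \frac{P(O \mid w)\,P(w)}{P(O)}
            = \arg\max_{w} P(O \mid w)\,P(w)
    ```

    Here P(O | w) is the channel (likelihood) model and P(w) is the prior (message) model.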


    Spelling Correction

    Different kinds of spelling errors (typographic versus cognitive). Different sources (OCR versus human).

    An OCR example:

    Two approaches
    1. Channel modeling: Build a model of how signals are realized given the properties of the channel. (an error or distortion model)
    2. Message modeling (Build a model of what messages look like)
    Both models can be probabilistic.

    Speech recognition:

    Spelling correction

    We apply the Bayesian method to the problem of spelling correction first using an error model.

    1. Observation = t (for "typo")
    2. Word = c (for "correction"):

    We bring in our probabilistic model:

    We need two models:

    1. Likelihood model: P(t | c)
    2. Prior model: P(c)
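    A minimal sketch of the resulting ranking in Python. The likelihoods and the prior for "across" are invented placeholders for illustration, not values from the corpus used below:

    ```python
    # Noisy channel ranking: score each candidate correction c of the
    # typo t = "acress" by P(t | c) * P(c) and sort best-first.
    # All numbers below are hypothetical.
    prior = {          # P(c)
        "actress": 0.0000315,
        "caress":  0.0000001,
        "across":  0.0000614,
    }
    likelihood = {     # P(t | c): chance that "acress" arises from c
        "actress": 0.000117,
        "caress":  0.00000144,
        "across":  0.000093,
    }

    def rank(candidates):
        return sorted(candidates,
                      key=lambda c: likelihood[c] * prior[c],
                      reverse=True)
    ```

    With these (made-up) numbers, a frequent word with a plausible error can outrank a candidate whose surface form looks closer to the typo.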

    P(c) Model  

    We build a frequency table to get the priors (dividing counts by 44 million to get P(c)):

      c        freq(c)   p(c)
      actress  1343      .0000315
      cress    0         0
      caress   4         .0000001
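    The priors can be computed directly from the frequency table. The add-half smoothing below, which keeps zero-frequency "cress" from being assigned probability exactly 0, is our own assumption, not part of the original table:

    ```python
    # Priors from raw corpus frequencies: p(c) = freq(c) / N,
    # with N = 44 million words, plus assumed add-half smoothing.
    N = 44_000_000
    freq = {"actress": 1343, "cress": 0, "caress": 4}

    def prior(c, k=0.5):
        # k = 0 reproduces the unsmoothed freq(c) / N estimate
        return (freq[c] + k) / (N + k * len(freq))
    ```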

    P(t | c) Model


    For the P(t | c) model, we first classify errors by type.

    We estimate P(t | c) using four confusion matrices, one each for deletion, insertion, substitution, and transposition.

    Using [ ... ] for the number of times that ... happened:

    1. del[x,y] = [xy => x]
    2. ins[x,y] = [x => xy]
    3. sub[x,y] = [x => y]
    4. trans[x,y] = [xy => yx]

    For a deletion, for example, we estimate P(t | c) as follows:

      P(x | xy) = del[x,y] ÷ count(xy)
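    A sketch of this estimate for the deletion case; the counts below are invented for illustration:

    ```python
    # Channel model for a single deletion error:
    # P(typo | correct) = del[x, y] / count(xy).
    # Both count tables here are hypothetical.
    del_counts = {("a", "c"): 10}   # times correct "ac" was typed as "a"
    bigram_counts = {"ac": 5000}    # times "ac" occurred in the training text

    def p_del(x, y):
        # probability that the letter y is dropped when it follows x
        return del_counts.get((x, y), 0) / bigram_counts[x + y]
    ```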

    Table of final probabilities


    Minimum Edit Distance

    Notice that in order to apply our probability model, we need to take the form acress and, for each correction c in our list of candidates, compute the exact sequence of editing steps that takes us from acress to c.

    This can be thought of as a search for the best possible alignment of two strings. Consider a non-trivial case: aligning the words execution and intention to maximize the amount of overlap.

      Alignment trace

    Or think of Unix diff on intention.txt and execution.txt

    % diff intention.txt execution.txt 
    1,3d0
    < i
    < n
    < t
    5c2,5
    < n
    ---
    > x
    > e
    > c
    > u
    
    This corresponds to the following alignment:
    i n t e n _ _ _ t i o n
    _ _ _ e x e c u t i o n
    

    Idea: Finding the shortest possible edit distance between words corresponds to finding the best possible alignment.
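    The diff analogy can be run directly: Python's standard difflib recovers one alignment of the two words. Its heuristic (longest matching blocks) does not necessarily find the minimum-cost alignment, so treat this as a sketch:

    ```python
    import difflib

    # One alignment of "intention" and "execution", expressed as
    # edit opcodes: 'equal', 'replace', 'delete', or 'insert'.
    a, b = "intention", "execution"
    sm = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        print(f"{tag:8s} {a[i1:i2]!r} -> {b[j1:j2]!r}")
    ```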

    Note: One can also do spelling correction without using the probabilistic model and just using edit distance.

      Just assign a cost (possibly the same cost) to each editing operation and add the costs up; that gives the cost of the misspelled target under the best alignment with a candidate source. The candidate that is the shortest editing distance away is the winner.

    But either way: you have to solve the best alignment problem.
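    A sketch of this non-probabilistic approach, using unit costs for all three operations and the acress candidate list from above:

    ```python
    from functools import lru_cache

    def distance(t, s):
        # Unit-cost edit distance between target t and source s,
        # computed by memoized recursion over string prefixes.
        @lru_cache(maxsize=None)
        def d(i, j):
            if i == 0:
                return j
            if j == 0:
                return i
            return min(d(i - 1, j) + 1,                        # delete
                       d(i, j - 1) + 1,                        # insert
                       d(i - 1, j - 1) + (t[i-1] != s[j-1]))   # subst / match
        return d(len(t), len(s))

    def correct(typo, candidates):
        # Winner = candidate at the smallest edit distance (ties -> first).
        return min(candidates, key=lambda c: distance(typo, c))

    correct("acress", ["actress", "cress", "caress", "access", "across", "acres"])
    ```

    Note that several candidates tie at distance 1, which is exactly why the probabilistic model is useful for ranking.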

    Edit Distance between two strings T and S

      The smallest number of individual editing operations needed to transform T into S

    Minimum editing distance between T and S.

    1. First assign a cost to each editing operation
        Alternative Levenshtein Distance (ALD): The operations insertion and deletion cost 1. Substitution (a combined deletion and insertion) costs 2, except in the case of substituting a character for itself, which costs 0.
      Note: We could put the probabilities in here, thus combining the alignment computation with the probability computation, using the model for Prob(T | S) sketched above. For simplicity we use ALD. Combining the probabilities with the alignment calculation could give different results than doing the two calculations separately. How?
    2. The ALD between T and S is the minimum cost possible for a sequence of editing operations that transform T into S.

    Essential intuition of the algorithm. Each path from T to S goes through a sequence of intermediate strings. If Si lies on the optimal path P from T to S, then the sequence of edits leading from T to Si must also be optimal.

    We now need a way of computing the shortest distance between any target and any source: shortest_distance(t, s).

    1. We let concatenation be represented by "+":
      • "inte" + "n" = "inte" concatenated with "n" = "inten"
      • "exec" + "u" = "exec" concatenated with "u" = "execu"
    2. We give an inductive definition of path cost.
      1. shortest_distance(w, w) = 0
          [special case: shortest_distance(eps,eps)=0]
      2. PathCost(w, w'+x) = shortest_distance(w,w') + ins-cost(w, w'+x)
      3. PathCost(w+x,w') = shortest_distance(w,w') + del-cost(w+x,w')
      4. PathCost(w+x,w'+y) = shortest_distance(w,w') + subst-cost(w+x,w'+y)
    3. Example: inten, execu
      1. PathCost(inten, exec+u)= shortest_distance(inten,exec) + ins-cost(inten, execu)
      2. PathCost(inte+n,execu) = shortest_distance(inte,execu) + del-cost(inten,execu)
      3. PathCost(inte+n,exec+u) = shortest_distance(inte,exec) + subst-cost(inten,execu)
    4. We define shortest_distance as the Minimum path cost.
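    The inductive definition translates directly into a bottom-up dynamic program; a sketch using the ALD costs (insert/delete 1, substitute 2, match 0):

    ```python
    def ald(target, source):
        # d[i][j] = minimum cost of transforming target[:i] into source[:j]
        n, m = len(target), len(source)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i                        # i deletions
        for j in range(1, m + 1):
            d[0][j] = j                        # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if target[i-1] == source[j-1] else 2
                d[i][j] = min(d[i-1][j] + 1,       # del-cost
                              d[i][j-1] + 1,       # ins-cost
                              d[i-1][j-1] + sub)   # subst-cost
        return d[n][m]

    ald("intention", "execution")   # the running example; ALD = 8
    ```

    The inner min is exactly the three PathCost clauses above, and the table d is the matrix filled in by the worked computation below.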

    Computation of a minimum edit distance

              Target
      n   2 (insert)   3 (insert)   4
      i   1 (insert)   2 (subst)    3 (del)
      #   0            1 (del)      2 (del)
          #            e            x        Source

    Different alignments, different costs:
    target  i r t i o n
    source  i _ t i o n   cost 1
    source  i t i o n _   cost 6

    The algorithm