The noisy channel model

  1. Problem 1: Spelling correction. We find the form acress. This could be a misspelling of any of a number of different words:
    1. actress
    2. cress
    3. caress
    4. access
    5. across
    6. acres
    Which one do we choose? Or, if we want to offer a list of candidate corrections, how do we rank the elements of that list?
  2. Problem 2: Speech recognition. We have an acoustic signal, suitably digitized and analyzed into a set of features for each [suitably short] span of signal. This acoustic signal could correspond to any word in the dictionary. Which one do we choose? Or, if we want a list of possibilities, how do we rank the elements of the list?

    Components of the noisy channel model

    1. An observation O
    2. A word w to which it corresponds
    3. A noisy channel C which may distort w.
    4. A decoder responsible for finding the most likely w given O. Note: because of noise, the same O may potentially correspond to many different w.

    The situations:

    1. Speech recognition. O is an acoustic signal. The word w is the word the speaker actually uttered.
    2. Orthography. O is a sequence of letters (possibly misspelled). The word w is the word the writer actually intended.

    We use a probabilistic model.
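    Concretely, the decoder picks the word w that maximizes the posterior probability of w given the observation O. By Bayes' rule, and because P(O) is the same for every candidate word:

    ```latex
    \hat{w} = \arg\max_{w} P(w \mid O)
            = \arg\max_{w} \frac{P(O \mid w)\,P(w)}{P(O)}
            = \arg\max_{w} P(O \mid w)\,P(w)
    ```

    Here P(O | w) is the channel (likelihood) model and P(w) is the prior (message) model.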


    Spelling Correction

    Different kinds of spelling errors (typographic versus cognitive). Different sources (OCR versus human).

    An OCR example:

    Two approaches
    1. Channel modeling: Build a model of how signals are realized given the properties of the channel. (an error or distortion model)
    2. Message modeling (Build a model of what messages look like)
    Both models can be probabilistic.

    Speech recognition:

    Spelling correction

    We apply the Bayesian method to the problem of spelling correction first using an error model.

    1. Observation = t (for "typo")
    2. Word = c (for "correction"):

    We bring in our probabilistic model:

    We need two models:

    1. Likelihood model: P(t | c)
    2. Prior model: P(c)
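    A minimal sketch of the resulting ranking in Python. The likelihoods and the prior for "across" are invented placeholders for illustration, not values from the corpus used below:

    ```python
    # Noisy channel ranking: score each candidate correction c of the
    # typo t = "acress" by P(t | c) * P(c) and sort best-first.
    # All numbers below are hypothetical.
    prior = {          # P(c)
        "actress": 0.0000315,
        "caress":  0.0000001,
        "across":  0.0000614,
    }
    likelihood = {     # P(t | c): chance that "acress" arises from c
        "actress": 0.000117,
        "caress":  0.00000144,
        "across":  0.000093,
    }

    def rank(candidates):
        return sorted(candidates,
                      key=lambda c: likelihood[c] * prior[c],
                      reverse=True)
    ```

    With these (made-up) numbers, a frequent word with a plausible error can outrank a candidate whose surface form looks closer to the typo.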

    P(c) Model  

    We build a frequency table to get the priors (dividing counts by 44 million to get P(c)):

      c        freq(c)   p(c)
      actress  1343      .0000315
      cress    0         0
      caress   4         .0000001
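    The priors can be computed directly from the frequency table. The add-half smoothing below, which keeps zero-frequency "cress" from being assigned probability exactly 0, is our own assumption, not part of the original table:

    ```python
    # Priors from raw corpus frequencies: p(c) = freq(c) / N,
    # with N = 44 million words, plus assumed add-half smoothing.
    N = 44_000_000
    freq = {"actress": 1343, "cress": 0, "caress": 4}

    def prior(c, k=0.5):
        # k = 0 reproduces the unsmoothed freq(c) / N estimate
        return (freq[c] + k) / (N + k * len(freq))
    ```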

    P(t | c) Model


    For the P(t | c) model, we first classify errors by type.

    We estimate P(t | c) using four confusion matrices, one each for deletion, insertion, substitution, and transposition.

    Using [ ... ] for the number of times that ... happened:

    1. del[x,y] = [xy => x]
    2. ins[x,y] = [x => xy]
    3. sub[x,y] = [x => y]
    4. trans[x,y] = [xy => yx]

    For a deletion, for example, we estimate P(t | c) as follows:

      P(x | xy) = del[x,y] ÷ count(xy)
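    A sketch of this estimate for the deletion case; the counts below are invented for illustration:

    ```python
    # Channel model for a single deletion error:
    # P(typo | correct) = del[x, y] / count(xy).
    # Both count tables here are hypothetical.
    del_counts = {("a", "c"): 10}   # times correct "ac" was typed as "a"
    bigram_counts = {"ac": 5000}    # times "ac" occurred in the training text

    def p_del(x, y):
        # probability that the letter y is dropped when it follows x
        return del_counts.get((x, y), 0) / bigram_counts[x + y]
    ```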

    Table of final probabilities


    Minimum Edit Distance

    Notice that in order to apply our probability model, we need to take the form acress and, for each correction c in our list of candidates, compute the exact sequence of editing steps that takes us from acress to c.

    This can be thought of as a search for the best possible alignment of two strings. Consider a non-trivial case: aligning the words execution and intention to maximize the amount of overlap.

      Alignment trace

    Or think of Unix diff on intention.txt and execution.txt

    % diff intention.txt execution.txt 
    1,3d0
    < i
    < n
    < t
    5c2,5
    < n
    ---
    > x
    > e
    > c
    > u
    
    This corresponds to the following alignment:
    i n t e n _ _ _ t i o n
    _ _ _ e x e c u t i o n
    

    Idea: Finding the shortest possible edit distance between words corresponds to finding the best possible alignment.
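    The diff analogy can be run directly: Python's standard difflib recovers one alignment of the two words. Its heuristic (longest matching blocks) does not necessarily find the minimum-cost alignment, so treat this as a sketch:

    ```python
    import difflib

    # One alignment of "intention" and "execution", expressed as
    # edit opcodes: 'equal', 'replace', 'delete', or 'insert'.
    a, b = "intention", "execution"
    sm = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        print(f"{tag:8s} {a[i1:i2]!r} -> {b[j1:j2]!r}")
    ```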

    Note: One can also do spelling correction without using the probabilistic model and just using edit distance.

      Just assign a cost (possibly the same cost) to each editing operation and add the costs up; that gives the cost of the misspelled target under the best alignment with a candidate source. The candidate that is the shortest editing distance away is the winner.

    But either way: you have to solve the best alignment problem.
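    A sketch of this non-probabilistic approach, using unit costs for all three operations and the acress candidate list from above:

    ```python
    from functools import lru_cache

    def distance(t, s):
        # Unit-cost edit distance between target t and source s,
        # computed by memoized recursion over string prefixes.
        @lru_cache(maxsize=None)
        def d(i, j):
            if i == 0:
                return j
            if j == 0:
                return i
            return min(d(i - 1, j) + 1,                        # delete
                       d(i, j - 1) + 1,                        # insert
                       d(i - 1, j - 1) + (t[i-1] != s[j-1]))   # subst / match
        return d(len(t), len(s))

    def correct(typo, candidates):
        # Winner = candidate at the smallest edit distance (ties -> first).
        return min(candidates, key=lambda c: distance(typo, c))

    correct("acress", ["actress", "cress", "caress", "access", "across", "acres"])
    ```

    Note that several candidates tie at distance 1, which is exactly why the probabilistic model is useful for ranking.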

    Edit Distance between two strings T and S

      The smallest number of individual editing operations needed to transform T into S

    Minimum editing distance between T and S.

    1. First assign a cost to each editing operation
        Alternative Levenshtein Distance (ALD): The operations insertion and deletion cost 1. Substitution (a combined deletion and insertion) costs 2, except in the case of substituting a character for itself, which costs 0.
      Note: We could put the probabilities in here, thus combining the alignment computation with the probability computation, using the model for Prob(T | S) sketched above. For simplicity we use ALD. Combining the probabilities with the alignment calculation could give different results than doing the two calculations separately. How?
    2. The ALD between T and S is the minimum cost possible for a sequence of editing operations that transform T into S.

    Essential intuition of the algorithm. Each path from T to S goes through a sequence of intermediate strings. If Si lies on the optimal path P from T to S, then the sequence of edits leading from T to Si must also be optimal.

    We now need a way of computing the shortest distance between any target and any source: shortest_distance(t, s).

    1. We let concatenation be represented by "+":
      • "inte" + "n" = "inte" concatenated with "n" = "inten"
      • "exec" + "u" = "exec" concatenated with "u" = "execu"
    2. We give an inductive definition of path cost.
      1. shortest_distance(w, w) = 0
          [special case: shortest_distance(eps,eps)=0]
      2. PathCost(w, w'+x) = shortest_distance(w,w') + ins-cost(w, w'+x)
      3. PathCost(w+x,w') = shortest_distance(w,w') + del-cost(w+x,w')
      4. PathCost(w+x,w'+y) = shortest_distance(w,w') + subst-cost(w+x,w'+y)
    3. Example: inten, execu
      1. PathCost(inten, exec+u)= shortest_distance(inten,exec) + ins-cost(inten, execu)
      2. PathCost(inte+n,execu) = shortest_distance(inte,execu) + del-cost(inten,execu)
      3. PathCost(inte+n,exec+u) = shortest_distance(inte,exec) + subst-cost(inten,execu)
    4. We define shortest_distance as the Minimum path cost.
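    The inductive definition translates directly into a bottom-up dynamic program; a sketch using the ALD costs (insert/delete 1, substitute 2, match 0):

    ```python
    def ald(target, source):
        # d[i][j] = minimum cost of transforming target[:i] into source[:j]
        n, m = len(target), len(source)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i                        # i deletions
        for j in range(1, m + 1):
            d[0][j] = j                        # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if target[i-1] == source[j-1] else 2
                d[i][j] = min(d[i-1][j] + 1,       # del-cost
                              d[i][j-1] + 1,       # ins-cost
                              d[i-1][j-1] + sub)   # subst-cost
        return d[n][m]

    ald("intention", "execution")   # the running example; ALD = 8
    ```

    The inner min is exactly the three PathCost clauses above, and the table d is the matrix filled in by the worked computation below.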

    Computation of a minimum edit distance

              Target
      n   2 (insert)   3 (insert)   4
      i   1 (insert)   2 (subst)    3 (del)
      #   0            1 (del)      2 (del)
          #            e            x        Source

    Different alignments, different costs:
    target  i r t i o n
    source  i _ t i o n   cost 1
    source  i t i o n _   cost 6

    The algorithm