"""
Compute the minimum edit distance between two strings, C{source} and
C{target}, using Levenshtein distance. Return a triple consisting of
a viterbi table, a paths table, and a list of pairs. The list of
pairs represents the string alignment that results from following the
minimal edit path from C{source} to C{target}. Also print the
alignment corresponding to that list of pairs, together with the
Levenshtein edit distance for that edit path.

The two tables returned are:

- viterbi: a table in which each cell (i,j) contains the
  cheapest edit cost, an int, for aligning target[:i+1]
  with source[:j+1]. Corresponds to the table in Figure
  3.27 in Chapter 3 of the second edition of J&M. Note that the
  corresponding figure (5.6) in the first edition contains
  errors that have since been fixed.

- paths: a table in which each cell (i,j) contains the (row,col)
  pair for the predecessor cell in the cheapest edit path to (i,j).

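As a rough illustration of how two such tables can be filled in, here
is a minimal sketch. It is not the code of the align module: the name
C{edit_tables} and its cost parameters are hypothetical, cell (i,j) is
assumed to cover the first i characters of the target and the first j
characters of the source, and the substitution cost of 2 is an
assumption chosen to agree with the total of 8 reported for
'execution' and 'intention'. Ties are broken in favor of the diagonal
step, which is only one of several valid choices.

```python
def edit_tables(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    # Hypothetical helper, not part of the align module.
    # vit[i][j]: cheapest cost of aligning target[:i] with source[:j].
    # paths[i][j]: predecessor cell on one cheapest path to (i, j).
    rows, cols = len(target) + 1, len(source) + 1
    vit = [[0] * cols for _ in range(rows)]
    paths = [[None] * cols for _ in range(rows)]

    # First column: only target characters consumed (insertions).
    for i in range(1, rows):
        vit[i][0] = vit[i - 1][0] + ins_cost
        paths[i][0] = (i - 1, 0)
    # First row: only source characters consumed (deletions).
    for j in range(1, cols):
        vit[0][j] = vit[0][j - 1] + del_cost
        paths[0][j] = (0, j - 1)

    for i in range(1, rows):
        for j in range(1, cols):
            step = 0 if target[i - 1] == source[j - 1] else sub_cost
            candidates = [
                (vit[i - 1][j - 1] + step, (i - 1, j - 1)),  # match/substitution
                (vit[i - 1][j] + ins_cost, (i - 1, j)),      # insertion
                (vit[i][j - 1] + del_cost, (i, j - 1)),      # deletion
            ]
            # min() returns the first minimal candidate, so the diagonal
            # step wins ties here; other tie-breaking rules are equally valid.
            vit[i][j], paths[i][j] = min(candidates, key=lambda c: c[0])
    return vit, paths


vit, paths = edit_tables('execution', 'intention')
print(vit[9][9])    # 8
print(paths[9][9])  # (8, 8)
```
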
The algorithm used follows exactly the code given in J&M
for Viterbi alignment
(U{pseudocode<http://www-rohan.sdsu.edu/~gawron/compling/chap5/fig05.05.pdf>}).
A set of example target/source pairs is provided at the end of the
align module. Here are some results:

>>> (target0,source0) = ('execution','intention')
>>> (vit0,paths0,pairs0) = align(target0,source0)
i n t e 0 n t i o n
0 e x e c u t i o n
- - - - - - - - - -
1 2 2 0 1 2 0 0 0 0
Total: 8
>>> vit0
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[1, 2, 3, 4, 5, 6, 7, 6, 7, 8],
[2, 3, 4, 5, 6, 7, 8, 7, 8, 7],
[3, 4, 5, 6, 7, 8, 7, 8, 9, 8],
[4, 3, 4, 5, 6, 7, 8, 9, 10, 9],
[5, 4, 5, 6, 7, 8, 9, 10, 11, 10],
[6, 5, 6, 7, 8, 9, 8, 9, 10, 11],
[7, 6, 7, 8, 9, 10, 9, 8, 9, 10],
[8, 7, 8, 9, 10, 11, 10, 9, 8, 9],
[9, 8, 9, 10, 11, 12, 11, 10, 9, 8]]

>>> pairs0
[['0', 'i'], ['e', 'n'], ['x', 't'], ['e', 'e'], ['c', '0'], ['u', 'n'], ['t', 't'], ['i', 'i'], ['o', 'o'], ['n', 'n']]
>>> paths0[8][8]
(7, 7)
>>> paths0[7][7]
(6, 6)
>>> paths0[6][6]
(5, 5)
>>> paths0[5][5]
(4, 4)
>>> paths0[4][4]
(3, 4)
>>> target0[4]
'u'
>>> source0[4]
'n'
>>> target0[3]
'c'
>>> paths0[3][4]
(2, 3)

The scores in a correct viterbi table should exactly match
those shown above for this example, since the scores above
are the minimal Levenshtein edit distances for the prefixes of
these two words. The contents of the paths table may differ,
however, since there is generally more than one
shortest edit path; in fact, for this example,
there are many edit paths that achieve the cheapest edit
score of 8. All of them share the property of aligning
one of the 'e's in 'execution' with the 'e' in 'intention'.
Which edit path an implementation produces depends
on which choice it makes when several predecessors tie for
the cheapest edit path to a cell. The places where such ties
arise are all shown in Figure 3.27. For example, the lowest
8 in the shaded path corresponds to aligning
'inten' with 'execu' and, using 0-based column-row indexing,
to paths[5][5]. Since there are three
arrows in that cell, the possible
entries in a correct paths table for that cell are (4,4),
(4,5), and (5,4). The paths table in the output shown
above has chosen (4,4), which corresponds to following
the shaded path in Figure 3.27.

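To see which back-pointers are admissible for a given cell, one can
list every predecessor that reaches the cell at its minimal cost. The
sketch below does this under assumed costs of 1 for insertion and
deletion and 2 for substitution, which agree with the total of 8 shown
above. C{cost_table} and C{tied_predecessors} are hypothetical helper
names, not functions of the align module, and the first index is
assumed to run over the target.

```python
def cost_table(target, source, sub_cost=2):
    # Hypothetical helper: vit[i][j] is the cheapest cost of aligning
    # target[:i] with source[:j] (insertion and deletion cost 1).
    vit = [[0] * (len(source) + 1) for _ in range(len(target) + 1)]
    for i in range(1, len(target) + 1):
        vit[i][0] = i
    for j in range(1, len(source) + 1):
        vit[0][j] = j
    for i in range(1, len(target) + 1):
        for j in range(1, len(source) + 1):
            step = 0 if target[i - 1] == source[j - 1] else sub_cost
            vit[i][j] = min(vit[i - 1][j - 1] + step,
                            vit[i - 1][j] + 1,
                            vit[i][j - 1] + 1)
    return vit


def tied_predecessors(vit, target, source, i, j, sub_cost=2):
    # Every predecessor of an interior cell (i, j), with i, j >= 1, whose
    # step reaches (i, j) at exactly the minimal cost stored there.
    step = 0 if target[i - 1] == source[j - 1] else sub_cost
    candidates = [
        ((i - 1, j - 1), vit[i - 1][j - 1] + step),  # match/substitution
        ((i - 1, j), vit[i - 1][j] + 1),             # insertion
        ((i, j - 1), vit[i][j - 1] + 1),             # deletion
    ]
    return [cell for cell, cost in candidates if cost == vit[i][j]]


vit = cost_table('execution', 'intention')
print(tied_predecessors(vit, 'execution', 'intention', 5, 5))
# [(4, 4), (4, 5), (5, 4)]
```
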
The shaded path in Figure 3.27 is built up by starting
at the point in the paths table which aligns the entire
string C{intention} with the entire string C{execution}.
Using 0-based column-row indexing, that is C{paths[9][9]}.

>>> paths0[9][9]
(8, 8)

We then look up (8,8) in the C{paths} table:

>>> paths0[8][8]
(7, 7)

We then look up the address stored in cell (7,7), and so on, until we
reach (0,0). We are guaranteed to reach (0,0) because every edit step
recorded in the C{paths} table leads to a cell with a smaller row
index, a smaller column index, or both. This process of constructing
the path backward from the end to the beginning is called
C{backtracing}. C{pairs} is computed in parallel, while backtracing.
Which characters go into C{pairs} depends on where you are coming from.
If you find (7,7) in the paths table and you are coming from (8,8),
that is a substitution step and you add (target[7],source[7]) to
C{pairs}. If you are coming to (7,7) from (8,7), you have already seen
the character at C{source[7]}. No character is aligned twice, so this
step corresponds to aligning the character at C{target[7]} with the
empty string; using '0' to represent the empty string, you add
(target[7],'0') to C{pairs}. Thus, you have added an alignment pair
corresponding to an insertion edit.

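The backtracing procedure described above can be sketched as follows.
This is an illustration rather than the align module's own code:
C{backtrace} is a hypothetical name, the paths table is written out by
hand for the strings 'spat' and 'at' instead of being computed, and
cell (i,j) is assumed to cover the first i characters of the target
and the first j characters of the source, with '0' standing for the
empty string.

```python
def backtrace(paths, target, source):
    # Hypothetical helper: follow predecessor cells from the final cell
    # back to (0, 0), collecting one [target_char, source_char] pair per step.
    i, j = len(target), len(source)
    pairs = []
    while (i, j) != (0, 0):
        prev_i, prev_j = paths[i][j]
        if (prev_i, prev_j) == (i - 1, j - 1):
            # Diagonal step: both characters are consumed.
            pairs.append([target[i - 1], source[j - 1]])
        elif (prev_i, prev_j) == (i - 1, j):
            # Only the target character is consumed (insertion).
            pairs.append([target[i - 1], '0'])
        else:
            # Only the source character is consumed (deletion).
            pairs.append(['0', source[j - 1]])
        i, j = prev_i, prev_j
    pairs.reverse()  # the pairs were collected from the end backward
    return pairs


# Hand-written paths table holding one cheapest path for 'spat' and 'at';
# cells that are not on that path are left as None.
target, source = 'spat', 'at'
paths = [[None] * (len(source) + 1) for _ in range(len(target) + 1)]
paths[4][2] = (3, 1)  # 't' aligned with 't'
paths[3][1] = (2, 0)  # 'a' aligned with 'a'
paths[2][0] = (1, 0)  # 'p' aligned with the empty string
paths[1][0] = (0, 0)  # 's' aligned with the empty string

print(backtrace(paths, target, source))
# [['s', '0'], ['p', '0'], ['a', 'a'], ['t', 't']]
```
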
You can print the viterbi table in the format used in
Figure 6 using C{print_viterbi_table}.

>>> print_viterbi_table(vit0,'#'+target0,'#'+source0)

This will print something like the following::

       #  e  x  e  c  u  t  i  o  n
    n  09 08 09 10 11 12 11 10 09 08
    o  08 07 08 09 10 11 10 09 08 09
    i  07 06 07 08 09 10 09 08 09 10
    t  06 05 06 07 08 09 08 09 10 11
    n  05 04 05 06 07 08 09 10 11 10
    e  04 03 04 05 06 07 08 09 10 09
    t  03 04 05 06 07 08 07 08 09 08
    n  02 03 04 05 06 07 08 07 08 07
    i  01 02 03 04 05 06 07 06 07 08
    #  00 01 02 03 04 05 06 07 08 09
       #  e  x  e  c  u  t  i  o  n

Note: the version printed here sticks to the basic format in the book.
Rows are printed counting upwards, columns left to right. Column
addresses in the viterbi table come before row addresses. The value of
vit0[0][2] is 02, the value of vit0[4][1] is 05, and the value of
vit0[1][4] is 03. Note also: in comparing this example with my code,
I am treating "intention" as source and "execution" as target.

Other examples:

>>> (target1,source1) = ('spat','at')
>>> (vit1,paths1,pairs1) = align(target1,source1)
s p a t
0 0 a t
- - - -
1 1 0 0
Total: 2
>>> (target2,source2) = ('faltluence', 'flatulence')
>>> (vit2,paths2,pairs2) = align(target2,source2)
f 0 a l t 0 l u e n c e
f l a 0 t u l 0 e n c e
- - - - - - - - - - - -
0 1 0 1 0 1 0 1 0 0 0 0
Total: 4
>>> (target3,source3) = ('fluency', 'flatulence')
>>> (vit3,paths3,pairs3) = align(target3,source3)
f l 0 0 u 0 e n c y
f l a t u l e n c e
- - - - - - - - - -
0 0 1 1 0 1 0 0 0 2
Total: 5
>>> target4
'drive'
>>> source4
'brief'
>>> target5
'drive'
>>> source5
'divers'

Output for targets 4 and 5 has been suppressed since computing
these is part of the assignment.
"""