Package aligner

Source Code for Package aligner

"""
Compute the minimum edit distance between two strings, C{source} and
C{target}, using Levenshtein distance. Return a triple consisting of
a viterbi table, a paths table, and a list of pairs.  The list of
pairs represents the string alignment that results from following the
minimal edit path from C{source} to C{target}. Print out the
alignment corresponding to that list of pairs and the Levenshtein edit
distance for that edit path.

The two tables returned are:

       - viterbi: a table in which each cell (i,j) contains the
       cheapest edit cost, an int, for aligning target[:i] with
       source[:j], that is, the first i characters of target with
       the first j characters of source.  Corresponds to the table
       in Figure 3.27 in Chapter 3 of the second edition of J&M.
       Note that the corresponding figure (5.6) in the first edition
       has some errors in it which have been fixed.

       - paths: a table in which each cell (i,j) contains the (row,col)
       pair for the predecessor cell in the cheapest edit path to (i,j).
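
The two tables described above can be sketched as follows. This is only
an illustration under stated assumptions, not the module's C{align}: the
name C{edit_tables} is hypothetical, cells are assumed to be indexed
[target][source], the costs (1 for insertion and deletion, 2 for
substitution) are read off the example output in this docstring, and ties
are broken in favor of the diagonal step purely for convenience.

```python
def edit_tables(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    # Hypothetical helper: returns (viterbi, paths), both indexed [i][j],
    # where i counts characters of target and j counts characters of source.
    n, m = len(target), len(source)
    viterbi = [[0] * (m + 1) for _ in range(n + 1)]
    paths = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # Column j == 0: every target character is aligned with the empty string.
        viterbi[i][0] = viterbi[i - 1][0] + ins_cost
        paths[i][0] = (i - 1, 0)
    for j in range(1, m + 1):
        # Row i == 0: every source character is aligned with the empty string.
        viterbi[0][j] = viterbi[0][j - 1] + del_cost
        paths[0][j] = (0, j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if target[i - 1] == source[j - 1] else sub_cost
            candidates = [
                (viterbi[i - 1][j - 1] + step, (i - 1, j - 1)),  # match or substitution
                (viterbi[i - 1][j] + ins_cost, (i - 1, j)),      # target char vs. empty
                (viterbi[i][j - 1] + del_cost, (i, j - 1)),      # source char vs. empty
            ]
            # min() keeps the first of several equal costs, so the diagonal wins ties.
            viterbi[i][j], paths[i][j] = min(candidates, key=lambda c: c[0])
    return viterbi, paths
```

Under these assumptions, C{edit_tables('execution', 'intention')} yields a
viterbi table whose cell [9][9] is 8.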

The algorithm used follows exactly the code given in J&M
for Viterbi alignment (U{pseudocode<http://www-rohan.sdsu.edu/~gawron/compling/chap5/fig05.05.pdf>}).   A set of example target-source pairs
is provided at the end of the align module.  Here are some results:

     >>> (target0,source0) = ('execution','intention')
     >>> (vit0,paths0,pairs0) = align(target0,source0)
     i n t e 0 n t i o n
     0 e x e c u t i o n
     - - - - - - - - - -
     1 2 2 0 1 2 0 0 0 0
     Total: 8
     >>> vit0
     [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
     [1, 2, 3, 4, 5, 6, 7, 6, 7, 8],
     [2, 3, 4, 5, 6, 7, 8, 7, 8, 7],
     [3, 4, 5, 6, 7, 8, 7, 8, 9, 8],
     [4, 3, 4, 5, 6, 7, 8, 9, 10, 9],
     [5, 4, 5, 6, 7, 8, 9, 10, 11, 10],
     [6, 5, 6, 7, 8, 9, 8, 9, 10, 11],
     [7, 6, 7, 8, 9, 10, 9, 8, 9, 10],
     [8, 7, 8, 9, 10, 11, 10, 9, 8, 9],
     [9, 8, 9, 10, 11, 12, 11, 10, 9, 8]]

     >>> pairs0
     [['0', 'i'], ['e', 'n'], ['x', 't'], ['e', 'e'], ['c', '0'], ['u', 'n'], ['t', 't'], ['i', 'i'], ['o', 'o'], ['n', 'n']]
     >>> paths0[8][8]
     (7, 7)
     >>> paths0[7][7]
     (6, 6)
     >>> paths0[6][6]
     (5, 5)
     >>> paths0[5][5]
     (4, 4)
     >>> paths0[4][4]
     (3, 4)
     >>> target0[4]
     'u'
     >>> source0[4]
     'n'
     >>> target0[3]
     'c'
     >>> paths0[3][4]
     (2, 3)

The scores in a correct viterbi table should exactly match
those shown above for this example, since the scores above
are the shortest Levenshtein edit distances for the prefixes of
these two words. The contents of the paths table may differ,
however, since there is generally more than one
shortest edit path; in fact, for this example,
there are many edit paths that achieve the cheapest edit
score of 8.  All of them share the property of aligning
one of the 'e's in 'execution' with the 'e' in 'intention'.
In an implementation, which edit path you get depends
on which choice you make when there are ties for the cheapest
edit path to a cell.  The places where such ties arise
are all shown in Figure 3.27.  For example, the lowest
8 in the shaded path corresponds to aligning
'inten' with 'execu' and, using 0-based column-row indexing,
corresponds to paths[5][5].  Since there are three
arrows in that cell, the possible
entries in a correct paths table for that cell are (4,4),
(4,5), and (5,4).  The paths table in the output shown
above has chosen (4,4), which corresponds to following
the shaded path in Figure 3.27.
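
To see where ties arise, one can list every predecessor of a cell that
achieves that cell's minimum cost. The sketch below is an illustration
only: C{tied_predecessors} is a hypothetical name, cells are assumed to be
indexed [target][source], and the costs (1 for insertion and deletion, 2
for substitution) are taken from the example output above.

```python
def tied_predecessors(target, source, i, j, sub_cost=2):
    # Hypothetical helper: all predecessor cells of (i, j), with i, j >= 1,
    # that reach the minimum cost of (i, j). Indexing is [target][source].
    n, m = len(target), len(source)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for a in range(n + 1):
        for b in range(m + 1):
            if a == 0 or b == 0:
                # Aligning a prefix with the empty string costs one per character.
                cost[a][b] = a + b
            else:
                step = 0 if target[a - 1] == source[b - 1] else sub_cost
                cost[a][b] = min(cost[a - 1][b - 1] + step,
                                 cost[a - 1][b] + 1,
                                 cost[a][b - 1] + 1)
    step = 0 if target[i - 1] == source[j - 1] else sub_cost
    options = {
        (i - 1, j - 1): cost[i - 1][j - 1] + step,  # diagonal arrow
        (i - 1, j): cost[i - 1][j] + 1,             # target char vs. empty
        (i, j - 1): cost[i][j - 1] + 1,             # source char vs. empty
    }
    return sorted(cell for cell, c in options.items() if c == cost[i][j])


print(tied_predecessors('execution', 'intention', 5, 5))
```

For cell (5,5) this prints the three candidates named above, (4,4), (4,5),
and (5,4); any one of them is an acceptable entry in a paths table.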

The shaded path in Figure 3.27 is built up by starting
at the point in the paths table which aligns the entire
string C{intention} with the entire string C{execution}.
Using 0-based column-row indexing, that is C{paths[9][9]}.

   >>> paths0[9][9]
   (8, 8)

We then look up (8,8) in the C{paths} table:

   >>> paths0[8][8]
   (7, 7)

We then look up the address in (7,7) and so on, until we get to (0,0).
We are guaranteed to get to (0,0) because each edit step recorded
in the C{paths} table moves to a cell with a smaller row index, a
smaller column index, or both.  This process of constructing
the path backward from the end to the beginning is called
I{backtracing}.  C{pairs} is computed in parallel, while backtracing.
What characters go into C{pairs} depends on where you are coming from.
If you find (7,7) in the paths table and you are coming from (8,8),
that is a substitution (or match) step and you add
C{(target[7],source[7])} to C{pairs}.  If you are coming to (7,7)
from (8,7), you have already seen the character at C{source[7]}.
No character is aligned twice, so this step corresponds to aligning
the character at C{target[7]} with the empty string; using '0' to
represent the empty string, you add C{(target[7],'0')} to C{pairs}.
Thus, you have added an alignment pair corresponding to an
insertion edit.
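
The backtracing step can be sketched as below. This is a hedged
illustration rather than the module's own code: C{backtrace} is a
hypothetical name, the paths table is assumed to be indexed
[target][source], and the small table for 'spat' and 'at' is filled in by
hand with only the cells on one cheapest path.

```python
def backtrace(paths, target, source, empty='0'):
    # Hypothetical helper: follow predecessor cells from the final cell
    # back to (0, 0), collecting one [target_char, source_char] pair per step.
    i, j = len(target), len(source)
    pairs = []
    while (i, j) != (0, 0):
        pi, pj = paths[i][j]
        if (pi, pj) == (i - 1, j - 1):
            # Diagonal step: substitution or match.
            pairs.append([target[pi], source[pj]])
        elif (pi, pj) == (i - 1, j):
            # Only the target index moved: target char aligned with empty.
            pairs.append([target[pi], empty])
        else:
            # Only the source index moved: source char aligned with empty.
            pairs.append([empty, source[pj]])
        i, j = pi, pj
    pairs.reverse()  # pairs were collected from the end of the strings backward
    return pairs


# Hand-built paths table for target 'spat' and source 'at' (assumed path).
demo_paths = [[None] * 3 for _ in range(5)]
demo_paths[4][2] = (3, 1)  # 't' with 't'
demo_paths[3][1] = (2, 0)  # 'a' with 'a'
demo_paths[2][0] = (1, 0)  # 'p' with empty
demo_paths[1][0] = (0, 0)  # 's' with empty
print(backtrace(demo_paths, 'spat', 'at'))
```

Reversing at the end is one design choice; inserting each pair at the
front of the list while backtracing would give the same result.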

You can print the viterbi table in the format used in
Figure 6 using C{print_viterbi_table}.

   >>> print_viterbi_table(vit0,'#'+target0,'#'+source0)

This will print something like the following::

       #  e  x  e  c  u  t  i  o  n
    n 09 08 09 10 11 12 11 10 09 08
    o 08 07 08 09 10 11 10 09 08 09
    i 07 06 07 08 09 10 09 08 09 10
    t 06 05 06 07 08 09 08 09 10 11
    n 05 04 05 06 07 08 09 10 11 10
    e 04 03 04 05 06 07 08 09 10 09
    t 03 04 05 06 07 08 07 08 09 08
    n 02 03 04 05 06 07 08 07 08 07
    i 01 02 03 04 05 06 07 06 07 08
    # 00 01 02 03 04 05 06 07 08 09
       #  e  x  e  c  u  t  i  o  n

Note: the version printed here sticks to the basic format in the book.
Rows are printed counting upwards, columns left to right.  Column
addresses in the viterbi table come before row addresses. The value of
vit0[0][2] is 02. The value of vit0[4][1] is 05, and the value of
vit0[1][4] is 03.  Note also: in comparing this example with my code,
I am treating "intention" as source and "execution" as target.
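
A layout of this kind can be produced along the following lines. This is
a sketch only and makes no claim about how C{print_viterbi_table} is
written: C{format_viterbi_table} is a hypothetical name, the table is
assumed to be addressed column first (C{table[col][row]}), and the tiny
table in the usage lines is made up for illustration.

```python
def format_viterbi_table(table, col_labels, row_labels):
    # Hypothetical helper: returns the lines of the printed table.
    # Assumes table[col][row], with two-digit zero-padded cells.
    header = '  ' + ' '.join('%2s' % ch for ch in col_labels)
    lines = [header]
    # Rows count upwards on the page, so the last row is emitted first.
    for r in range(len(row_labels) - 1, -1, -1):
        cells = ' '.join('%02d' % table[c][r] for c in range(len(col_labels)))
        lines.append('%s %s' % (row_labels[r], cells))
    lines.append(header)
    return lines


# Made-up 3-column, 2-row table: columns '#', 'a', 'b'; rows '#', 'a'.
demo_table = [[0, 1], [1, 0], [2, 1]]
for line in format_viterbi_table(demo_table, '#ab', '#a'):
    print(line)
```

Returning the lines instead of printing them directly keeps the layout
easy to check separately from the output step.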

Other examples:

     >>> (target1,source1) = ('spat','at')
     >>> (vit1,paths1,pairs1) = align(target1,source1)
     s p a t
     0 0 a t
     - - - -
     1 1 0 0
     Total: 2
     >>> (target2,source2) = ('faltluence', 'flatulence')
     >>> (vit2,paths2,pairs2) = align(target2,source2)
     f 0 a l t 0 l u e n c e
     f l a 0 t u l 0 e n c e
     - - - - - - - - - - - -
     0 1 0 1 0 1 0 1 0 0 0 0
     Total: 4
     >>> (target3,source3) = ('fluency', 'flatulence')
     >>> (vit3,paths3,pairs3) = align(target3,source3)
     f l 0 0 u 0 e n c y
     f l a t u l e n c e
     - - - - - - - - - -
     0 0 1 1 0 1 0 0 0 2
     Total: 5
     >>> target4
     'drive'
     >>> source4
     'brief'
     >>> target5
     'drive'
     >>> source5
     'divers'

Output for targets 4 and 5 has been suppressed since computing
these is part of the assignment.
"""