Package aligner

Compute minimum edit distance between two strings, source and target, using Levenshtein distance. Return a triple consisting of a viterbi table, a paths table, and a list of pairs. The list of pairs represents the string alignment resulting from following the minimal editing path from source to target. Print out the alignment corresponding to that list of pairs and the Levenshtein edit distance for that edit path.

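The two tables returned are the viterbi table, which holds the minimum edit distance for each pair of prefixes of the two strings, and the paths table, which holds for each cell the address of the preceding cell on a cheapest edit path to it.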

The algorithm used follows the pseudocode given in J&M for Viterbi alignment. A set of example target-source pairs is provided at the end of the align module. Here are some results:

>>> (target0,source0) = ('execution','intention')
>>> (vit0,paths0,pairs0) = align(target0,source0)
i n t e 0 n t i o n
0 e x e c u t i o n
- - - - - - - - - -
1 2 2 0 1 2 0 0 0 0
Total: 8
>>> vit0
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[1, 2, 3, 4, 5, 6, 7, 6, 7, 8],
[2, 3, 4, 5, 6, 7, 8, 7, 8, 7],
[3, 4, 5, 6, 7, 8, 7, 8, 9, 8],
[4, 3, 4, 5, 6, 7, 8, 9, 10, 9],
[5, 4, 5, 6, 7, 8, 9, 10, 11, 10],
[6, 5, 6, 7, 8, 9, 8, 9, 10, 11],
[7, 6, 7, 8, 9, 10, 9, 8, 9, 10],
[8, 7, 8, 9, 10, 11, 10, 9, 8, 9],
[9, 8, 9, 10, 11, 12, 11, 10, 9, 8]]
>>> pairs0
[['0', 'i'], ['e', 'n'], ['x', 't'], ['e', 'e'], ['c', '0'], ['u', 'n'], ['t', 't'], ['i', 'i'], ['o', 'o'], ['n', 'n']]
>>> paths0[8][8]
(7, 7)
>>> paths0[7][7]
(6, 6)
>>> paths0[6][6]
(5, 5)
>>> paths0[5][5]
(4, 4)
>>> paths0[4][4]
(3, 4)
>>> target0[4]
'u'
>>> source0[4]
'n'
>>> target0[3]
'c'
>>> paths0[3][4]
(2, 3)
>>>

The scores in a correct viterbi table should match those shown above exactly for this example, since they are the shortest Levenshtein edit distances for the prefixes of these two words. The contents of the paths table may differ, however, since there is generally more than one shortest edit path; in fact, for this example, there are many edit paths that achieve the cheapest edit score of 8. All of them share the property of aligning one of the 'e's in 'execution' with the 'e' in 'intention'. In an implementation, which edit path you get depends on which choice you make when there is a tie for the cheapest edit path to a cell. The places where such ties arise are all shown in Figure 3.27. For example, the lowest 8 in the shaded path corresponds to aligning 'inten' with 'execu', and using 0-based column-row indexing, corresponds to paths[5][5]. Since there are three arrows in that cell, the possible entries in a correct paths table for that cell are (4,4), (4,5), and (5,4). The paths table in the output shown above has chosen (4,4), which corresponds to following the shaded path in Figure 3.27.
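The tie-breaking described above can be sketched as follows. This is a minimal illustration, not the code in the align module: the name fill_tables and its cost parameters are hypothetical, the costs (1 for an insertion or deletion, 2 for a substitution) are inferred from the example output, and cells are assumed to be addressed with the target index first. Ties are broken in favour of the diagonal step, which is only one of several valid choices.

```python
def fill_tables(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    """Hypothetical helper: fill a viterbi table and a paths table.

    Cells are addressed [target index][source index]; index 0 stands
    for the empty prefix of the corresponding string.
    """
    n, m = len(target), len(source)
    vit = [[0] * (m + 1) for _ in range(n + 1)]
    paths = [[None] * (m + 1) for _ in range(n + 1)]

    # Edges of the table: only one kind of step is possible.
    for i in range(1, n + 1):
        vit[i][0] = vit[i - 1][0] + ins_cost
        paths[i][0] = (i - 1, 0)
    for j in range(1, m + 1):
        vit[0][j] = vit[0][j - 1] + del_cost
        paths[0][j] = (0, j - 1)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = target[i - 1] == source[j - 1]
            candidates = [
                (vit[i - 1][j - 1] + (0 if same else sub_cost), (i - 1, j - 1)),
                (vit[i - 1][j] + ins_cost, (i - 1, j)),
                (vit[i][j - 1] + del_cost, (i, j - 1)),
            ]
            # min() returns the first of several equally cheap candidates,
            # so the order of the list above is the tie-breaking rule.
            vit[i][j], paths[i][j] = min(candidates, key=lambda c: c[0])
    return vit, paths


vit, paths = fill_tables('execution', 'intention')
print(vit[9][9])    # 8
print(paths[5][5])  # (4, 4)
```

Reordering the candidates list changes which back-pointer is stored at a tied cell, but not the scores.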

The shaded path in Figure 3.27 is built up by starting at the point in the paths table which aligns the entire string intention with the entire string execution. Using 0-based column-row indexing, that is paths[9][9].

>>> paths[9][9]
(8,8)

We then look up (8,8) in the paths table:

>>> paths[8][8]
(7,7)

We then look up the address in (7,7) and so on, until we get to (0,0). We are guaranteed to get to (0,0) because every entry in the paths table points to a cell one step back along the column, the row, or both, so each lookup moves closer to (0,0). This process of constructing the path backward from the end to the beginning is called backtracing.

pairs is computed in parallel, while backtracing. What characters go into pairs depends on where you are coming from. If you find (7,7) in the paths table and you are coming from (8,8), that is a substitution step and you add (target[7],source[7]) to pairs. If you are coming to (7,7) from (8,7), you have already seen the character at source[7]. No character is aligned twice, so this step corresponds to aligning the character at target[7] with the empty string; using '0' to represent the empty string, you add (target[7],'0') to pairs. Thus, you have added an alignment pair corresponding to an insertion edit. Symmetrically, if you are coming to (7,7) from (7,8), you add ('0',source[7]) to pairs.
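A minimal sketch of backtracing, assuming a paths table addressed with the target index first and whose entries are (target index, source index) tuples. The name backtrace and the hand-built demo_paths table are hypothetical and are not part of the align module.

```python
def backtrace(paths, target, source, empty='0'):
    """Hypothetical helper: follow back-pointers from the final cell to (0, 0)."""
    pairs = []
    cell = (len(target), len(source))
    while cell != (0, 0):
        i, j = cell
        prev_i, prev_j = paths[i][j]
        if prev_i < i and prev_j < j:
            # Diagonal step: one character from each string.
            pairs.append([target[prev_i], source[prev_j]])
        elif prev_i < i:
            # Only the target index moved: target character against the empty string.
            pairs.append([target[prev_i], empty])
        else:
            # Only the source index moved: empty string against source character.
            pairs.append([empty, source[prev_j]])
        cell = (prev_i, prev_j)
    pairs.reverse()  # pairs were collected from the end to the beginning
    return pairs


# Hand-built back-pointers for target 'spat' and source 'at';
# only the cells on the path are filled in.
demo_paths = [[None] * 3 for _ in range(5)]
demo_paths[4][2] = (3, 1)
demo_paths[3][1] = (2, 0)
demo_paths[2][0] = (1, 0)
demo_paths[1][0] = (0, 0)

print(backtrace(demo_paths, 'spat', 'at'))
# [['s', '0'], ['p', '0'], ['a', 'a'], ['t', 't']]
```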

You can print the viterbi table in the format used in Figure 6 using print_viterbi_table.

>>> print_viterbi_table(vit0,'#'+target0,'#'+source0)

This will print something like the following:

     #  e  x  e  c  u  t  i  o  n 
   n 09 08 09 10 11 12 11 10 09 08
   o 08 07 08 09 10 11 10 09 08 09
   i 07 06 07 08 09 10 09 08 09 10
   t 06 05 06 07 08 09 08 09 10 11
   n 05 04 05 06 07 08 09 10 11 10
   e 04 03 04 05 06 07 08 09 10 09
   t 03 04 05 06 07 08 07 08 09 08
   n 02 03 04 05 06 07 08 07 08 07
   i 01 02 03 04 05 06 07 06 07 08
   # 00 01 02 03 04 05 06 07 08 09
      #  e  x  e  c  u  t  i  o  n 

Note: the version printed here sticks to the basic format in the book. Rows are printed counting upwards, columns left to right. Addresses into the viterbi table give the column before the row, so in the table printed above the value of vit0[0][2] is 02, the value of vit0[4][1] is 05, and the value of vit0[1][4] is 03. Note also: in comparing this example with my code, I am treating "intention" as source and "execution" as target.
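The layout described in this note can be sketched as follows. This is not the module's print_viterbi_table: the name format_viterbi_table is hypothetical, it returns a string rather than printing, and it assumes the table is addressed with the column index first.

```python
def format_viterbi_table(vit, col_labels, row_labels):
    """Hypothetical formatter: rows count upwards, columns left to right.

    vit is assumed to be addressed [column][row].
    """
    lines = []
    # The last row is printed first so that row 0 ends up at the bottom.
    for r in range(len(row_labels) - 1, -1, -1):
        cells = ' '.join('%02d' % vit[c][r] for c in range(len(col_labels)))
        lines.append('%s %s' % (row_labels[r], cells))
    # Column labels sit under the second digit of each cell.
    lines.append('   ' + '  '.join(col_labels))
    return '\n'.join(lines)


# Tiny table for the column string 'ab' and the row string 'b'.
small = [[0, 1], [1, 2], [2, 1]]
print(format_viterbi_table(small, '#ab', '#b'))
```

For the tiny table, the cell in column 2, row 1 is small[2][1], which is printed as 01 at the right end of the top line.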

Other examples.

>>> (target1,source1) = ('spat','at')
>>> (vit1,paths1,pairs1) = align(target1,source1)
s p a t
0 0 a t
- - - -
1 1 0 0
Total: 2
>>> (target2,source2) = ('faltluence', 'flatulence')
>>> (vit2,paths2,pairs2) = align(target2,source2)
f 0 a l t 0 l u e n c e
f l a 0 t u l 0 e n c e
- - - - - - - - - - - -
0 1 0 1 0 1 0 1 0 0 0 0
Total: 4
>>> (target3,source3) = ('fluency', 'flatulence')
>>> (vit3,paths3,pairs3) = align(target3,source3)
f l 0 0 u 0 e n c y
f l a t u l e n c e
- - - - - - - - - -
0 0 1 1 0 1 0 0 0 2
Total: 5
>>> target4
'drive'
>>> source4
'brief'
>>> target5
'drive'
>>> source5
'divers'

Output for targets 4 and 5 has been suppressed since computing these is part of the assignment.
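The alignment display used in the examples above can be rebuilt from a list of pairs. The sketch below is an illustration only: the name format_alignment is hypothetical, the first element of each pair is assumed to go on the top line (as in the 'spat' example), and the costs (1 for a pair containing '0', 2 for a mismatch, 0 for a match) are inferred from the output shown.

```python
def format_alignment(pairs, empty='0', sub_cost=2):
    """Hypothetical helper: render pairs in the four-line display plus a total."""
    costs = []
    for top, bottom in pairs:
        if top == bottom:
            costs.append(0)          # matching characters
        elif top == empty or bottom == empty:
            costs.append(1)          # insertion or deletion
        else:
            costs.append(sub_cost)   # substitution
    lines = [
        ' '.join(top for top, _ in pairs),
        ' '.join(bottom for _, bottom in pairs),
        ' '.join('-' for _ in pairs),
        ' '.join(str(cost) for cost in costs),
        'Total: %d' % sum(costs),
    ]
    return '\n'.join(lines)


print(format_alignment([['s', '0'], ['p', '0'], ['a', 'a'], ['t', 't']]))
```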

Submodules
  • aligner.align: Compute minimum edit distance between two strings, source and target, using Levenshtein distance.