Compute minimum edit distance between two strings, source
and target, using Levenshtein distance. Return a triple
consisting of a viterbi table, a paths table, and a list of pairs. The
list of pairs represents the string alignment resulting from following
the minimal editing path from source to target.
Print out the alignment corresponding to that list of pairs and the
Levenshtein edit distance for that edit path.
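The two tables returned are the viterbi table, which holds the minimum edit distance for each pair of prefixes of the two strings, and the paths table, which holds for each cell the address of the preceding cell on a cheapest edit path to that cell.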
The algorithm follows exactly the pseudocode given in J&M for Viterbi alignment. A set of example target-source pairs is provided at the end of the align module. Here are some results:
>>> (target0,source0) = ('execution','intention')
>>> (vit0,paths0,pairs0) = align(target0,source0)
i n t e 0 n t i o n
0 e x e c u t i o n
- - - - - - - - - -
1 2 2 0 1 2 0 0 0 0
Total: 8
>>> vit0
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [1, 2, 3, 4, 5, 6, 7, 6, 7, 8],
 [2, 3, 4, 5, 6, 7, 8, 7, 8, 7],
 [3, 4, 5, 6, 7, 8, 7, 8, 9, 8],
 [4, 3, 4, 5, 6, 7, 8, 9, 10, 9],
 [5, 4, 5, 6, 7, 8, 9, 10, 11, 10],
 [6, 5, 6, 7, 8, 9, 8, 9, 10, 11],
 [7, 6, 7, 8, 9, 10, 9, 8, 9, 10],
 [8, 7, 8, 9, 10, 11, 10, 9, 8, 9],
 [9, 8, 9, 10, 11, 12, 11, 10, 9, 8]]
>>> pairs0
[['0', 'i'], ['e', 'n'], ['x', 't'], ['e', 'e'], ['c', '0'], ['u', 'n'], ['t', 't'], ['i', 'i'], ['o', 'o'], ['n', 'n']]
>>> paths0[8][8]
(7, 7)
>>> paths0[7][7]
(6, 6)
>>> paths0[6][6]
(5, 5)
>>> paths0[5][5]
(4, 4)
>>> paths0[4][4]
(3, 4)
>>> target0[4]
'u'
>>> source0[4]
'n'
>>> target0[3]
'c'
>>> paths0[3][4]
(2, 3)
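The scores in vit0 can be reproduced with a short dynamic-programming sketch. This is not the align module's code: the function name edit_distance_table is hypothetical, the costs (1 for an insertion or deletion, 2 for a substitution) are inferred from the per-column costs printed above, and the table is assumed to be stored as a list of rows indexed by source prefix length, as the vit0 listing suggests.

```python
def edit_distance_table(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    """Return table[i][j], the minimum edit distance between
    source[:i] and target[:j] (hypothetical helper, assumed costs)."""
    n_rows, n_cols = len(source) + 1, len(target) + 1
    table = [[0] * n_cols for _ in range(n_rows)]

    # First column: delete every character of the source prefix.
    for i in range(1, n_rows):
        table[i][0] = table[i - 1][0] + del_cost
    # First row: insert every character of the target prefix.
    for j in range(1, n_cols):
        table[0][j] = table[0][j - 1] + ins_cost

    for i in range(1, n_rows):
        for j in range(1, n_cols):
            # A matching character costs nothing on the diagonal step.
            step = 0 if source[i - 1] == target[j - 1] else sub_cost
            table[i][j] = min(table[i - 1][j] + del_cost,
                              table[i][j - 1] + ins_cost,
                              table[i - 1][j - 1] + step)
    return table


table = edit_distance_table('execution', 'intention')
print(table[-1][-1])  # → 8
```

The sketch fills in scores only; it records no back-pointers, so it does not produce a paths table or an alignment.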
The scores in a correct viterbi table should exactly mirror those shown above for this example, since they are the shortest Levenshtein edit distances for the prefixes of these two words. The contents of the paths table may differ, however, since there is generally more than one shortest edit path; in fact, for this example, there are many edit paths that achieve the cheapest edit score of 8. All of them share the property of aligning one of the 'e's in 'execution' with the 'e' in 'intention'.

Which edit path you get depends on the choice you make when there are ties for the cheapest edit path to a cell. The places where such ties arise are all shown in Figure 3.27. For example, the lowest 8 in the shaded path corresponds to aligning 'inten' with 'execu' and, using 0-based column-row indexing, corresponds to paths[5][5]. Since there are three arrows in that cell, the possible entries in a correct paths table for that cell are (4,4), (4,5), and (5,4). The paths table in the output shown above has chosen (4,4), which corresponds to following the shaded path in Figure 3.27.
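To see where ties arise, the sketch below lists every neighbouring cell from which a given cell can be reached at minimum cost. The helper name tied_predecessors is hypothetical, the costs (1 for an insertion or deletion, 2 for a substitution) are assumptions inferred from the example output, and VIT0 is the vit0 listing copied from above and read as a list of rows.

```python
# Scores copied from the vit0 listing in this document.
VIT0 = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 5, 6, 7, 6, 7, 8],
    [2, 3, 4, 5, 6, 7, 8, 7, 8, 7],
    [3, 4, 5, 6, 7, 8, 7, 8, 9, 8],
    [4, 3, 4, 5, 6, 7, 8, 9, 10, 9],
    [5, 4, 5, 6, 7, 8, 9, 10, 11, 10],
    [6, 5, 6, 7, 8, 9, 8, 9, 10, 11],
    [7, 6, 7, 8, 9, 10, 9, 8, 9, 10],
    [8, 7, 8, 9, 10, 11, 10, 9, 8, 9],
    [9, 8, 9, 10, 11, 12, 11, 10, 9, 8],
]


def tied_predecessors(table, i, j, source, target,
                      ins_cost=1, del_cost=1, sub_cost=2):
    """Return every cell from which table[i][j] is reachable at minimum cost
    (hypothetical helper; costs are assumptions)."""
    candidates = []
    if i > 0 and j > 0:
        step = 0 if source[i - 1] == target[j - 1] else sub_cost
        candidates.append((table[i - 1][j - 1] + step, (i - 1, j - 1)))
    if i > 0:
        candidates.append((table[i - 1][j] + del_cost, (i - 1, j)))
    if j > 0:
        candidates.append((table[i][j - 1] + ins_cost, (i, j - 1)))
    if not candidates:  # cell (0, 0) has no predecessor
        return []
    best = min(cost for cost, _ in candidates)
    return [cell for cost, cell in candidates if cost == best]


print(tied_predecessors(VIT0, 5, 5, 'intention', 'execution'))
# → [(4, 4), (4, 5), (5, 4)]
```

Any one of the returned cells is a valid entry for the paths table; always taking the first one is just one possible tie-breaking rule.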
The shaded path in Figure 3.27 is built up by starting at the point in
the paths table which aligns the entire string intention
with the entire string execution. Using 0-based column-row
indexing, that is paths[9][9].
>>> paths[9][9]
(8,8)
We then look up (8,8) in the paths table:
>>> paths[8][8]
(7,7)
We then look up the address in (7,7) and so on, until we get to (0,0).
We are guaranteed to get to (0,0) because every entry in the paths table
points to a cell whose column index, row index, or both are smaller by
one. This process of constructing the path backward from the end to the
beginning is called backtracing.

pairs is computed in parallel, while backtracing. What characters go
into pairs depends on
where you are coming from. If you find (7,7) in the paths table and you
are coming from (8,8), that is a substitution step and you add
(target[7],source[7]) to pairs. If you are coming to (7,7) from (8,7),
you have already seen the character at source[7]. No
character is aligned twice, so this step corresponds to aligning the
character at target[7] with the empty string; using '0' to
represent the empty string, you add (target[7],'0') to
pairs. Thus, you have added an alignment pair corresponding
to an insertion edit.
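The backtracing steps above can be sketched as follows. This is a minimal illustration, not the align module's code: the function name backtrace is hypothetical, and it assumes paths[i][j] stores the address of the previous cell with the target index first, as in the description above.

```python
def backtrace(paths, target, source, gap='0'):
    """Follow back-pointers from the final cell to (0, 0) and return the
    alignment as a list of [target_char, source_char] pairs
    (hypothetical helper; the paths layout is an assumption)."""
    i, j = len(target), len(source)
    pairs = []
    while (i, j) != (0, 0):
        prev_i, prev_j = paths[i][j]
        # An index that did not move contributes the empty-string marker.
        t = target[prev_i] if prev_i < i else gap
        s = source[prev_j] if prev_j < j else gap
        pairs.append([t, s])
        i, j = prev_i, prev_j
    pairs.reverse()  # pairs were collected from the end to the beginning
    return pairs


# Hand-built back-pointers for target 'spat' and source 'at';
# only the cells on the chosen path are filled in.
paths = [[None] * 3 for _ in range(5)]
paths[4][2] = (3, 1)
paths[3][1] = (2, 0)
paths[2][0] = (1, 0)
paths[1][0] = (0, 0)

print(backtrace(paths, 'spat', 'at'))
# → [['s', '0'], ['p', '0'], ['a', 'a'], ['t', 't']]
```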
You can print the viterbi table in the format used in Figure 6 using
print_viterbi_table.
>>> print_viterbi_table(vit0,'#'+target0,'#'+source0)
This will print something like the following:
# e x e c u t i o n
n 09 08 09 10 11 12 11 10 09 08
o 08 07 08 09 10 11 10 09 08 09
i 07 06 07 08 09 10 09 08 09 10
t 06 05 06 07 08 09 08 09 10 11
n 05 04 05 06 07 08 09 10 11 10
e 04 03 04 05 06 07 08 09 10 09
t 03 04 05 06 07 08 07 08 09 08
n 02 03 04 05 06 07 08 07 08 07
i 01 02 03 04 05 06 07 06 07 08
# 00 01 02 03 04 05 06 07 08 09
# e x e c u t i o n
Note: the version printed here sticks to the basic format in the book. Rows are printed counting upwards, columns left to right. Column addresses in the viterbi table come before row addresses. The value of vit0[0][2] is 02, the value of vit0[4][1] is 05, and the value of vit0[1][4] is 03. Note also: in comparing this example with my code, I am treating "intention" as source and "execution" as target.
Other examples.
>>> (target1,source1) = ('spat','at')
>>> (vit1,paths1,pairs1) = align(target1,source1)
s p a t
0 0 a t
- - - -
1 1 0 0
Total: 2
>>> (target2,source2) = ('faltluence', 'flatulence')
>>> (vit2,paths2,pairs2) = align(target2,source2)
f 0 a l t 0 l u e n c e
f l a 0 t u l 0 e n c e
- - - - - - - - - - - -
0 1 0 1 0 1 0 1 0 0 0 0
Total: 4
>>> (target3,source3) = ('fluency', 'flatulence')
>>> (vit3,paths3,pairs3) = align(target3,source3)
f l 0 0 u 0 e n c y
f l a t u l e n c e
- - - - - - - - - -
0 0 1 1 0 1 0 0 0 2
Total: 5
>>> target4
'drive'
>>> source4
'brief'
>>> target5
'drive'
>>> source5
'divers'
Output for targets 4 and 5 has been suppressed since computing these is part of the assignment.
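The per-column costs and the total printed under each alignment can be derived from the list of pairs alone. The sketch below shows one way to do that; the names pair_costs and format_alignment are hypothetical, the costs (1 for a pair containing '0', 2 for a mismatch, 0 for a match) are inferred from the examples above, and printing the first element of each pair on the top row is an assumption.

```python
def pair_costs(pairs, gap='0', indel_cost=1, sub_cost=2):
    """Return the edit cost of each aligned pair (assumed costs)."""
    costs = []
    for first, second in pairs:
        if first == gap or second == gap:
            costs.append(indel_cost)  # insertion or deletion
        elif first == second:
            costs.append(0)           # match
        else:
            costs.append(sub_cost)    # substitution
    return costs


def format_alignment(pairs, gap='0'):
    """Render pairs in the four-row layout used in the examples,
    followed by the total cost (hypothetical helper)."""
    costs = pair_costs(pairs, gap)
    lines = [
        ' '.join(first for first, _ in pairs),
        ' '.join(second for _, second in pairs),
        ' '.join('-' for _ in pairs),
        ' '.join(str(cost) for cost in costs),
        'Total: %d' % sum(costs),
    ]
    return '\n'.join(lines)


pairs1 = [['s', '0'], ['p', '0'], ['a', 'a'], ['t', 't']]
print(format_alignment(pairs1))
```

For pairs1 this prints the same five lines as the 'spat'/'at' example above.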
| Generated by Epydoc 3.0.1 on Mon Mar 16 12:17:34 2009 | http://epydoc.sourceforge.net |