Package aligner :: Module align

Module align

Compute minimum edit distance between two strings, source and target, using Levenshtein distance. Return a list of pairs representing the alignment of the two strings resulting from the minimal editing path from source to target.

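We build two tables:

  • viterbi: the cheapest edit cost for each cell.
  • paths: the predecessor of each cell on its cheapest edit path.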

We also compute pairs, an alignment of characters in the two strings that corresponds to executing the minimum distance edits that produce target from source; thus, pairs is a least cost alignment.

Functions
 
align(target, source)
Compute best alignment using Levenshtein distance.
 
initialize_tables(viterbi, paths, target_len, source_len)
First row and first column are computed at init time since each of those cells has only one possible predecessor cell.
 
follow_path(last, paths, target, source)
This function is called with last set to the cell coordinate for the upper right-hand corner of the paths table.
 
print_viterbi_table(viterbi, target, source)
 
print_alignment(pairs)
pairs is a sequence of character pairs p, such that p[0] is a character from the target string aligned with p[1] from the source string.
Variables
int initial_cost = 2147483647
the absurd cost, used to initialize values in the Viterbi table
tuple initial_predecessor = (-1, -1)
the absurd predecessor, a non-existent row-column pair used to initialize values in the paths table.
dictionary edit_costs = {'deletion': 1, 'insertion': 1, 'substitution': 2}
Change this to alter the cost of the three possible editing operations, substitution, deletion, and insertion.
  eps = '0'
the string value representing the empty string in printing alignments.
  a_source0 = '#intention'
  a_target0 = '#execution'
string source0 = 'intention'
example value for source word.
  source1 = 'at'
  source2 = 'flatulence'
  source3 = 'flatulence'
  source4 = 'brief'
  source5 = 'divers'
string target0 = 'execution'
example value for target word.
  target1 = 'spat'
  target2 = 'faltluence'
  target3 = 'fluency'
  target4 = 'drive'
  target5 = 'drive'
Function Details

align(target, source)

Compute best alignment using Levenshtein distance. Return a Viterbi table, a paths table, and a list of pairs.

The two tables:

  • viterbi: a table in which each cell (i,j) contains the cheapest edit cost, an int, for the alignment of source[:i+1] with target[:j+1].
  • paths: a table in which each cell (i,j) contains the (row,col) pair for the predecessor cell in the cheapest edit path to (i,j).

Columns in the tables cover target characters; rows cover source characters.

We also compute pairs, an alignment of characters in the two strings that corresponds to executing the minimum distance edits that produce target from source. This least cost alignment is represented as a sequence of pairs p, such that p[0] (from target) is aligned with p[1] (from source).
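
The table-filling step described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name fill_tables is hypothetical, the tables are assumed to be dictionaries keyed by (target position, source position), and both strings are assumed to start with a '#' boundary symbol (as in a_source0 and a_target0) so that position 0 stands for the empty prefix. Ties between equally cheap predecessors are broken by tuple ordering, which is also an assumption.

```python
INITIAL_PREDECESSOR = (-1, -1)
EDIT_COSTS = {'deletion': 1, 'insertion': 1, 'substitution': 2}


def fill_tables(target, source, costs=EDIT_COSTS):
    """Hypothetical helper: fill the cost table and the predecessor table.

    Assumes both strings begin with '#', and that cells are keyed as
    (target position, source position).
    """
    viterbi = {(0, 0): 0}
    paths = {(0, 0): INITIAL_PREDECESSOR}

    # Border cells have exactly one possible predecessor.
    for t in range(1, len(target)):
        viterbi[(t, 0)] = viterbi[(t - 1, 0)] + costs['insertion']
        paths[(t, 0)] = (t - 1, 0)
    for s in range(1, len(source)):
        viterbi[(0, s)] = viterbi[(0, s - 1)] + costs['deletion']
        paths[(0, s)] = (0, s - 1)

    # Interior cells take the cheapest of three predecessors.
    for t in range(1, len(target)):
        for s in range(1, len(source)):
            # Matching characters are copied at no cost.
            sub = 0 if target[t] == source[s] else costs['substitution']
            candidates = [
                (viterbi[(t - 1, s - 1)] + sub, (t - 1, s - 1)),
                (viterbi[(t, s - 1)] + costs['deletion'], (t, s - 1)),
                (viterbi[(t - 1, s)] + costs['insertion'], (t - 1, s)),
            ]
            viterbi[(t, s)], paths[(t, s)] = min(candidates)
    return viterbi, paths


viterbi, paths = fill_tables('#execution', '#intention')
print(viterbi[(9, 9)])  # 8
```

With the default edit_costs, a substitution costs the same as a deletion followed by an insertion, so the total for 'intention' and 'execution' is 8 rather than the 5 obtained with unit substitution cost.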

initialize_tables(viterbi, paths, target_len, source_len)

First row and first column are computed at init time since each of those cells has only one possible predecessor cell.

  • First col: viterbi((0,i)) = viterbi((0,i-1)) + deletion_cost
  • First row: viterbi((i,0)) = viterbi((i-1,0)) + insertion_cost

For all other cells, we proceed as follows:

  • in paths we enter the absurd predecessor initial_predecessor.
  • in viterbi we enter the absurd cost initial_cost.
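
A rough sketch of this initialization is given below. It is an illustration under stated assumptions rather than the module's source: the function is named initialize_tables_sketch to mark it as hypothetical, the tables are assumed to be dictionaries keyed by coordinate pairs as in the formulas above, the two lengths are assumed to count a leading boundary symbol, and cell (0,0) is assumed to start at cost 0 with the absurd predecessor.

```python
initial_cost = 2147483647
initial_predecessor = (-1, -1)
edit_costs = {'deletion': 1, 'insertion': 1, 'substitution': 2}


def initialize_tables_sketch(viterbi, paths, target_len, source_len):
    """Hypothetical sketch: fill both tables in place before the main loop."""
    # Assumed starting cell: nothing consumed, nothing paid.
    viterbi[(0, 0)] = 0
    paths[(0, 0)] = initial_predecessor

    # First col: each cell is reached only by one more deletion.
    for i in range(1, source_len):
        viterbi[(0, i)] = viterbi[(0, i - 1)] + edit_costs['deletion']
        paths[(0, i)] = (0, i - 1)

    # First row: each cell is reached only by one more insertion.
    for i in range(1, target_len):
        viterbi[(i, 0)] = viterbi[(i - 1, 0)] + edit_costs['insertion']
        paths[(i, 0)] = (i - 1, 0)

    # All other cells get the absurd placeholders.
    for i in range(1, target_len):
        for j in range(1, source_len):
            viterbi[(i, j)] = initial_cost
            paths[(i, j)] = initial_predecessor


viterbi, paths = {}, {}
initialize_tables_sketch(viterbi, paths, target_len=5, source_len=3)
print(viterbi[(4, 0)], viterbi[(0, 2)])  # 4 2
```

Because initial_cost is larger than any real edit cost, a cell still holding it after the main loop would signal that the cell was never computed.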

follow_path(last, paths, target, source)

This function is called with last set to the cell coordinate for the upper right-hand corner of the paths table. last is a pair of integers (i,j) corresponding to positions i in target and j in source, and it is the last cell visited in the best edit path. That cell contains the coordinates of the second-to-last step in the best edit path. Going to the previous cell yields the coordinates of the third-to-last cell, and so on, until we are led inevitably back to the first step, which is always (0,0).

Return the list of character pairs (not cell coordinates) from source and target, corresponding to the cell coordinates visited, from first to last.
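
The backward walk can be sketched roughly as follows. The name follow_path_sketch is hypothetical, and several details are assumptions: paths is taken to be a dictionary mapping each cell to its predecessor, both strings are taken to start with a '#' boundary symbol, the boundary cell (0,0) itself contributes no pair, and a side that does not advance in a step is shown as eps.

```python
eps = '0'  # printed in place of the empty string


def follow_path_sketch(last, paths, target, source):
    """Hypothetical sketch: walk predecessors back from last to (0, 0).

    Returns (target character, source character) pairs, first to last.
    """
    pairs = []
    cell = last
    while cell != (0, 0):
        prev = paths[cell]
        t, s = cell
        prev_t, prev_s = prev
        # A coordinate that did not move consumed no character on that side.
        target_char = target[t] if t != prev_t else eps
        source_char = source[s] if s != prev_s else eps
        pairs.append((target_char, source_char))
        cell = prev
    pairs.reverse()  # collected last-to-first, returned first-to-last
    return pairs


# Hand-built predecessor table for source 'at' and target 'spat':
# two insertions ('s', 'p') followed by two matches ('a', 't').
paths = {(1, 0): (0, 0), (2, 0): (1, 0), (3, 1): (2, 0), (4, 2): (3, 1)}
print(follow_path_sketch((4, 2), paths, '#spat', '#at'))
# [('s', '0'), ('p', '0'), ('a', 'a'), ('t', 't')]
```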