The modules in this package export functions which measure the so called edit distance (also called Levenshtein distance) between two strings, a source and a target. The edit distance is defined as the number of deletions, insertions, or substitutions required to transform the source into the target. The greater the distance, the more different the strings are, and vice versa. Edit distance can be (and has been) used for spell checking and speech recognition purposes.
The distribution contains two functionally equivalent implementations, one in C linked into Oz, and one in pure Oz. They are both straightforward implementations of Levenshtein's algorithm - a dynamic programming algorithm capable of calculating the edit distance in time proportional to the length of the source times the length of the target. The C-based version is roughly eight times faster than the pure Oz version, and is therefore recommended for serious use.
Download the package, and invoke ozmake in a shell as follows:
ozmake --install --package=lager-levenshtein.pkg
By default, all files of the package are installed in the user's ~/.oz
directory tree. In particular, all modules are installed in the user's private
cache.
The C version:
import Levenshtein at 'x-ozlib://lager/levenshtein/Levenshtein.so{native}'
...{Levenshtein.distance +S1 +S2 ?I}
The pure Oz version:
import Levenshtein at 'x-ozlib://lager/levenshtein/Levenshtein.ozf'
...{Levenshtein.distance +S1 +S2 ?I}
For example,
{Levenshtein.distance "Mozart" "Mogul" I}
binds I
to 4
, since three substitutions and one deletion
is required to transform "Mozart" into "Mogul".
The distribution also includes match
, a stand-alone demo application
which reads and tokenizes a text file and prints each token - if it is within
a given edit distance to a given target word - to a file, or to stdout
by default. It can be invoked in the following way:
bash$ ./match -i test.txt -t lovely -l 1 -h 2 closely levels Execution time: 40 ms bash$
Invoke match
--help
to see what options are available.
Among other things, there is an option that will let you compare the speed of
the two implementations.
A thread discussing fast edit distance algorithms appeared on a the corpora mailing list a couple of years ago, and is summarized here:
Note that the C code used in the native functor described above appears at the very end of this document, and that it was contributed by Gertjan van Noord.