# If Google would be looking to hire F# programmers – part 7

In this installment of this series we turn to string edit distance problem. Edit distance measures number of elementary operations on string characters required to transform one string into the other. Of many possible definitions of edit distance we are going to consider Levenshtein distance: a minimum number of single-character mutations (insert character, delete character, substitute character) required for transformation of source string into the target string.

The following relations hold for Levenshtein distance **d** between source string **s** and target string **t**:

**d**between two empty strings is, apparently,**0****d**between empty and non-empty string is the**length of non-empty string**- for both non-empty strings if we present them in the form of prefix and the last character each, i.e.
**sourcep+sc**and**targetp+tc**, then**d(sourcep+sc,targetp+tc) = min (****d(sourcep,targetp) + if sc = tc then 0 else 1**(whatever it takes to edit**sourcep**into**targetp**, then substitute**sc**by**tc**, if not equal)**d(sourcep+sc,targetp) + 1**(edit**sourcep+sc**into**targetp**, then insert**tc**)**d(sourcep,targetp+tc) + 1**(delete**sc**, then edit**sourcep**into**targetp+tc**)

**)**

Although this looks like a case for expressing recurrence via recursion, this is not a good idea for anything, but very short strings as amount of recursive calls on substrings will grow exponentially.

So, as the edit distance is the minimum of edit distances of shorter string(s) this allows to solve the problem with dynamic programming technique. If we represent Levenshtein distances in a matrix holding distances between all prefixes of source and target strings, then we can compute this matrix elements in each row from left to right, then moving to next row from up to down, getting the distance between the full strings as the last computed value. This translates into the following straightforward code:

let levenshteinV1 (src : string) (target : string) = let min3 a b c = min a (min b c) let (distance : int[,]) = Array2D.zeroCreate (src.Length + 1) (target.Length + 1) for i = 1 to src.Length do distance.[i, 0] <- i for j = 1 to target.Length do distance.[0, j] <- j for j = 1 to target.Length do for i = 1 to src.Length do distance.[i, j] <- min3 (distance.[i-1, j] + 1) (distance.[i, j-1] + 1) (distance.[i-1, j-1] + if src.[i - 1] = target.[j - 1] then 0 else 1) distance.[src.Length, target.Length]

Playing with this version gives us:

levenshteinV1 "food" "money"

val it : int = 4

levenshteinV1 "Saturday" "Sunday"

val it : int = 3

levenshteinV1 "kitty" "sitting"

val it : int = 4

levenshteinV1 "algorithm" "altruistic"

val it : int = 6

Exercising it for performance yields the following baseline:

let longString = String.replicate 1000 "fizzbuzz"

let longString' = String.replicate 1000 "xxzzyyzz"

levenshteinV1 longString longString'

Real: 00:00:01.088, CPU: 00:00:01.093, GC gen0: 0, gen1: 0, gen2: 0

val it : int = 4000

Can we do better? Sure! It is easy to spot in the code above, that calculation of each row of distances depends only on the previous row, so we may get rid of dealing with a matrix. Throwing in this and couple of other minor improvements the refactored version follows:

let levenshteinV2 (src: string) (target: string) = let min3 a b c = min a (min b c) let m,n = src.Length, target.Length let prev = Array.init (n+1) id let nxt = Array.zeroCreate (n+1) for i in 1..m do nxt.[0] <- i for j in 1..n do if src.[i - 1] = target.[j - 1] then nxt.[j] <- prev.[j - 1] else nxt.[j] <- min3 (prev.[j] + 1) (nxt.[j - 1] + 1) (prev.[j - 1] + 1) Array.blit nxt 0 prev 0 (n+1) nxt.[n]

It’s time profiling yields:

let longString = String.replicate 1000 "fizzbuzz"

let longString' = String.replicate 1000 "xxzzyyzz"

levenshteinV2 longString longString'

Real: 00:00:00.283, CPU: 00:00:00.296, GC gen0: 0, gen1: 0, gen2: 0

val it : int = 4000

Almost **4x** time improvement!

But how about *pure functional* code of the same? Comes out this is doable too:

let levenshteinV3 (src: string) (target: string) = let min3 a b c = min a (min b c) let targetAsList = target.ToCharArray() |> Array.toList let rec distance run (srcTail: char list) (edits: int list) = match srcTail with | [] -> edits.Item (edits.Length - 1) | h::t -> List.fold2 (fun (newList,prev) c o -> ((min3 (o + 1) (newList.Head + 1) (if h = c then prev else prev + 1))::newList, o)) ([run], edits.Head) targetAsList edits.Tail |> fst |> List.rev |> distance (run + 1) t distance 0 (src.ToCharArray() |> Array.toList) [ for i in 0..target.Length -> i]

The solution workhorse is inner recursive function **distance **in line #4 that takes characters of the source string one by one and calculates the list of distances based on initial apparent list of distances. Accumulator of **List.fold2** is the new list of distances accompanied by previous element **prev** of old list of distances as the **min3** recurrence dictates.

Unfortunately, this functional purity comes with tremendous performance penalty:

let longString = String.replicate 1000 "fizzbuzz"

let longString' = String.replicate 1000 "xxzzyyzz"

levenshteinV3 longString longString'

Real: 00:00:04.511, CPU: 00:00:04.500, GC gen0: 1001, gen1: 261, gen2: 0

val it : int = 4000

To be continued…

## Trackbacks & Pingbacks