Skip to content

If Google would be looking to hire F# programmers – part 7

May 31, 2013

In this installment of this series we turn to string edit distance problem. Edit distance measures number of elementary operations on string characters required to transform one string into the other. Of many possible definitions of edit distance we are going to consider Levenshtein distance: a minimum number of single-character mutations (insert character, delete character, substitute character) required for transformation of source string into the target string.

The following relations hold for Levenshtein distance d between source string s and target string t:

  • d between two empty strings is, apparently, 0
  • d between empty and non-empty string is the length of non-empty string
  • for both non-empty strings if we present them in the form of prefix and the last character each, i.e. sourcep+sc and targetp+tc, then d(sourcep+sc,targetp+tc) = min (
    • d(sourcep,targetp) + if sc = tc then 0 else 1 (whatever it takes to edit sourcep into targetp, then substitute sc by tc, if not equal)
    • d(sourcep+sc,targetp) + 1 (edit sourcep+sc into targetp, then insert tc)
    • d(sourcep,targetp+tc) + 1 (delete sc, then edit sourcep into targetp+tc)
  • )

Although this looks like a case for expressing recurrence via recursion, this is not a good idea for anything, but very short strings as amount of recursive calls on substrings will grow exponentially.

So, as the edit distance is the minimum of edit distances of shorter string(s) this allows to solve the problem with dynamic programming technique. If we represent Levenshtein distances in a matrix holding distances between all prefixes of source and target strings, then we can compute this matrix elements in each row from left to right, then moving to next row from up to down, getting the distance between the full strings as the last computed value. This translates into the following straightforward code:

let levenshteinV1 (src : string) (target : string) =
    let min3 a b c = min a (min b c)
 
    let (distance : int[,]) = Array2D.zeroCreate (src.Length + 1) (target.Length + 1)
 
    for i = 1 to src.Length do distance.[i, 0] <- i
    for j = 1 to target.Length do distance.[0, j] <- j
 
    for j = 1 to target.Length do
        for i = 1 to src.Length do
            distance.[i, j] <-
                min3 (distance.[i-1, j] + 1)
                     (distance.[i, j-1] + 1)
                     (distance.[i-1, j-1] +
                       if src.[i - 1] = target.[j - 1] then 0 else 1)
    distance.[src.Length, target.Length]

Playing with this version gives us:

levenshteinV1 "food" "money"
val it : int = 4
levenshteinV1 "Saturday" "Sunday"
val it : int = 3
levenshteinV1 "kitty" "sitting"
val it : int = 4
levenshteinV1 "algorithm" "altruistic"
val it : int = 6

Exercising it for performance yields the following baseline:

let longString = String.replicate 1000 "fizzbuzz"
let longString' = String.replicate 1000 "xxzzyyzz"
levenshteinV1 longString longString'
Real: 00:00:01.088, CPU: 00:00:01.093, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 4000

Can we do better? Sure! It is easy to spot in the code above, that calculation of each row of distances depends only on the previous row, so we may get rid of dealing with a matrix. Throwing in this and couple of other minor improvements the refactored version follows:

let levenshteinV2 (src: string) (target: string) =
    let min3 a b c = min a (min b c)
    let m,n = src.Length, target.Length
    let prev = Array.init (n+1) id
    let nxt = Array.zeroCreate (n+1)
    for i in 1..m do
        nxt.[0] <- i
        for j in 1..n do
            if src.[i - 1] = target.[j - 1] then
                nxt.[j] <- prev.[j - 1]
            else
                nxt.[j] <- min3 (prev.[j] + 1)
                                (nxt.[j - 1] + 1)
                                (prev.[j - 1] + 1)
        Array.blit nxt 0 prev 0 (n+1)
    nxt.[n]

It’s time profiling yields:

let longString = String.replicate 1000 "fizzbuzz"
let longString' = String.replicate 1000 "xxzzyyzz"
levenshteinV2 longString longString'
Real: 00:00:00.283, CPU: 00:00:00.296, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 4000

Almost 4x time improvement!

But how about pure functional code of the same? Comes out this is doable too:

let levenshteinV3 (src: string) (target: string) =
    let min3 a b c = min a (min b c)
    let targetAsList = target.ToCharArray() |> Array.toList
    let rec distance run (srcTail: char list) (edits: int list) =
        match srcTail with
        | [] -> edits.Item (edits.Length - 1)
        | h::t -> List.fold2 (fun (newList,prev) c o -> 
                        ((min3 (o + 1) (newList.Head + 1) (if h = c then prev else prev + 1))::newList, o))
                        ([run], edits.Head) targetAsList edits.Tail
                        |> fst |> List.rev |> distance (run + 1) t
    distance 0 (src.ToCharArray() |> Array.toList) [ for i in 0..target.Length -> i]

The solution workhorse is inner recursive function distance in line #4 that takes characters of the source string one by one and calculates the list of distances based on initial apparent list of distances. Accumulator of List.fold2 is the new list of distances accompanied by previous element prev of old list of distances as the min3 recurrence dictates.

Unfortunately, this functional purity comes with tremendous performance penalty:


let longString = String.replicate 1000 "fizzbuzz"
let longString' = String.replicate 1000 "xxzzyyzz"
levenshteinV3 longString longString'
Real: 00:00:04.511, CPU: 00:00:04.500, GC gen0: 1001, gen1: 261, gen2: 0
val it : int = 4000

To be continued…

Advertisements

From → Uncategorized

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: