Damerau-Levenshtein algorithm: Levenshtein with transpositions
August 27th, 2008I’m still working away slowly at Spider.my, and spotted a funny loop in the search suggestions:
The helpful hint is “maybe ‘thursday’ would get more results?”. I’m using a simple Levenshtein distance algorithm to provide hints when only a few results appear. Because spider.my has very few sites in its database at the moment, nearly every search produces a handy (actually somewhere between funny and annoying!) hint. The basic Levenshtein algorithm doesn’t test for transpositions – when 2 keys are pressed in the wrong order. A test like “levenshtein(‘teusday’, ‘tuesday’)” will give an answer of 2, for the two substitutions. Similarly, “levenshtein(‘teusday’, ‘thursday’)” gives a result of 2, for the substitution of ‘e’ with ‘h’ and the insertion of ‘r’. My hint code orders results by levenshtein distance only, so ‘thursday’ appears before ‘tuesday’ in the results due to caprice on the part of MySQL.I’m sure there’s a better reason than caprice, but I don’t often use the word, and I don’t know why the order is that way, so … caprice it is. I could also order my results by similarity in length to the original keyword. That’s a sensible idea that only just struck me, but it didn’t at the crucial time, so I wrote a lot of code instead.
Clicking on the helpful hint above starts the circle:
Again, not many results due to an empty database, and the hint code excludes suggesting the same word, so the hint sends us back to Tuesday. Tuesday and Thursday have the same Levenshtein distance from each other, so around we go. This loop isn’t likely to go away with an improved distance function: a ‘closest’ suggestion for some word will itself suggest the original word if distance is the only criterion.
The problem on spider.my is a consequence of the near-empty database. If more results were returned from a search, the suggestion would only appear for an uncommon (or mis-spelt) keyword. If a more common keyword is suggested, no suggestion will be made.
I read on Wikipedia that the Levenshtein function doesn’t test for transpositions, and my original faulty keyword ‘teusday’ is a good example: the ‘e’ and the ‘u’ are transposed – they are each in the other’s position. The Wikipedia article links to another Wikipedia article with an algorithm that does include transposition checking – Damerau Levenshtein.
While I was editing Josh Drew’s Levenshtein code, I made a few other changes. Maybe they’re not to everybody’s taste, so here are 3 different versions of my Damerau-Levenshtein function for MySQL based on Josh Drew’s code:
1. Basic Damerau-Levenshtein distance (damlev(string, string)), minimal changes to original code:
Damerau-Levenshtein distance source code
– slightly more time to execute (about 10%?) than levenshtein().
2. Damerau-Levenshtein distance, with limit (damlev(string, string, int)) to save computation time on very dissimilar strings:
Damerau-Levenshtein distance with limit source code
– can execute in a fraction of the time (from 50 seconds average down to 13 seconds average in one of my tests) of damlev() with low limits. Worst case (when limit > string length) very similar time to damlev(string, string). Best gains on long strings, low limit.
3. Damerau-Levenshtein distance, with limit and performance hack (damlevlim256(string, string, int) to pre-load row and column starts:
Damerau-Levenshtein distance with limit and 256 hack source code
– executes in less than 90% of time of damlevlim in tests. Time saving greater for longer strings, but not much. 256 ( – 1) is maximum length of string arguments (currently DOES NOT TEST for argument length!). The pre-loaded table is 64K in size, because that should be enough for anybody đŸ˜€
For installation instructions, see my search suggestions with MYSQL Levenshtein distance article. Let me know how you get on with any or all of those, I hope you find them useful. I wrote the Damerau add-ons in Java first, so I can post those up too if anybody is interested here they are. Now on the Damerau Levenshtein page
Just for comparison:
levenshtein('teusday', 'tuesday') = 2
levenshtein('teusday', 'thursday') = 2
damlev('teusday', 'tuesday') = 1
damlev('teusday', 'thursday') = 2
damlev('tuesday', 'something') = 8
damlevlim('tuesday', 'something', 3) = 3 (returns early)