Damerau-Levenshtein algorithm: Levenshtein with transpositions
August 27th, 2008 | by Sean |I’m still working away slowly at Spider.my, and spotted a funny loop in the search suggestions:
The helpful hint is “maybe ‘thursday’ would get more results?”. I’m using a simple Levenshtein distance algorithm to provide hints when only a few results appear. Because spider.my has very few sites in its database at the moment, nearly every search produces a handy (actually somewhere between funny and annoying!) hint. The basic Levenshtein algorithm doesn’t test for transpositions – when 2 keys are pressed in the wrong order. A test like “levenshtein(‘teusday’, ‘tuesday’)” will give an answer of 2, for the two substitutions. Similarly, “levenshtein(‘teusday’, ‘thursday’)” gives a result of 2, for the substitution of ‘e’ with ‘h’ and the insertion of ‘r’. My hint code orders results by levenshtein distance only, so ‘thursday’ appears before ‘tuesday’ in the results due to caprice on the part of MySQL.I’m sure there’s a better reason than caprice, but I don’t often use the word, and I don’t know why the order is that way, so … caprice it is. I could also order my results by similarity in length to the original keyword. That’s a sensible idea that only just struck me, but it didn’t at the crucial time, so I wrote a lot of code instead.
Clicking on the helpful hint above starts the circle:
Again, not many results due to an empty database, and the hint code excludes suggesting the same word, so the hint sends us back to Tuesday. Tuesday and Thursday have the same Levenshtein distance from each other, so around we go. This loop isn’t likely to go away with an improved distance function: a ‘closest’ suggestion for some word will itself suggest the original word if distance is the only criterion.
The problem on spider.my is a consequence of the near-empty database. If more results were returned from a search, the suggestion would only appear for an uncommon (or mis-spelt) keyword. If a more common keyword is suggested, no suggestion will be made.
I read on Wikipedia that the Levenshtein function doesn’t test for transpositions, and my original faulty keyword ‘teusday’ is a good example: the ‘e’ and the ‘u’ are transposed – they are each in the other’s position. The Wikipedia article links to another Wikipedia article with an algorithm that does include transposition checking – Damerau Levenshtein.
While I was editing Josh Drew’s Levenshtein code, I made a few other changes. Maybe they’re not to everybody’s taste, so here are 3 different versions of my Damerau-Levenshtein function for MySQL based on Josh Drew’s code:
1. Basic Damerau-Levenshtein distance (damlev(string, string)), minimal changes to original code:
Damerau-Levenshtein distance source code
– slightly more time to execute (about 10%?) than levenshtein().
2. Damerau-Levenshtein distance, with limit (damlev(string, string, int)) to save computation time on very dissimilar strings:
Damerau-Levenshtein distance with limit source code
– can execute in a fraction of the time (from 50 seconds average down to 13 seconds average in one of my tests) of damlev() with low limits. Worst case (when limit > string length) very similar time to damlev(string, string). Best gains on long strings, low limit.
3. Damerau-Levenshtein distance, with limit and performance hack (damlevlim256(string, string, int) to pre-load row and column starts:
Damerau-Levenshtein distance with limit and 256 hack source code
– executes in less than 90% of time of damlevlim in tests. Time saving greater for longer strings, but not much. 256 ( – 1) is maximum length of string arguments (currently DOES NOT TEST for argument length!). The pre-loaded table is 64K in size, because that should be enough for anybody 😀
For installation instructions, see my search suggestions with MYSQL Levenshtein distance article. Let me know how you get on with any or all of those, I hope you find them useful. I wrote the Damerau add-ons in Java first, so I can post those up too if anybody is interested here they are. Now on the Damerau Levenshtein page
Just for comparison:
levenshtein('teusday', 'tuesday') = 2
levenshtein('teusday', 'thursday') = 2
damlev('teusday', 'tuesday') = 1
damlev('teusday', 'thursday') = 2
damlev('tuesday', 'something') = 8
damlevlim('tuesday', 'something', 3) = 3 (returns early)
26 Responses to “Damerau-Levenshtein algorithm: Levenshtein with transpositions”
By J.L. on Aug 30, 2008 | Reply
Thanks for this. I found this on Google just last night as I was looking for something a little better than the standard Levenshtein udf, and came across this.
It works really well. I particularly appreciate the limit argument, very useful.
By Jon on Sep 17, 2008 | Reply
Very useful indeed, thanks for posting.
By David Nygaard on Oct 1, 2008 | Reply
Hi, first of all, great site, really helpfull!
So thanks for that!
you wrote that you also had the code for Damerau in Java, which im really interested in so could you post that or email it to me? I would owe you a cold beer or a hot cup of coffee at the very least.
Thanks
David
By Sean on Oct 3, 2008 | Reply
Completely untested, for reference only Java code posted here:
http://blog.lolyco.com/sean/2008/10/03/damerau-levenshtein-in-java/
Enjoy!
By Gábor on Apr 9, 2009 | Reply
Thank you very much, very useful!
For anyone interested, I managed to compile these UDFs on Mac OS X v.10.5.6, for MySQL 5.0.51b, even though the includes came from MySQL 4.1.22.
Anyway, the steps were:
1. gcc -c -o libdamlev.o -I /opt/local/var/macports/software/mysql4/4.1.22_0/opt/local/include/mysql damlev.cpp
2. g++ -dynamiclib -std=gnu99 libdamlev.o -current_version 1.1 -compatibility_version 1.1 -o libdamlev.dylib
3. Then, with MySQL already running:
sudo cp libdamlev.dylib /usr/lib/libdamlev.dylib
4. In MySQL
CREATE FUNCTION damlev RETURNS INT SONAME ‘libdamlev.dylib’;
5. Test it in MySQL
select damlev(‘privacy’,’rpivacy’);
Which gives the value: 1
That’s it. (Got help from: http://forums.mysql.com/read.php?118,50950,66155#msg-66155 )
By Tomas on Apr 16, 2009 | Reply
Hi, i run into a strange behaviour of the damlevlim function and wanted to share it with you because i havent been able to find the solution.
if i call damlevlim(“hello”,”h”,2) i get result 2 which is the expected result.
but when the second string is longer than the first one i get this strange behaviour:
if i call damlevlim(“h”,”hello”,1) i get 4
if i call damlevlim(“h”,”hello”,2) i get 4
if i call damlevlim(“h”,”hello”,3) i get 4
the limit is not applied.
have you expirienced this? do youo know a way to fix this?
By Sean on Apr 16, 2009 | Reply
Hey Tomas, thanks for reporting. I hadn’t experienced that bug, because I always apply the damlev UDFs to strings whose lengths are within 1 character of the 1st argument! It was quite obvious where I’d gone wrong in the code when I looked. I’ve posted new versions on a new Damerau Levenshtein page. Let me know if the bug is squashed, please!
By mykel on May 19, 2009 | Reply
Hi Sean,
thanks for your useful implementation of damlev!
writing the udf is outside of my skills, but its something ive wanted for a while.
i have a request though – is there any way you could make damlev case-insensitive? or instruct me how to do so?
i will follow up this comment in the coming days/weeks.
By Sean on Jun 12, 2009 | Reply
Hello Mykel – it wouldn’t be terribly difficult to make a case-insensitive version of damlev, I expect. I’m not a C++ programmer, so if I was to extend the MySQL UDFs, I’d have to spend a lot of time reading about character encodings to make sure I wasn’t doing anything daft. The Java methods would be easier for me to change, but keeping them as simple as possible seems like a good idea.
When I use damlev in my applications, I take care of case and vastly different string lengths in the arguments before I invoke damlev.
By David on Jun 14, 2009 | Reply
Greetings…
Interesting code, specially the Java one, which allowed me not to program the whole thing again… 🙂
Do you have any licesing requirements for using your class in production code for my employer? Besides attribution, copyright, etc, of course.
thanks a lot!
By Sean on Jun 14, 2009 | Reply
I have no licensing requirements at all David.
I’d be grateful for attribution.
By David on Jun 15, 2009 | Reply
Sean,
Cool. Worry not, if I end using it, I’ll be sure to put in all info about you, your code and blog on the headers, dependencies and documentation…
Thanks and keep it up!
By Björn Törnqvist on Aug 11, 2009 | Reply
Hi Sean,
Thanks for providing your Leveshtein methods.
I believe I found a bug with the damlevlim method. I think it’s easiest described by the following example:
damlevlim(“short”,”shoort”,2);
Result: 2 (should be 1)
damlevlim(“short”,”shrt”,2);
Result: 2 (should be 1)
As you can see the second string has an additional character or is missing a character inside the string.
Can you take a look at it?
Best regards
/Björn Törnqvist
By Sean on Aug 11, 2009 | Reply
Hello Björn, thanks for reporting. I can confirm that error, I’ll look into it this evening and let you know later
Sean
–update 2009 Aug 15 New version with regression fixed (and added to static main method tests) uploaded same day. Forgot to update my comment here!
By Jon on Sep 14, 2009 | Reply
Is the bug fixed in http://spider.my/static/contrib/damlev.zip? I see no differences with my copy of damlev256.cpp..
By Sean on Sep 14, 2009 | Reply
Do you have the bug? I have no regression tests for the MySQL code at at the moment. I must get round to it at some point, but I’ve got very little time at the moment. The bug was a trivial coding error in the Java Levenshtein methods in a part of the code that doesn’t exactly mirror the C++ code in the MySQL functions. I just checked the damlevlim256 on a server here and it doesn’t show the problem Bjorn reported.
If you can send me an example of a bad return value, I’ll check it out as soon as I can.
By Jon on Sep 15, 2009 | Reply
You’re right, it doesn’t show. I have the habit of trying to fix things that are not broken 🙂
By Alexander on Feb 5, 2010 | Reply
Hi Sean,
Thanks for your implementations!
I noticed that damlev(“TO”,”OST”) returns 3, while it should be 2 in this case (the test actually is from Wikipedia article on Damlev distance).
Do you think it can be fixed?
Thanks!
By Sean on Feb 5, 2010 | Reply
Hello Alexander – thanks for your comment. Last time I updated the Java damlev functions (still haven’t got round to doing the same for the MySQL ones), I noticed some counter-intuitive (but not actually ‘wrong’) results. In Wikipedia’s article on Damerau-Levenshtein the example you refer to is used to show a ‘wrinkle’ in the algorithm:
You and I both know the edit distance between ‘TO’ and ‘OST’ is 2 (swap T for O to get OT, insert S), but Damerau Levenshtein works on single operations on the original string. The ‘OT’ is an intermediate result (a substring that has been previously edited). The wikipedia article does mention improved algorithms for what is described as “a proper algorithm to calculate unrestricted Damerau–Levenshtein distance”. I suspect if it gives you a different result to the original algorithm, it possibly isn’t ‘Damerau-Levenshtein’! Maybe one day I’ll get round to having a look at those other algorithms, but there’s absolutely no chance in the next few months.
(You can see the result in question on the damerau-levenshtein page at spider.my)
By Alexander on Feb 7, 2010 | Reply
Sean,
I believe that if we define Damerau-Levenshtein distance as “minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters” (as it is done in Wikipedia, and also in this http://portal.acm.org/citation.cfm?id=356827.356830 article, where the notion seems to be introduced for the first time), then an algorithm calculating this distance should return 2 for (“TO”, “OST”). Otherwise the algorithm is calculating something else 🙂
But that doesn’t actually matter. Sorry for the long comment, and thank you for your answer!