Damerau Levenshtein
Damerau-Levenshtein is an edit-distance / optimal alignment function. See the wikipedia page on Damerau-Levenshtein for background. It is similar to the Levenshtein function but treats transpositions (swapped letters) as one mistake instead of two:
levenshtein('hello', 'hlelo') = 2
damlev('hello', 'hlelo') = 1
Latest version 2009Nov30 is almost a complete rewrite of the Java code.
Damerau Levenshtein (damlev) for MySQL
Damerau Levenshtein – Levenshtein.java
The Java Levenshtein class (offering also Damerau-Levenshtein methods, perhaps the class name could be better?) is now accessible from the Firtl.com sandpit API as JSON and as XML. The HTML controls below use the API method to compute the difference between two strings:
You can see the Damerau Levenshtein API method more easily if your browser can display XML. Here’s an example: http://www.firtl.com/sandpit/api/string-diff.xml?s1=hello&s2=hlelo&method=damlev
16 Responses to “Damerau Levenshtein”
By tom on Oct 25, 2011 | Reply
wrong algorithm. the damerau levenshtein distance of CA ABC is 2 (swap CA -> AC and insert B after A). your algorithm gives 3.
By Sean on Oct 27, 2011 | Reply
This was pointed out on an older post here – see Alexander’s comments at the bottom:
http://blog.lolyco.com/sean/2008/08/27/damerau-levenshtein-algorithm-levenshtein-with-transpositions/
I still haven’t found the time to read the original article to reassure myself *exactly* what the Damerau-Levenshtein string distance is, so I haven’t changed the code to match the changing Wikipedia article. Thanks for reminding me though! I will try to find the time to do that reading and post an update of some kind.
By Zarathustra on Mar 5, 2012 | Reply
You have an issue in the Levenstein algorithm.
let String A be: вічне коло (eternal circle)
and String B : eternal circle
Your algorithm will return 21 but it is 13.
By Sean on Mar 5, 2012 | Reply
I see 13 – did you try constructing an URL for the XML API service? It should copy your inputs to the XML response. Perhaps it will show an encoding problem or something. I’ll have a look at the logs later…
By Sean on Mar 5, 2012 | Reply
I see responses in the log for ’21’ but perhaps you’ve been caught out by pasting into the short input boxes and couldn’t see whole input? The first string in those requests is “compare вічне коло (eternal circle)”. Here’s a one of the request URLs from your IP address:
http://www.shipping-quote.net/api/string-diff.json?jsonp=sqNetShowStringDiff&s1=compare%20%D0%B2%D1%96%D1%87%D0%BD%D0%B5%20%D0%BA%D0%BE%D0%BB%D0%BE%20%28eternal%20circle%29&s2=eternal%20circle&method=lev
By hashi101 on May 2, 2012 | Reply
Hi
What about unicode characters? If we compare:
damlev(‘ół,’ol’);
How to develop your code to recognizing these string as identical? For “ó Ó ł Ł” and others: “ą Ą ć Ć ę Ę ś Ś n Ń ź Ź ż Ż”
By hashi101 on May 2, 2012 | Reply
And at now it is little strange coz:
levenshtein(“ś”,”s”) return 2 value, but there is one operation needed.
By Sean on May 2, 2012 | Reply
Thanks for the report – I’ll look into it and also into the idea of comparing accented characters, though I suspect that’s not going to be trivial.
The Java code returns 1 as expected for the ‘s’ test. Were you using the MySQL functions? The MySQL functions are well overdue for a code review: the Java code is in better shape. I don’t have MySQL installed at the moment, but I could possibly bring myself to do that review if someone expressed an interest.
I don’t think it’s appropriate for these functions to equate ś and s. If you search for “unicode to ascii transliteration”, it’s a popular problem that doesn’t have a single good solution. I’m sure it would be better to apply your favoured transliteration first and subsequently apply your string comparison.
By hashi101 on May 3, 2012 | Reply
I downloaded libs again and result for levenshtein(“ś”,”s”) in MySql is still 2, but in java example – return 1.
By Sean on May 3, 2012 | Reply
The error is due to the naïve MySQL implementation: on Ubuntu the ‘Character Map’ application tells me “ś” is “C octal escaped UTF-8: \305\233”. The code uses a char pointer on the input strings, so it’s comparing byte-by-byte and there are two bytes.
There’s zero chance of me making a character-based version of this code in the next two weeks, but I may look at it after that. If you were to apply -to-ascii transliteration to your input strings first, this problem would be hidden.
By hashi101 on May 3, 2012 | Reply
Yep. If I’ll apply transliteration to my mySql query, problem would be hidden. I wanna do this. But MySql havent functions for that.
Sample code in PHP:
$str = “ąćś”;
$convertedStr = somePhpFunctionToRemoveAccents($str); //”acs”
mysql_query(“SELECT name FROM table WHERE levenshtein(name,$str)”);
As you can see, transliteration is possible for right value in levenshtein function, becouse right value is generated in PHP. There is no problem. But left value is from database. So this need some function from MySQL. CAST and CONVERT seems to be good for that, but result is ? characters where were accented. There is second option but it may be too slow for big database:
REPLACE( REPLACE(name, “ą”, “a”), “ć”, “c” )… etc.
and look horrible. So if you will have some time after these two weeks, I would be grateful for that plugin.
By Sean on May 3, 2012 | Reply
Sorry Hashi, I doubt I’ll be doing anything more than a code cleanup and some regression tests. It might be a good idea for your project to read some of articles written by others about transliteration – it’s far from straightforward, particularly when you consider the ‘normalisation’ process in there too.
I have had a quick look into the problem, and even character-based levenshtein in a MySQL UDF doesn’t seem easy, as the UDF (as far as I can see) is not aware of the incoming strings’ encoding. I could write a UTF-8-specific function (I suppose it is ASCII-specific as it is), but that won’t solve all the problems that transliteration and normalisation would.
A UTF-8 version might happen later this month but at the rate I work on these kinds of things, I have to recommend you don’t hold your breath while waiting.