{"id":116,"date":"2008-08-27T23:26:41","date_gmt":"2008-08-27T15:26:41","guid":{"rendered":"http:\/\/blog.lolyco.com\/sean\/?p=116"},"modified":"2009-04-16T11:11:14","modified_gmt":"2009-04-16T03:11:14","slug":"damerau-levenshtein-algorithm-levenshtein-with-transpositions","status":"publish","type":"post","link":"https:\/\/blog.lolyco.com\/sean\/2008\/08\/27\/damerau-levenshtein-algorithm-levenshtein-with-transpositions\/","title":{"rendered":"Damerau-Levenshtein algorithm: Levenshtein with transpositions"},"content":{"rendered":"<p>I&#8217;m still working away slowly at <a title=\"Go to My Spider\" href=\"http:\/\/spider.my\">Spider.my<\/a>, and spotted a funny loop in the <strong>search suggestions<\/strong>:<\/p>\n<p><div id=\"attachment_117\" style=\"width: 310px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2008\/08\/searchtuesday1.jpeg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-117\" class=\"size-medium wp-image-117\" title=\"Search for Teusday &lt;i&gt;[sic]&lt;\/i&gt; - how about 'thursday'?\" src=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2008\/08\/searchtuesday1.jpeg\" alt=\"Search for Teusday - a mis-spelling\" width=\"300\" height=\"151\" \/><\/a><p id=\"caption-attachment-117\" class=\"wp-caption-text\">Search for Teusday - how about Thursday?<\/p><\/div>The helpful hint is &#8220;maybe &#8216;thursday&#8217; would get more results?&#8221;. I&#8217;m using a <a title=\"Levenshtein distance article\" href=\"http:\/\/blog.lolyco.com\/sean\/2008\/08\/06\/search-suggestions-with-mysql-levenshtein-distance\/\">simple Levenshtein distance algorithm<\/a> to provide <strong>hints<\/strong> when only a few results appear. Because spider.my has <strong>very few sites<\/strong> in its database at the moment, nearly every search produces a handy (actually somewhere between funny and annoying!) hint. The <a title=\"Wikipedia's article on Levenshtein distance\" href=\"http:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\">basic Levenshtein algorithm doesn&#8217;t test for transpositions<\/a> &#8211; when 2 <strong>keys<\/strong> are <strong>pressed<\/strong> in the <strong>wrong order<\/strong>. A test like &#8220;levenshtein(&#8216;teusday&#8217;, &#8216;tuesday&#8217;)&#8221; will give an answer of 2, for the two <strong>substitutions<\/strong>. Similarly, &#8220;levenshtein(&#8216;teusday&#8217;, &#8216;thursday&#8217;)&#8221; gives a result of 2, for the substitution of &#8216;e&#8217; with &#8216;h&#8217; and the <strong>insertion<\/strong> of &#8216;r&#8217;. My hint code orders results by <strong>levenshtein distance<\/strong> only, so &#8216;thursday&#8217; appears before &#8216;tuesday&#8217; in the results due to <strong>caprice<\/strong> on the part of <strong>MySQL<\/strong>.<\/p>\n<p>I&#8217;m sure there&#8217;s a better <strong>reason<\/strong> than caprice, but I don&#8217;t often use the word, and I <strong>don&#8217;t know<\/strong> why the order is that way, so &#8230; caprice it is. I could also order my results by <strong>similarity in length<\/strong> to the original keyword. That&#8217;s a <strong>sensible<\/strong> idea that only just struck me, but it didn&#8217;t at the <strong>crucial time<\/strong>, so I wrote a lot of code instead.<\/p>\n<p>Clicking on the helpful hint above starts the circle:<\/p>\n<div id=\"attachment_118\" style=\"width: 310px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2008\/08\/searchtuesday2.jpeg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-118\" class=\"size-medium wp-image-118\" title=\"Not many results for Thursday, how about Tuesday?\" src=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2008\/08\/searchtuesday2.jpeg\" alt=\"Not many results for Tuesday, how about Thursday?\" width=\"300\" height=\"112\" \/><\/a><p id=\"caption-attachment-118\" class=\"wp-caption-text\">Not many results for Thursday, how about Tuesday?<\/p><\/div>\n<p>Again, not many results due to an <strong>empty database<\/strong>, and the hint code excludes suggesting the same word, so the hint sends us back to Tuesday.\u00a0 Tuesday and Thursday have the same Levenshtein distance from each other, so <strong>around<\/strong> we go. This loop isn&#8217;t likely to go away with an <strong>improved<\/strong> distance function: a &#8216;closest&#8217; suggestion for some word will itself suggest the original word <strong>if distance is the only criterion<\/strong>.<\/p>\n<p>The problem on spider.my is a consequence of the near-empty database. If more results were returned from a search, the suggestion would only appear for an <strong>uncommon<\/strong> (or mis-spelt) keyword. If a more common keyword is suggested, no suggestion will be made.<\/p>\n<p>I read on <strong>Wikipedia<\/strong> that the Levenshtein function doesn&#8217;t test for <strong>transpositions<\/strong>, and my original faulty keyword &#8216;teusday&#8217; is a good example: the &#8216;e&#8217; and the &#8216;u&#8217; are <strong>transposed<\/strong> &#8211; they are each in the other&#8217;s position. The Wikipedia article links to another Wikipedia article with an <a title=\"Damerau Levenshtein distance at Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/Damerau-Levenshtein_distance\">algorithm that does include transposition checking &#8211; Damerau Levenshtein<\/a>.<\/p>\n<p>While I was <strong>editing<\/strong> Josh Drew&#8217;s Levenshtein code, <strong>I made a few other changes<\/strong>. Maybe they&#8217;re not to everybody&#8217;s taste, so here are 3 different versions of my <strong>Damerau-Levenshtein function<\/strong> for MySQL based on Josh Drew&#8217;s code:<\/p>\n<p>1. Basic Damerau-Levenshtein distance (damlev(string, string)), minimal changes to original code:<\/p>\n<p><a href=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2008\/08\/damlev.cpp\">Damerau-Levenshtein distance source code<\/a><\/p>\n<p>&#8211; slightly <strong>more time<\/strong> to execute (about 10%?) than levenshtein().<\/p>\n<p>2. Damerau-Levenshtein distance, with <strong>limit<\/strong> (damlev(string, string, int)) to save <strong>computation<\/strong> time on very <strong>dissimilar<\/strong> strings:<\/p>\n<p><a href=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2008\/08\/damlevlim.cpp\">Damerau-Levenshtein distance with limit source code<\/a><\/p>\n<p>&#8211; can <strong>execute<\/strong> in a <strong>fraction<\/strong> of the time (from 50 seconds average down to 13 seconds average in one of my tests) of damlev() with low limits. <strong>Worst case<\/strong> (when limit &gt; string length) <strong>very similar<\/strong> time to damlev(string, string). Best gains on long strings, low limit.<\/p>\n<p>3. Damerau-Levenshtein distance, with limit and <strong>performance hack<\/strong> (damlevlim256(string, string, int) to pre-load row and column starts:<\/p>\n<p><a href=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2008\/08\/damlevlim2561.cpp\">Damerau-Levenshtein distance with limit and 256 hack source code<\/a><\/p>\n<p>&#8211; executes in less than 90% of time of damlevlim in tests. Time saving greater for longer strings, but <strong>not much<\/strong>. 256 ( &#8211; 1) is maximum length of string arguments (<strong>currently DOES NOT TEST for argument length!<\/strong>). The pre-loaded table is 64K in size, because that should be <strong>enough for anybody<\/strong> \ud83d\ude00<\/p>\n<p>For <strong>installation<\/strong> instructions, see my <a title=\"Search suggestions with MYSQL Levenshtein distance\" rel=\"bookmark\" href=\"http:\/\/blog.lolyco.com\/sean\/2008\/08\/06\/search-suggestions-with-mysql-levenshtein-distance\/\">search suggestions with MYSQL Levenshtein distance<\/a> article. <strong>Let me know<\/strong> how you get on with any or all of those, I hope you find them useful. I wrote the Damerau add-ons in <strong>Java<\/strong> first, so <del datetime=\"2008-10-03T09:01:55+00:00\">I can post those up too if anybody is interested<\/del> <span style=\"color: #ff0000;\"><span style=\"text-decoration: line-through;\"><a title=\"Damerau Levenshtein methods in Java\" href=\"http:\/\/blog.lolyco.com\/sean\/2008\/10\/03\/damerau-levenshtein-in-java\/\">here they are<\/a><\/span><\/span>. <a title=\"Damerau Levenshtein MySQL UDF \/ Java \" href=\"http:\/\/blog.lolyco.com\/sean\/damerau-levenshtein\/\">Now on the Damerau Levenshtein page<\/a><\/p>\n<p>Just for comparison:<\/p>\n<p><code>levenshtein('teusday', 'tuesday') = 2<br \/>\nlevenshtein('teusday', 'thursday') = 2<br \/>\ndamlev('teusday', 'tuesday') = 1<br \/>\ndamlev('teusday', 'thursday') = 2<br \/>\ndamlev('tuesday', 'something') = 8<br \/>\ndamlevlim('tuesday', 'something', 3) = 3 (returns early)<br \/>\n<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m still working away slowly at Spider.my, and spotted a funny loop in the search suggestions: The helpful hint is &#8220;maybe &#8216;thursday&#8217; would get more results?&#8221;. I&#8217;m using a simple Levenshtein distance algorithm to provide hints when only a few results appear. Because spider.my has very few sites in its database at the moment, nearly [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[36,3,35],"tags":[39,9,18,105],"class_list":["post-116","post","type-post","status-publish","format-standard","hentry","category-search","category-software","category-spidermy","tag-mysql","tag-open-source","tag-server","tag-software"],"_links":{"self":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/116","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/comments?post=116"}],"version-history":[{"count":11,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/116\/revisions"}],"predecessor-version":[{"id":127,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/116\/revisions\/127"}],"wp:attachment":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/media?parent=116"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/categories?post=116"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/tags?post=116"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}