{"id":622,"date":"2009-11-30T14:51:26","date_gmt":"2009-11-30T06:51:26","guid":{"rendered":"http:\/\/blog.lolyco.com\/sean\/?p=622"},"modified":"2011-08-17T18:56:15","modified_gmt":"2011-08-17T10:56:15","slug":"damerau-levenshtein-java-methods-updated","status":"publish","type":"post","link":"https:\/\/blog.lolyco.com\/sean\/2009\/11\/30\/damerau-levenshtein-java-methods-updated\/","title":{"rendered":"Damerau-Levenshtein Java methods updated"},"content":{"rendered":"<p>I had a long-overdue review of the <a title=\"Damerau Levenshtein demo at spider.my\" href=\"http:\/\/spider.my\/damerau-levenshtein.html\">Damerau-Levenshtein Java methods<\/a> I wrote, extended the regression tests that are included in the class, and discovered a few bugs still hiding in the code. I&#8217;ve pretty much re-written the methods so that they&#8217;re much easier to read, are much closer in structure to the <a title=\"Damerau Levenshtein pseudo code at Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/Damerau%E2%80%93Levenshtein_distance\">pseudo-code posted at wikipedia<\/a>, and also perform slightly better.<\/p>\n<div id=\"attachment_626\" style=\"width: 310px\" class=\"wp-caption alignright\"><a href=\"http:\/\/blog.lolyco.com\/sean\/damerau-levenshtein\/\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-626\" class=\"size-full wp-image-626\" title=\"Damerau Levenshtein demo at Spider.my (now at this blog using JSONP)\" src=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2009\/11\/damlevdemo.jpeg\" alt=\"Damerau Levenshtein demo at Spider.my (now at this blog using JSONP)\" width=\"300\" height=\"284\" \/><\/a><p id=\"caption-attachment-626\" class=\"wp-caption-text\">Damerau Levenshtein demo at Spider.my (now at this blog using JSONP)<\/p><\/div>\n<p>The function specifications have changed slightly, with an overloaded version that provides exactly the same interface and behaviour as before, at a slight performance penalty owing to two extra method invocations. The methods previously allocated their own workspace (an int array) for tracking the cost of edits, but now expect a pre-allocated workspace to be passed to them as a parameter. Overloaded methods having the old style of parameters now invoke a class method to allocate a new workspace, and pass the int array along with the supplied parameters to the new methods.<\/p>\n<p>For one-off invocation of the methods, the change\u00a0 introduces a small performance hit. For repeated invocations, I suggest your calling code allocate the workspace, and reuse it in each call to the edit-distance methods:<\/p>\n<p><code>int[] aResult = new int[candidateArray.length];<br \/>\nint[] workspace<br \/>\n&nbsp;&nbsp;= Levenshtein.getWorkspace(matchString.length()<br \/>\n&nbsp;&nbsp;, longestCandidate.length());<br \/>\nint i\u00a0 = 0;<br \/>\nfor (String candidate : candidateArray)<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;candidateArray[i++]<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;= damlev(matchString, candidate, workspace);<br \/>\n<\/code><br \/>\nFrom tests here, you should expect a 6-10% performance boost &#8211; using a pre-allocated workspace &#8211; on unlimited (whole strings are tested) invocations of the functions. If you&#8217;re using the limited (levlim, damlevlim) versions, the speedup can be enormous, depending on how poorly matched your candidate strings are. I haven&#8217;t tested with a &#8216;real-world&#8217; data-set yet, but a worst-case &#8216;dumb test&#8217; of two totally different strings gives average timings of down to 16 nanoseconds for the pre-allocated workspace, 4,500 for the locally allocated workspace. If your workload consists of very-closely matched strings, you&#8217;ll get a performance boost closer to the 6-10% above.<\/p>\n<p>You can try out the methods at <a title=\"Damerau Levenshtein demo at spider.my\" href=\"http:\/\/spider.my\/damerau-levenshtein.html\">spider.my&#8217;s Damerau Levenshtein demo page<\/a>, and download the code from there too.<\/p>\n<p>The MySQL UDFs are probably badly in need of a code review, and I never did write a proper set of test cases for them, so they&#8217;re next. I&#8217;m not actually using them in any projects at the moment, so they&#8217;re not next in terms of priority. The first person to state a preference for recently-maintained MySQL UDFs get&#8217;s a free code review and regression tests.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I had a long-overdue review of the Damerau-Levenshtein Java methods I wrote, extended the regression tests that are included in the class, and discovered a few bugs still hiding in the code. I&#8217;ve pretty much re-written the methods so that they&#8217;re much easier to read, are much closer in structure to the pseudo-code posted at [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-622","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/622","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/comments?post=622"}],"version-history":[{"count":9,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/622\/revisions"}],"predecessor-version":[{"id":1312,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/622\/revisions\/1312"}],"wp:attachment":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/media?parent=622"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/categories?post=622"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/tags?post=622"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}