Java Damerau Levenshtein update

March 30th, 2009 | by Sean |

There’s currently a fairly serious issue with character encoding at spider.my – the knock-on effect of which is far too many entries in the keyword table. The MySQL Damerau Levenshtein UDF I wrote is fast, but the disks in my database server are not. A spelling suggestion was taking several seconds to compute – far too slow for an interactive application.

I wrote a quick Java RMI server to load all the keywords into memory, and used the java code I wrote (but never previously used) to scan through for the best match. It’s still far from ideal, but seems to execute in less than 10% of the time the database takes to do the same thing. This isn’t an example of Java’s performance supremacy over C++, but the advantage of writing a dedicated application to do a single task.

The reason I wrote the Java code was to help understand the original C code the UDFs were based on. When I came to use them to get the spelling suggestions back online, I realised there was a fairly serious bug in them. Because of a transposition of string lengths, an example such as damlevlim(“speling”, “spelling”, 3) was returning 3 instead of 1. That bug is now squashed.

If you want to test the code, it’s now the Java damlevlim(…) method that’s providing the spelling suggestions at spider.my

Here is a class file with the static methods in, there’s no license expressed or implied on these, so take the method you want and embed it in your code, after testing it with the main() method provided.

Java Levenshtein / Damerau-Levenshtein methods

2009 Apr 16 update – see the Damerau Levenshtein page for latest version

  1. 1 Trackback(s)

  2. Mar 30, 2009: Sean’s blog » Blog Archive » Damerau-Levenshtein in Java

Post a Comment