Damerau-Levenshtein in Java

October 3rd, 2008

When I wrote about the Damerau Levenshtein MySQL UDF (User Defined Function) I explained that I prototyped the functions in Java first. I had a request for the Java code, but when I looked at it, realised that I only used it half way, and then made finishing touches to the C code directly.

Here is my Levenshtein.java, a class containing 3 static methods: lev(String, String) - the standard Levenshtein function, levlim(String, String, int) - an optimised Levenshtein that returns as soon as the distance between the two words grows beyond a limit, and damlev(String, String) - Damerau-Levenshtein distance between the two words.

The class has a main function, so you can run it directly. Besides the limit optimisation on levlim(), there’s no real optimisation done anywhere, so there should be plenty to go at. In particular, a new rectangular array is created on each invocation of any of the methods.

Possibly the most important thing to warn is this: I’ve only used this code for testing ideas, and it has never been used in any application!

So, fill your boots, here is Levenshtein.java.


Share and Enjoy:
  • Digg
  • del.icio.us
  • Fark
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati
  • E-mail this story to a friend!
  • Facebook
  • Furl
  • Mixx
  • Print this article!
  • Sphinn

What not to GET. Limiting what robots will request.

September 8th, 2008

I tested the spider at spider.my a few times recently. It was previously restricted to just a few sites that their respective admins had kindly volunteered. One of the immediate problems I noticed with releasing the spider in the wild was the number of pages I was mangling between storage and subsequent retrieval from the database: a character set problem, more on that later.

A more salient point, especially since I’m at an early stage of implementing spider.my, is the inclusion of pages I have no ability to read. The first example is the Wu language wikipedia homepage - while not devoid of english (interesting, thank you, from an ashamed monoglot!), I have no idea whether I’m indexing this page in any useful way. If the admin of a non-english website saw my spider fetching pages and wanted to check I was doing something useful with them, I have no way of ensuring I’m not making a complete mess of them.

Accept: header field and HTTP 406

So I checked on The Web Robots Pages,where they have some guidelines for robot writers. One of the guidelines is to “Ask for what you want”, and suggests using the HTTP Accept: header field. W3.org says if the Accept header field is specified by the client, and the server has no acceptable response, then it SHOULD send an HTTP 406 “Not Acceptable” response. It appears at first glance that this should solve a couple of my spder’s problems.

My spider also currently checks the ‘Content-Type’ header in responses and closes the socket if the content type is on a blacklist. That’s a bad behaviour, and possibly doesn’t accomplish what I intend it to. Not only is the server having to do the work of preparing the unwanted resource, the resource may be transferred from the origin server to my spider’s server by the underlying network drivers. My spider is almost certainly reading from a locally-buffered copy of the remote resource. Closing the socket without copying the resource may also cause problems for the remote server if the transfer is not complete.

Doesn’t work

I tried the Accept and Accept-Language header fields out on a few sites that I thought it might improve matters on. I was hoping for a HTTP 406 response, but I never got one - every test resulted in a HTTP 200 reply! I was testing this by hand using cURL, so my test is limited to just a few dozen attempts. Most servers gave no indication that they had parsed the Accept header field in the request at all. The few that did responded by returning the unacceptable resource anyway with alternatives specified in a ‘Content-Location‘ header field.

I’m giving up on this one for a while. Next attempt will be to stop URLs entering the crawl queue based on file suffix. I’m in no hurry to implement this - it reminds me of MS-DOS. There are reasonable ways of establishing a resource’s type. Preserving ancient kludges is not high on my agenda.

You don’t want it? Eat it!

The servers that seem to be gaily ignoring the Accept fields are not doing anything particularly wrong, according to w3.org. Note that w3.org says ‘SHOULD’ about the HTTP 406 response. Furthermore, on a Common HTTP Implementation Problems - when negotiation fails page at w3.org, the advice is to provide default or fallback behaviour:

HTTP/1.1 servers are allowed to return responses which are not acceptable

which strikes me as a simple error in the use of the word ‘fails’. In my view, if a client says it will only accept English, and the server only has a French resource, it should say (apologies for my schoolboy French) “je le n’ai pas” and not “vous prendrez ce que je donne”. The former is only the prelude to negotiation, the latter has failed. In my view, the HTTP 406 response would allow the client to attempt a compromise or give up.

Maybe I’m being pedantic, but as far as I can tell, Accept and Accept-Language DO NOT WORK in the way robotstxt.org suggests they should. I don’t think it’s a problem with robotstxt.org’s interpretation, I think w3.org got the definition right and the implementation guidelines wrong. Of course, that’s the robot-writer’s view. A more strict application of the definition would probably result in lots of websites being rendered incorrectly in browsers.

As ever, I’m interested in your comments. And particulary, the URL of a server that will return 406 responses, I would just like to see one!


Share and Enjoy:
  • Digg
  • del.icio.us
  • Fark
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati
  • E-mail this story to a friend!
  • Facebook
  • Furl
  • Mixx
  • Print this article!
  • Sphinn

Damerau-Levenshtein algorithm: Levenshtein with transpositions

August 27th, 2008

I’m still working away slowly at Spider.my, and spotted a funny loop in the search suggestions:

Search for Teusday - a mis-spelling

Search for Teusday - how about Thursday?

The helpful hint is “maybe ‘thursday’ would get more results?”. I’m using a simple Levenshtein distance algorithm to provide hints when only a few results appear. Because spider.my has very few sites in its database at the moment, nearly every search produces a handy (actually somewhere between funny and annoying!) hint. The basic Levenshtein algorithm doesn’t test for transpositions - when 2 keys are pressed in the wrong order. A test like “levenshtein(’teusday’, ‘tuesday’)” will give an answer of 2, for the two substitutions. Similarly, “levenshtein(’teusday’, ‘thursday’)” gives a result of 2, for the substitution of ‘e’ with ‘h’ and the insertion of ‘r’. My hint code orders results by levenshtein distance only, so ‘thursday’ appears before ‘tuesday’ in the results due to caprice on the part of MySQL.

I’m sure there’s a better reason than caprice, but I don’t often use the word, and I don’t know why the order is that way, so … caprice it is. I could also order my results by similarity in length to the original keyword. That’s a sensible idea that only just struck me, but it didn’t at the crucial time, so I wrote a lot of code instead.

Clicking on the helpful hint above starts the circle:

Not many results for Tuesday, how about Thursday?

Not many results for Thursday, how about Tuesday?

Again, not many results due to an empty database, and the hint code excludes suggesting the same word, so the hint sends us back to Tuesday.  Tuesday and Thursday have the same Levenshtein distance from each other, so around we go. This loop isn’t likely to go away with an improved distance function: a ‘closest’ suggestion for some word will itself suggest the original word if distance is the only criterion.

The problem on spider.my is a consequence of the near-empty database. If more results were returned from a search, the suggestion would only appear for an uncommon (or mis-spelt) keyword. If a more common keyword is suggested, no suggestion will be made.

I read on Wikipedia that the Levenshtein function doesn’t test for transpositions, and my original faulty keyword ‘teusday’ is a good example: the ‘e’ and the ‘u’ are transposed - they are each in the other’s position. The Wikipedia article links to another Wikipedia article with an algorithm that does include transposition checking - Damerau Levenshtein.

While I was editing Josh Drew’s Levenshtein code, I made a few other changes. Maybe they’re not to everybody’s taste, so here are 3 different versions of my Damerau-Levenshtein function for MySQL based on Josh Drew’s code:

1. Basic Damerau-Levenshtein distance (damlev(string, string)), minimal changes to original code:

Damerau-Levenshtein distance source code

- slightly more time to execute (about 10%?) than levenshtein().

2. Damerau-Levenshtein distance, with limit (damlev(string, string, int)) to save computation time on very dissimilar strings:

Damerau-Levenshtein distance with limit source code

- can execute in a fraction of the time (from 50 seconds average down to 13 seconds average in one of my tests) of damlev() with low limits. Worst case (when limit > string length) very similar time to damlev(string, string). Best gains on long strings, low limit.

3. Damerau-Levenshtein distance, with limit and performance hack (damlevlim256(string, string, int) to pre-load row and column starts:

Damerau-Levenshtein distance with limit and 256 hack source code

- executes in less than 90% of time of damlevlim in tests. Time saving greater for longer strings, but not much. 256 ( - 1) is maximum length of string arguments (currently DOES NOT TEST for argument length!). The pre-loaded table is 64K in size, because that should be enough for anybody :D

For installation instructions, see my search suggestions with MYSQL Levenshtein distance article. Let me know how you get on with any or all of those, I hope you find them useful. I wrote the Damerau add-ons in Java first, so I can post those up too if anybody is interested here they are.

Just for comparison:

levenshtein('teusday', 'tuesday') = 2
levenshtein('teusday', 'thursday') = 2
damlev('teusday', 'tuesday') = 1
damlev('teusday', 'thursday') = 2
damlev('tuesday', 'something') = 8
damlevlim('tuesday', 'something', 3) = 3 (returns early)


Share and Enjoy:
  • Digg
  • del.icio.us
  • Fark
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati
  • E-mail this story to a friend!
  • Facebook
  • Furl
  • Mixx
  • Print this article!
  • Sphinn

osCommerce shipping modules: Pos Laju, Pos Malaysia Air Parcel

August 24th, 2008

osCommerce

Just updated these two shipping modules to fix a problem introduced by Pos Malaysia - a change of URL for the postage rates calculator. No change to the calculator, just to the URL. If you’re Pos Malaysia’s web programmer (or you know Pos Malaysia’s web programmer) can you do me a favour and (ask them to) consider the effect of changes on clients? Thanks. That would be lovely.

Download shipping module poslaju v0.08

Download shipping module posmyair v0.09

If you’re using older versions of these modules - they are now broken, and you’ll have to replace them with these updated ones. I occasionally consider providing a more reliable back end for the Pos Malaysia shipping modules - maybe pos.lolyco.com, but I’d want to charge for doing that. But think about it - you’d get something reliable, without arbitrary business-breaking changes, and with responses from a real person, that would make sense. It wouldn’t cost much, maybe just RM20 a year or something. It wouldn’t cost me much either. And then I could offer a shipping module that I don’t have to fix every few months because somebody makes random useless changes to the backend. I could even easily make it faster and more reliable than Pos Malaysia’s version! It could be so good.

Enough ranting, enjoy the new modules!


Share and Enjoy:
  • Digg
  • del.icio.us
  • Fark
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati
  • E-mail this story to a friend!
  • Facebook
  • Furl
  • Mixx
  • Print this article!
  • Sphinn

Writing a search engine widget for firefox / IE7

August 15th, 2008

I had been wondering how difficult it would be to make a search engine widget for spider.my - there’s so much important stuff yet to be done, but I got distracted by the gloss! The great news is - very easy! You can visit spider.my now and add its search engine to your toolbar in Firefox or IE7. If you’re looking at the spider.my homepage, it’s as easy as it looks in this image:

Adding the Search Engine from Spider.my in Firefox

Adding the Search Engine from Spider.my in Firefox

I used the OpenSearch plugin format to create spider.my’s first search engine plugin. There are others, but Mozilla tells me:

Plugins that use the OpenSearch description syntax are compatible with IE 7 and Firefox. Because of this, they are the recommended format for use on the web.

And it is just as easy to add the Search Engine widget from your site for visitors using IE7:

Adding the spider.my Search Engine in IE7

Adding the spider.my Search Engine in IE7

There are two essential ingredients to advertising a search engine widget on your site: an XML document that describes the Search Engine, and a LINK element in the HTML of the pages you’d like the Search Engine to be provided from.

First ingredient - the XML OpenSearch description file

The file is very simple. If, like me, you’re only interested in providing a basic search widget without ’search suggestions’, you need only create a file like:

<OpenSearchDescription
  xmlns="http://a9.com/-/spec/opensearch/1.1/"
  xmlns:moz="http://www.mozilla.org/2006/browser/search/">
<ShortName>engineName</ShortName>
<Description>engineDescription</Description>
<InputEncoding>inputEncoding</InputEncoding>
<Image width="16" height="16">imageURI</Image>
<Url type="text/html" method="method" template="searchURL">
<moz:SearchForm>searchFormURL</moz:SearchForm>
</OpenSearchDescription>

engineName is My Spider for spider.my

engineDescription is My Spider (spider.my) - although I have yet to see this appear anywhere!

inputEncoding is for you to decide. A popular encoding used on the backend of spider.my is UTF-8, so that’s what I use for the Search Engine widget.

imageURI is interesting - I just used a plain old URL to locate the favicon for spider.my, but you can use a data: URI too. The mozilla page includes a link to the data: URI kitchen

method for spider.my is GET and searchURL is the URL your browser sends when you enter some text in the spider.my search page and click ‘Ask My Spider’. To mark the spot where a user’s search input goes, use ‘{searchTerms}’. The searchURL for spider.my is http://spider.my/Search?q={searchTerms}

If it’s really not obvious what to replace with what, compare this template with the actual XML file for spider.my

Second ingredient - telling the user’s browser the widget exists

To notify your users’ browsers, simply add a LINK element to the pages you’d like them to pick up the Search engine from. As I’m only testing out ideas on spider.my, I put the LINK element on the homepage only. I might at a later date add the LINK to all the pages at spider.my (or specifically the Search pages - who knows?).

Here’s the LINK element from spider.my’s homepage:

<link
  rel="search" type="application/opensearchdescription+xml"
  title="My Spider" href="/static/tools/SE.xml">

You need only change the title and href attributes for your own installation. The href tells your visitors browser where you put that XML file you created. That’s it! Next time you load the page from your site, you’ll be able to add your site’s Search Engine to Firefox and IE7 with a simple click!

There are more options that can be added to the search engine, but it seems they’re better supported, or even unique to Firefox. Since I’m aiming spider.my squarely at Web 0.2, I won’t be doing any of that new-fangled stuff.


Share and Enjoy:
  • Digg
  • del.icio.us
  • Fark
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati
  • E-mail this story to a friend!
  • Facebook
  • Furl
  • Mixx
  • Print this article!
  • Sphinn