Damerau-Levenshtein algorithm: Levenshtein with transpositions

August 27th, 2008

I’m still working away slowly at Spider.my, and spotted a funny loop in the search suggestions:

Search for Teusday - a mis-spelling

Search for Teusday - how about Thursday?

The helpful hint is “maybe ‘thursday’ would get more results?”. I’m using a simple Levenshtein distance algorithm to provide hints when only a few results appear. Because spider.my has very few sites in its database at the moment, nearly every search produces a handy (actually somewhere between funny and annoying!) hint. The basic Levenshtein algorithm doesn’t test for transpositions – when 2 keys are pressed in the wrong order. A test like “levenshtein(‘teusday’, ‘tuesday’)” will give an answer of 2, for the two substitutions. Similarly, “levenshtein(‘teusday’, ‘thursday’)” gives a result of 2, for the substitution of ‘e’ with ‘h’ and the insertion of ‘r’. My hint code orders results by levenshtein distance only, so ‘thursday’ appears before ‘tuesday’ in the results due to caprice on the part of MySQL.

I’m sure there’s a better reason than caprice, but I don’t often use the word, and I don’t know why the order is that way, so … caprice it is. I could also order my results by similarity in length to the original keyword. That’s a sensible idea that only just struck me, but it didn’t at the crucial time, so I wrote a lot of code instead.

Clicking on the helpful hint above starts the circle:

Not many results for Tuesday, how about Thursday?

Not many results for Thursday, how about Tuesday?

Again, not many results due to an empty database, and the hint code excludes suggesting the same word, so the hint sends us back to Tuesday.  Tuesday and Thursday have the same Levenshtein distance from each other, so around we go. This loop isn’t likely to go away with an improved distance function: a ‘closest’ suggestion for some word will itself suggest the original word if distance is the only criterion.

The problem on spider.my is a consequence of the near-empty database. If more results were returned from a search, the suggestion would only appear for an uncommon (or mis-spelt) keyword. If a more common keyword is suggested, no suggestion will be made.

I read on Wikipedia that the Levenshtein function doesn’t test for transpositions, and my original faulty keyword ‘teusday’ is a good example: the ‘e’ and the ‘u’ are transposed – they are each in the other’s position. The Wikipedia article links to another Wikipedia article with an algorithm that does include transposition checking – Damerau Levenshtein.

While I was editing Josh Drew’s Levenshtein code, I made a few other changes. Maybe they’re not to everybody’s taste, so here are 3 different versions of my Damerau-Levenshtein function for MySQL based on Josh Drew’s code:

1. Basic Damerau-Levenshtein distance (damlev(string, string)), minimal changes to original code:

Damerau-Levenshtein distance source code

– slightly more time to execute (about 10%?) than levenshtein().

2. Damerau-Levenshtein distance, with limit (damlev(string, string, int)) to save computation time on very dissimilar strings:

Damerau-Levenshtein distance with limit source code

– can execute in a fraction of the time (from 50 seconds average down to 13 seconds average in one of my tests) of damlev() with low limits. Worst case (when limit > string length) very similar time to damlev(string, string). Best gains on long strings, low limit.

3. Damerau-Levenshtein distance, with limit and performance hack (damlevlim256(string, string, int) to pre-load row and column starts:

Damerau-Levenshtein distance with limit and 256 hack source code

– executes in less than 90% of time of damlevlim in tests. Time saving greater for longer strings, but not much. 256 ( – 1) is maximum length of string arguments (currently DOES NOT TEST for argument length!). The pre-loaded table is 64K in size, because that should be enough for anybody đŸ˜€

For installation instructions, see my search suggestions with MYSQL Levenshtein distance article. Let me know how you get on with any or all of those, I hope you find them useful. I wrote the Damerau add-ons in Java first, so I can post those up too if anybody is interested here they are. Now on the Damerau Levenshtein page

Just for comparison:

levenshtein('teusday', 'tuesday') = 2
levenshtein('teusday', 'thursday') = 2
damlev('teusday', 'tuesday') = 1
damlev('teusday', 'thursday') = 2
damlev('tuesday', 'something') = 8
damlevlim('tuesday', 'something', 3) = 3 (returns early)

osCommerce shipping modules: Pos Laju, Pos Malaysia Air Parcel

August 24th, 2008

osCommerceMy old-style shipping modules are broken after Pos Malaysia’s last (January 2011) site update. I’m no longer maintaining them, but have replaced them with a single shipping module (now based on an API at shipping-quote.net which is more versatile, faster, more reliable and simpler. See here:

http://blog.lolyco.com/sean/2011/01/19/spider-my-shipping-modules-v0-3-for-oscommerce-zen-cart-pos-malaysia/

Just updated these two shipping modules to fix a problem introduced by Pos Malaysia – a change of URL for the postage rates calculator. No change to the calculator, just to the URL. If you’re Pos Malaysia’s web programmer (or you know Pos Malaysia’s web programmer) can you do me a favour and (ask them to) consider the effect of changes on clients? Thanks. That would be lovely.

Download shipping module poslaju v0.08

Download shipping module posmyair v0.09

If you’re using older versions of these modules – they are now broken, and you’ll have to replace them with these updated ones. I occasionally consider providing a more reliable back end for the Pos Malaysia shipping modules – maybe pos.lolyco.com, but I’d want to charge for doing that. But think about it – you’d get something reliable, without arbitrary business-breaking changes, and with responses from a real person, that would make sense. It wouldn’t cost much, maybe just RM20 a year or something. It wouldn’t cost me much either. And then I could offer a shipping module that I don’t have to fix every few months because somebody makes random useless changes to the backend. I could even easily make it faster and more reliable than Pos Malaysia’s version! It could be so good.

Enough ranting, enjoy the new modules!

Writing a search engine widget for firefox / IE7

August 15th, 2008

I had been wondering how difficult it would be to make a search engine widget for spider.my – there’s so much important stuff yet to be done, but I got distracted by the gloss! The great news is – very easy! You can visit spider.my now and add its search engine to your toolbar in Firefox or IE7. If you’re looking at the spider.my homepage, it’s as easy as it looks in this image:

Adding the Search Engine from Spider.my in Firefox

Adding the Search Engine from Spider.my in Firefox

I used the OpenSearch plugin format to create spider.my’s first search engine plugin. There are others, but Mozilla tells me:

Plugins that use the OpenSearch description syntax are compatible with IE 7 and Firefox. Because of this, they are the recommended format for use on the web.

And it is just as easy to add the Search Engine widget from your site for visitors using IE7:

Adding the spider.my Search Engine in IE7

Adding the spider.my Search Engine in IE7

There are two essential ingredients to advertising a search engine widget on your site: an XML document that describes the Search Engine, and a LINK element in the HTML of the pages you’d like the Search Engine to be provided from.

First ingredient – the XML OpenSearch description file

The file is very simple. If, like me, you’re only interested in providing a basic search widget without ‘search suggestions’, you need only create a file like:

<OpenSearchDescription
  xmlns="http://a9.com/-/spec/opensearch/1.1/"
  xmlns:moz="http://www.mozilla.org/2006/browser/search/">
<ShortName>engineName</ShortName>
<Description>engineDescription</Description>
<InputEncoding>inputEncoding</InputEncoding>
<Image width="16" height="16">imageURI</Image>
<Url type="text/html" method="method" template="searchURL">
<moz:SearchForm>searchFormURL</moz:SearchForm>
</OpenSearchDescription>

engineName is My Spider for spider.my

engineDescription is My Spider (spider.my) – although I have yet to see this appear anywhere!

inputEncoding is for you to decide. A popular encoding used on the backend of spider.my is UTF-8, so that’s what I use for the Search Engine widget.

imageURI is interesting – I just used a plain old URL to locate the favicon for spider.my, but you can use a data: URI too. The mozilla page includes a link to the data: URI kitchen

method for spider.my is GET and searchURL is the URL your browser sends when you enter some text in the spider.my search page and click ‘Ask My Spider’. To mark the spot where a user’s search input goes, use ‘{searchTerms}’. The searchURL for spider.my is http://spider.my/Search?q={searchTerms}

If it’s really not obvious what to replace with what, compare this template with the actual XML file for spider.my

Second ingredient – telling the user’s browser the widget exists

To notify your users’ browsers, simply add a LINK element to the pages you’d like them to pick up the Search engine from. As I’m only testing out ideas on spider.my, I put the LINK element on the homepage only. I might at a later date add the LINK to all the pages at spider.my (or specifically the Search pages – who knows?).

Here’s the LINK element from spider.my’s homepage:

<link
  rel="search" type="application/opensearchdescription+xml"
  title="My Spider" href="/static/tools/SE.xml">

You need only change the title and href attributes for your own installation. The href tells your visitors browser where you put that XML file you created. That’s it! Next time you load the page from your site, you’ll be able to add your site’s Search Engine to Firefox and IE7 with a simple click!

There are more options that can be added to the search engine, but it seems they’re better supported, or even unique to Firefox. Since I’m aiming spider.my squarely at Web 0.2, I won’t be doing any of that new-fangled stuff.

Search suggestions with MYSQL Levenshtein distance

August 6th, 2008

Up until a couple of days ago, the Spider.my search page applied only minimal formatting to hits, and no extra page decoration. The ‘blank screen of death‘ appeared for a search miss: no results, no page content.

Just to reassure the user the empty page was ‘normal’, I’d put a lame message up saying “No results, try something like ‘uk sun‘”. As soon as I’d made the recent searches page and saw some Bahasa Malaysia searches on there, I realised that must look terrible!

I recalled the SOUNDEX() function from donkey’s years ago, but quickly gave up on using that. SOUNDEX() produces a code that can be used to match strings phonetically, but a simple spelling mistake can produce a large difference in SOUNDEX() result. In addition, two very different words may share the same SOUNDEX() result, for example: ‘perhaps’ and ‘privacy’.

A quick search turned up Levenshtein Distance, also know as ‘edit distance’. This is a measure of the minimum edits that must be made to one word to change it into another. You can think of it as the number of mistakes needed to turn what you intended to type into what you actually typed! Thought of this way, it seems to be a good candidate for search key suggestions.

Spider.my uses a MySQL backend, and there are Levenshtein implementations available online as stored procedures, but I chose to go with source code from Josh Drew’s website. Compiling and installing the code on my Slackware server was much easier than Josh’s instructions made me think it would be. I got some extra hints from a page at mysql.com.

All I had to do was to compile:

g++ -I /usr/include/mysql/ -o libmysqllevenshtein.so -shared  mysqllevenshtein.cc

The source file doesn’t immediately look like C++, so if you compile it with gcc, you’ll get an error about
“__gxx_personality_v0 undefined symbol”. I found the tip to use g++ instead on http://gcc.gnu.org/ml/gcc-help/2002-07/msg00186.html and http://mail.python.org/pipermail/python-list/2002-June/149504.html.

Copy the shared object to the directory where the other mysql libs are:

cp libmysqllevenshtein.so /usr/lib/mysql/

And update the library links and cache:

cd /usr/lib

ln -s mysql/libmysqllevenshtein.so .

ldconfig

Then in mysql:

CREATE FUNCTION levenshtein RETURNS INT SONAME 'libmysqllevenshtein.so';

That’s it! Now a query like:

select levenshtein('privacy', 'rpivacy');

returns ‘2’ – the minimum number of edits to change ‘rpivacy’ into ‘privacy’. Now the search at spider.my is much more friendly when your typing goes bad! (2011 Jan 4th – just saw this link turn up as a 404. There’s not currently a spell checker at spider.my, but there may be one day soon.

Suggested word for typo in search box

Suggested word for typo in search box

Some updates on this topic are available:
Damerau-Levenshtein algorithm: Levenshtein with transpositions
Damerau-Levenshtein in Java

2009 Apr 16 update – see the Damerau Levenshtein page for latest version

Like Cuil. But with no funding, no style and no data.

August 2nd, 2008
Cuil.com - search engine

- Mandelbrot

I’ve been working so late recently, I’ve been early. It’s silly, working late. I’ve been sitting here working at a glacial pace, thinking “I must go to sleep”, until I can hear birds singing, the window starts to get bright again, and the neighbours start frying something that might smell tasty if I’d had a long night’s sleep.

I was gutted to see Cuil launch to great fanfare recently. It looks fabulous, though like plenty of others are writing, I’m not at all convinced by the tabular layout of the search results. But it does look nice.

Hey! I keep looking at the design of the homepage and thinking “Mandelbrotty…”, but maybe it’s just a sign I should sleep more. Is it just me?

Mandelbrot

- Cuil.com

Here’s an image stolen from Wikipedia for comparison…

Remember, you read it here first!

The reason I’m gutted is that I’ve been writing a search engine ‘for kicks’ and I’ve been expecting to tell people it’s ready to submit sites to “this Friday“. I’ve been telling my friend Ekompute over at Dummipedia “this Friday” since early June. There’s a world of difference between personal code and production code – I don’t mind picking up the pieces when my projects fall apart, but visitors to a website have higher expectations.

That point about “high expectations” is why I won’t be batting my eyelids at Venture Capitalists any time soon. Cuil may look pretty, and it does seem to have some genuinely new ideas, in its presentation, at least. Have you used it though? It may have “more pages than Google“, but when you try a few searches, you’ll quickly see that size is not the only issue.

Spider.my's excuse-filled homepage

- Spider.my

My project is not really ready for serious use. In fact, it may steal ten minutes of your time and you won’t even think it’s funny. My spider (and search engine (and DIY webserver)) is at spider.my!

I’m not selling it, am I?