Search suggestions with MYSQL Levenshtein distance

August 6th, 2008 | by Sean |

Up until a couple of days ago, the Spider.my search page applied only minimal formatting to hits, and no extra page decoration. The ‘blank screen of death‘ appeared for a search miss: no results, no page content.

Just to reassure the user the empty page was ‘normal’, I’d put a lame message up saying “No results, try something like ‘uk sun‘”. As soon as I’d made the recent searches page and saw some Bahasa Malaysia searches on there, I realised that must look terrible!

I recalled the SOUNDEX() function from donkey’s years ago, but quickly gave up on using that. SOUNDEX() produces a code that can be used to match strings phonetically, but a simple spelling mistake can produce a large difference in SOUNDEX() result. In addition, two very different words may share the same SOUNDEX() result, for example: ‘perhaps’ and ‘privacy’.

A quick search turned up Levenshtein Distance, also know as ‘edit distance’. This is a measure of the minimum edits that must be made to one word to change it into another. You can think of it as the number of mistakes needed to turn what you intended to type into what you actually typed! Thought of this way, it seems to be a good candidate for search key suggestions.

Spider.my uses a MySQL backend, and there are Levenshtein implementations available online as stored procedures, but I chose to go with source code from Josh Drew’s website. Compiling and installing the code on my Slackware server was much easier than Josh’s instructions made me think it would be. I got some extra hints from a page at mysql.com.

All I had to do was to compile:

g++ -I /usr/include/mysql/ -o libmysqllevenshtein.so -shared  mysqllevenshtein.cc

The source file doesn’t immediately look like C++, so if you compile it with gcc, you’ll get an error about
“__gxx_personality_v0 undefined symbol”. I found the tip to use g++ instead on http://gcc.gnu.org/ml/gcc-help/2002-07/msg00186.html and http://mail.python.org/pipermail/python-list/2002-June/149504.html.

Copy the shared object to the directory where the other mysql libs are:

cp libmysqllevenshtein.so /usr/lib/mysql/

And update the library links and cache:

cd /usr/lib

ln -s mysql/libmysqllevenshtein.so .

ldconfig

Then in mysql:

CREATE FUNCTION levenshtein RETURNS INT SONAME 'libmysqllevenshtein.so';

That’s it! Now a query like:

select levenshtein('privacy', 'rpivacy');

returns ‘2′ - the minimum number of edits to change ‘rpivacy’ into ‘privacy’. Now the search at spider.my is much more friendly when your typing goes bad!

Suggested word for typo in search box

Suggested word for typo in search box

Some updates on this topic are available:
Damerau-Levenshtein algorithm: Levenshtein with transpositions
Damerau-Levenshtein in Java


Share and Enjoy:
  • Digg
  • del.icio.us
  • Fark
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati
  • E-mail this story to a friend!
  • Facebook
  • Furl
  • Mixx
  • Print this article!
  • Sphinn
  1. 1 Trackback(s)

  2. Oct 28, 2008: Sean’s blog » Blog Archive » Damerau-Levenshtein algorithm: Levenshtein with transpositions

Post a Comment