{"id":73,"date":"2008-08-06T01:44:25","date_gmt":"2008-08-05T17:44:25","guid":{"rendered":"http:\/\/blog.lolyco.com\/sean\/?p=73"},"modified":"2011-01-04T17:18:29","modified_gmt":"2011-01-04T09:18:29","slug":"search-suggestions-with-mysql-levenshtein-distance","status":"publish","type":"post","link":"https:\/\/blog.lolyco.com\/sean\/2008\/08\/06\/search-suggestions-with-mysql-levenshtein-distance\/","title":{"rendered":"Search suggestions with MYSQL Levenshtein distance"},"content":{"rendered":"<p>Up until a couple of days ago, the <a title=\"Spider.my\" href=\"http:\/\/spider.my\/\">Spider.my<\/a> <strong>search<\/strong> page applied only minimal formatting to hits, and no extra page decoration. The &#8216;<strong>blank screen of death<\/strong>&#8216; appeared for a <strong>search miss<\/strong>: no results, no page content.<\/p>\n<p>Just to <strong>reassure the user<\/strong> the empty page was &#8216;normal&#8217;, I&#8217;d put a <strong>lame<\/strong> message up saying &#8220;No results, try something like &#8216;<strong>uk sun<\/strong>&#8216;&#8221;. As soon as I&#8217;d made the recent searches page and saw some <a title=\"Bahasa Malaysia at Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/Malay_language\">Bahasa Malaysia <\/a>searches on there, I realised that must <strong>look terrible<\/strong>!<\/p>\n<p>I recalled the <strong>SOUNDEX()<\/strong> function from donkey&#8217;s years ago, but quickly gave up on using that. SOUNDEX() produces a code that can be used to match strings <strong>phonetically<\/strong>, but a simple spelling mistake can produce a large difference in SOUNDEX() result. In addition, two very different words may share the same SOUNDEX() result, for example: &#8216;perhaps&#8217; and &#8216;privacy&#8217;.<\/p>\n<p>A quick search turned up <strong><a title=\"Levenshtein distance at Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\">Levenshtein Distance<\/a><\/strong>, also know as &#8216;edit distance&#8217;. This is a measure of the minimum edits that must be made to one word to change it into another. You can think of it as the number of mistakes needed to turn what you intended to type into what you actually typed! Thought of this way, it seems to be a <strong>good candidate<\/strong> for search key <strong>suggestions<\/strong>.<\/p>\n<p>Spider.my uses a <a title=\"MySQL homepage\" href=\"http:\/\/www.mysql.com\/\">MySQL<\/a> backend, and there are Levenshtein implementations available online as stored procedures, but I chose to go with <a title=\"Josh Drew's homepage\" href=\"http:\/\/joshdrew.com\/\">source code from Josh Drew&#8217;s website<\/a>. Compiling and installing the code on my Slackware server was much easier than Josh&#8217;s instructions made me think it would be. I got some <a title=\"Levenshtein tips at MySQL.com\" href=\"http:\/\/dev.mysql.com\/doc\/refman\/5.0\/en\/udf-compiling.html\">extra hints from a page at mysql.com<\/a>.<\/p>\n<p>All I had to do was to compile:<\/p>\n<blockquote><p><code>g++ -I \/usr\/include\/mysql\/ -o libmysqllevenshtein.so -shared\u00a0 mysqllevenshtein.cc<\/code><\/p><\/blockquote>\n<p>The source file doesn&#8217;t immediately look like C++, so if you compile it with gcc, you&#8217;ll get an error about<br \/>\n&#8220;__gxx_personality_v0 undefined symbol&#8221;. I found the tip to use g++ instead on <a href=\"http:\/\/gcc.gnu.org\/ml\/gcc-help\/2002-07\/msg00186.html\">http:\/\/gcc.gnu.org\/ml\/gcc-help\/2002-07\/msg00186.html<\/a> and <a href=\"http:\/\/mail.python.org\/pipermail\/python-list\/2002-June\/149504.html\">http:\/\/mail.python.org\/pipermail\/python-list\/2002-June\/149504.html.<\/a><\/p>\n<p>Copy the shared object to the directory where the other mysql libs are:<\/p>\n<blockquote><p><code>cp libmysqllevenshtein.so \/usr\/lib\/mysql\/<\/code><\/p><\/blockquote>\n<p>And update the library links and cache:<\/p>\n<blockquote><p><code>cd \/usr\/lib<\/code><\/p>\n<p><code>ln -s mysql\/libmysqllevenshtein.so .<\/code><\/p>\n<p><code>ldconfig<\/code><\/p><\/blockquote>\n<p>Then in mysql:<\/p>\n<blockquote><p><code>CREATE FUNCTION levenshtein RETURNS INT SONAME 'libmysqllevenshtein.so';<\/code><\/p><\/blockquote>\n<p>That&#8217;s it! Now a query like:<\/p>\n<blockquote><p><code>select levenshtein('privacy', 'rpivacy');<\/code><\/p><\/blockquote>\n<p>returns &#8216;2&#8217; &#8211; the minimum number of edits to change &#8216;rpivacy&#8217; into &#8216;privacy&#8217;. <span style=\"text-decoration: line-through;\">Now the <a title=\"Search at Spider.my\" href=\"\">search at spider.my<\/a> is much more friendly when your typing goes bad!<\/span> (2011 Jan 4th &#8211; just saw this link turn up as a 404. There&#8217;s not currently <a href=\"http:\/\/spider.my\/\">a spell checker at spider.my<\/a>, but there may be one day soon.<\/p>\n<div id=\"attachment_79\" style=\"width: 510px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/spider.my\/Search\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-79\" class=\"size-full wp-image-79\" title=\"spidermylevenshtein\" src=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2008\/08\/spidermylevenshtein.jpeg\" alt=\"Suggested word for typo in search box\" width=\"500\" height=\"155\" \/><\/a><p id=\"caption-attachment-79\" class=\"wp-caption-text\">Suggested word for typo in search box<\/p><\/div>\n<p>Some updates on this topic are available:<br \/>\n<a href=\"http:\/\/blog.lolyco.com\/sean\/2008\/08\/27\/damerau-levenshtein-algorithm-levenshtein-with-transpositions\/\">Damerau-Levenshtein algorithm: Levenshtein with transpositions<\/a><br \/>\n<a href=\"http:\/\/blog.lolyco.com\/sean\/2008\/10\/03\/damerau-levenshtein-in-java\/\">Damerau-Levenshtein in Java<\/a><\/p>\n<p>2009 Apr 16 update &#8211; see <a title=\"Damerau Levenshtein page\" href=\"http:\/\/blog.lolyco.com\/sean\/damerau-levenshtein\/\">the Damerau Levenshtein page<\/a> for latest version<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Up until a couple of days ago, the Spider.my search page applied only minimal formatting to hits, and no extra page decoration. The &#8216;blank screen of death&#8216; appeared for a search miss: no results, no page content. Just to reassure the user the empty page was &#8216;normal&#8217;, I&#8217;d put a lame message up saying &#8220;No [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[30,20,11,3,35],"tags":[9,22],"class_list":["post-73","post","type-post","status-publish","format-standard","hentry","category-breaktime","category-fixed","category-google","category-software","category-spidermy","tag-open-source","tag-slackware"],"_links":{"self":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/73","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/comments?post=73"}],"version-history":[{"count":13,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/73\/revisions"}],"predecessor-version":[{"id":1031,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/73\/revisions\/1031"}],"wp:attachment":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/media?parent=73"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/categories?post=73"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/tags?post=73"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}