No more home hosting

October 30th, 2008

Apparently we just need to give a few days’ notice to our local TM Point to terminate our Streamyx service. We’ve had it for more than two years, and upgraded to 1Mbit/s last year. I was hosting a few sites of my own, just little ideas I wanted to work on, using Dynamic DNS from ZoneEdit. My wife hosted her online shop from a server in our home office too, using Custom DNS from DynDNS.com – the best company I have ever dealt with on the Internet.

I have never phoned a service provider as often as I have phoned TM Net. I think I’ve even phoned them more often than I have the local pizza delivery service, so we’re talking a lot of phone calls! Because we host some of our own sites from home, we have a very stable network: it has stayed in one layout for nearly two years, since we moved into this house. There were a few months at the start of 2008 when nothing really serious seemed to be wrong with Streamyx; the occasional slow spell is bearable. Other than that, the whole period has been characterised by more or less show-stopping faults. Some days, even the phone was dead!

I don’t want to flog a dead horse; I’ve written about this elsewhere. Suffice it to say, it was so unbearably bad that we’re cancelling it and moving to DiGi’s EDGE network. That means no more home hosting, as the Malaysian cellular networks don’t support Dynamic DNS. I’ll have to find something else to while away the months! My wife’s shop has gone to Exabytes, so the Custom DNS is redundant. The Dynamic DNS from ZoneEdit has a handy feature for redirecting URLs, so this page will appear instead of:

  • poditronic.com
  • spider.my
  • aircarfuel.com
  • sendto.my
  • send2.my
  • blog.lolyco.com/sean

…until I get round to making something amazing to host on one of those domains, or until someone who thinks they can do it quicker takes a domain off my hands!

Streamyx SMTP AUTH mail proxy

October 22nd, 2008

I thought I’d better write a quick update to the previous article I wrote, since it gets a lot of hits. The information in the old article is a bit outdated, and our problems have moved on since then. I changed our sendmail config at lolyco.com some time ago, in response to an exchange of emails with the TM Net mail admins (who are, in my experience, the most responsive staff at TM). TM have updated their online instructions too. There are two important lines to update. The first is in config.mc:

define(`SMART_HOST',`[smtp-proxy.tm.net.my]')dnl

and the second is in auth/client-info (or whatever it’s called on your system):

AuthInfo:smtp-proxy.tm.net.my "U:username" "P:password" "M:LOGIN"
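Remember that neither line takes effect on its own: you need to rebuild sendmail.cf from the .mc file, regenerate the authinfo database, and restart sendmail. Something like the following (the exact paths vary from system to system, so treat this as a sketch of the usual sendmail incantations, not as TM’s instructions):

m4 /usr/share/sendmail/cf/m4/cf.m4 config.mc > /etc/mail/sendmail.cf
makemap hash auth/client-info < auth/client-info
/etc/rc.d/rc.sendmail restart

The last line is the Slackware way; use whatever restarts sendmail on your system.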

This seems to have reduced the number of mails we get returned by Yahoo and AOL due to smtp.streamyx.com being blacklisted as a spam relay, but it is not without its own ‘quirks’.

Since using the new server, we still see a few problems in our logs:

AUTH=client, available mechanisms do not fulfill requirements
stat=Deferred: Temporary AUTH failure

… which is not so bad, as the problem seems to ‘go away’ after a while and mail is eventually delivered.

The ‘domain does not resolve’ problem was a bit of a disaster back in August, when we got this message several times:

Domain of sender address <notmyname@notmydomain.com> does not resolve. Please register your domain.

We haven’t seen that one lately, but an email to help@tm.net.my at the time seemed to get it speedily fixed.

I hope that’s useful. What else has changed at TM? I see they’ve changed the wording of their Terms and Conditions from “best effort” to “best endeavours”. Glad to see their lawyers are earning their wages – they must have felt ‘effort’ implied something measurable.

Cannot execute /bin/bash: Permission denied

October 21st, 2008

Just a quick note, in case someone else sees this message. I got this error this morning when I logged in as a non-root user on a Slackware server I’d upgraded yesterday. In an act of blind faith, I used ‘slackpkg upgrade-all’, and then the fun began.

Incidentally, if you’re in Malaysia, and you’re downloading Slackware packages at the 8kB/s which seems to be all TM Net / Streamyx can manage on their “best effort” basis from nearby countries, download from Brazil – I get a reliable 180kB/s on a 1Mbit/s connection. Yes, that’s right folks: TM Net give you 5% of your rated download speed from any country within smuggling distance, but can max out your connection from the other side of the world! Answers on a postcard, please.

Back to the problem. I tried some fairly obvious fixes to do with checking permissions on directories and executables; none seemed to work. Then I searched online and found http://linuxgazette.net/issue52/okopnik.html, where the same problem was traced to the permissions on /lib/ld-linux.so.2, or rather, the permissions on the file it links to. On my system, changing the permissions to 755 on /lib/ld-2.7.so didn’t let me log in as a non-root user, but it did give me errors like:

error while loading shared libraries: libdl.so.2: cannot open shared object file: No such file or directory

Much easier to track down! It seems slackpkg had installed several libraries in /lib with insufficient permissions, all with the version string “2.7” in their names. According to ldd, all of these libraries depend on /lib/ld-linux.
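If you see the same thing, the fix is quick once you know where to look. Something along these lines worked for me (run as root; adjust the globs to match your own odd ones out):

ls -l /lib/*-2.7*       # spot the libraries missing read/execute bits
chmod 755 /lib/*-2.7.so # 755 is the usual permission set for glibc’s libraries
ldd /bin/bash           # check that everything now resolves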

I’d wish you ‘Have fun!’, but my eyes are still throbbing from staring at this error for so long.

Damerau-Levenshtein in Java

October 3rd, 2008

When I wrote about the Damerau-Levenshtein MySQL UDF (User Defined Function), I explained that I prototyped the functions in Java first. I had a request for the Java code, but when I looked at it, I realised that I had only taken it half way, and had made the finishing touches in the C code directly.

Here is my Levenshtein.java, a class containing three static methods:

  • lev(String, String) – the standard Levenshtein distance between two words
  • levlim(String, String, int) – an optimised Levenshtein that returns as soon as the distance between the two words grows beyond a limit
  • damlev(String, String) – the Damerau-Levenshtein distance between two words

The class has a main function, so you can run it directly. Besides the limit optimisation in levlim(), there’s no real optimisation done anywhere, so there should be plenty to go at. In particular, a new rectangular array is created on each invocation of any of the methods.
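To give a flavour of the limit trick in levlim(): if every entry in the current row of the distance table already exceeds the limit, the final distance must too, so the method can give up early. Here is a minimal sketch of that idea – it is a simplification written for this post, not a copy-paste from the file linked below:

static int levlim(String a, String b, int limit) {
    int n = a.length(), m = b.length();
    int[] prev = new int[m + 1];
    int[] curr = new int[m + 1];
    for (int j = 0; j <= m; j++) prev[j] = j;
    for (int i = 1; i <= n; i++) {
        curr[0] = i;
        int rowMin = curr[0];
        for (int j = 1; j <= m; j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            // cheapest of insert, delete and substitute
            curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                               prev[j - 1] + cost);
            rowMin = Math.min(rowMin, curr[j]);
        }
        if (rowMin > limit) return limit + 1; // no later row can shrink below this
        int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
    }
    return prev[m];
}

(Note that the sketch reuses two rows; the real file allocates the full rectangular array on every call, which is one of the things there is ‘to go at’.)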

Possibly the most important warning is this: I’ve only used this code for testing ideas, and it has never been used in any application!

So, fill your boots, here is Levenshtein.java.

Updated with a bugfix March 2009

2009 Apr 16 update – see the Damerau Levenshtein page for latest version

What not to GET. Limiting what robots will request.

September 8th, 2008

I tested the spider at spider.my a few times recently. It was previously restricted to just a few sites that their respective admins had kindly volunteered. One of the immediate problems I noticed on releasing the spider into the wild was the number of pages I was mangling between storage and subsequent retrieval from the database: a character set problem (more on that later).

A more salient point, especially since I’m at an early stage of implementing spider.my, is the inclusion of pages I have no ability to read. The first example is the Wu-language Wikipedia homepage – while not devoid of English (interesting, thank you, from an ashamed monoglot!), I have no idea whether I’m indexing this page in any useful way. If the admin of a non-English website saw my spider fetching pages and wanted to check that I was doing something useful with them, I would have no way of showing that I’m not making a complete mess of them.

Accept: header field and HTTP 406

So I checked The Web Robots Pages, where they have some guidelines for robot writers. One of the guidelines is to “Ask for what you want”, and it suggests using the HTTP Accept: header field. W3.org says that if the Accept header field is specified by the client, and the server has no acceptable response, then it SHOULD send an HTTP 406 “Not Acceptable” response. At first glance, this looks like it should solve a couple of my spider’s problems.
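If you want to try the same experiment, this is the sort of request I mean, as a little stand-alone Java test (my own sketch, not the spider’s code; the media types are just examples):

import java.net.HttpURLConnection;
import java.net.URL;

public class AcceptTest {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(args[0]).openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body transfer
        conn.setRequestProperty("Accept", "text/html, text/plain");
        conn.setRequestProperty("Accept-Language", "en");
        int code = conn.getResponseCode();
        if (code == 406) {
            System.out.println("406: nothing acceptable here, skip this URL");
        } else {
            System.out.println(code + " " + conn.getHeaderField("Content-Type"));
        }
        conn.disconnect();
    }
}

Run it as ‘java AcceptTest http://example.com/some-page’ and see what comes back.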

My spider also currently checks the ‘Content-Type’ header in responses and closes the socket if the content type is on a blacklist. That’s bad behaviour, and it possibly doesn’t accomplish what I intend. Not only does the server still do the work of preparing the unwanted resource; the resource may already have been transferred from the origin server to my spider’s machine by the underlying network stack, since my spider is almost certainly reading from a locally-buffered copy of the remote resource. Closing the socket without reading the resource to the end may also cause problems for the remote server if the transfer is not complete.
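If I keep the Content-Type check, the politer move is probably to read and discard the rest of the body rather than slam the socket shut. A sketch of that idea, assuming ‘in’ is the response body’s InputStream:

byte[] scratch = new byte[8192];
while (in.read(scratch) != -1) {
    // discard the unwanted body so the transfer completes cleanly
}
in.close();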

Doesn’t work

I tried the Accept and Accept-Language header fields out on a few sites where I thought they might improve matters. I was hoping for an HTTP 406 response, but I never got one – every test resulted in an HTTP 200 reply! I was testing this by hand using cURL, so my sample is limited to just a few dozen attempts. Most servers gave no indication that they had parsed the Accept header field in the request at all. The few that did responded by returning the unacceptable resource anyway, with alternatives specified in a ‘Content-Location’ header field.
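For the record, the by-hand tests looked something like this (an illustrative command, not a transcript of any one test):

curl -i -H "Accept: text/plain" -H "Accept-Language: en" http://example.com/only-available-as-html

The -i flag prints the response headers, which is where the 200 or (hopefully) 406 shows up.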

I’m giving up on this one for a while. The next attempt will be to stop URLs entering the crawl queue based on file suffix. I’m in no hurry to implement this – it reminds me of MS-DOS. There are reasonable ways of establishing a resource’s type, and preserving ancient kludges is not high on my agenda.

You don’t want it? Eat it!

The servers that seem to be gaily ignoring the Accept fields are not doing anything particularly wrong, according to w3.org. Note that w3.org only says ‘SHOULD’ about the HTTP 406 response. Furthermore, on w3.org’s ‘Common HTTP Implementation Problems’ page, in the section on what to do when negotiation fails, the advice is to provide default or fallback behaviour:

HTTP/1.1 servers are allowed to return responses which are not acceptable

which strikes me as a simple error in the use of the word ‘fails’. In my view, if a client says it will only accept English, and the server only has a French resource, the server should say (apologies for my schoolboy French) “je ne l’ai pas” (“I don’t have it”) and not “vous prendrez ce que je donne” (“you’ll take what I give”). The former is only a prelude to negotiation; the latter means negotiation has already failed. The HTTP 406 response would at least allow the client to attempt a compromise or to give up.

Maybe I’m being pedantic, but as far as I can tell, Accept and Accept-Language DO NOT WORK in the way robotstxt.org suggests they should. I don’t think it’s a problem with robotstxt.org’s interpretation; I think w3.org got the definition right and the implementation guidelines wrong. Of course, that’s the robot-writer’s view. A stricter application of the definition would probably result in lots of websites being rendered incorrectly in browsers.

As ever, I’m interested in your comments. In particular, if you know the URL of a server that will return 406 responses, I would just like to see one!