September 8th, 2008
I tested the spider at spider.my a few times recently. It was previously restricted to just a few sites that their respective admins had kindly volunteered. One of the immediate problems I noticed with releasing the spider in the wild was the number of pages I was mangling between storage and subsequent retrieval from the database: a character set problem, more on that later.
A more salient point, especially since I’m at an early stage of implementing spider.my, is the inclusion of pages I have no ability to read. The first example is the Wu language Wikipedia homepage: while it is not devoid of English (interesting, thank you, from an ashamed monoglot!), I have no idea whether I’m indexing this page in any useful way. If the admin of a non-English website saw my spider fetching pages and wanted to check I was doing something useful with them, I would have no way of ensuring I’m not making a complete mess of them.
Accept: header field and HTTP 406
So I checked on The Web Robots Pages, where they have some guidelines for robot writers. One of the guidelines is to “Ask for what you want”, and it suggests using the HTTP Accept: header field. W3.org says that if the Accept header field is specified by the client, and the server has no acceptable response, then it SHOULD send an HTTP 406 “Not Acceptable” response. It appears at first glance that this should solve a couple of my spider’s problems.
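For illustration, here is a minimal sketch of what “asking for what you want” could look like from the robot's side. It is not the spider's actual code: the media types, the language, the user-agent string and the `build_request` name are all assumptions made up for the example.

```python
import urllib.request

# Assumed preferences for a text-only, English-only spider (hypothetical values).
ACCEPT = "text/html, text/plain;q=0.8"
ACCEPT_LANGUAGE = "en"


def build_request(url):
    """Return a GET request that tells the server what the spider can use."""
    return urllib.request.Request(
        url,
        headers={
            "Accept": ACCEPT,
            "Accept-Language": ACCEPT_LANGUAGE,
            "User-Agent": "example-spider/0.1",  # placeholder name
        },
    )


# Building the request does not touch the network; passing it to
# urllib.request.urlopen() would actually send it.
req = build_request("http://example.com/")
```

If a server honoured the request strictly, `urlopen` would raise `urllib.error.HTTPError` with code 406 whenever it had nothing acceptable to offer.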
My spider also currently checks the ‘Content-Type’ header in responses and closes the socket if the content type is on a blacklist. That’s bad behaviour, and possibly doesn’t accomplish what I intend it to. Not only does the server have to do the work of preparing the unwanted resource, but the resource may also be transferred from the origin server to my spider’s server by the underlying network drivers. My spider is almost certainly reading from a locally-buffered copy of the remote resource. Closing the socket without copying the resource may also cause problems for the remote server if the transfer is not complete.
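The blacklist check itself might be sketched roughly as below. The entries in `BLACKLIST` and the function names are invented for the example; the point is only that the header value has to be stripped of parameters and lower-cased before it is compared.

```python
# Hypothetical blacklist; the real list is not given in this post.
BLACKLIST = {"image/jpeg", "image/png", "application/pdf", "application/zip"}


def media_type(content_type):
    """Drop parameters such as '; charset=utf-8' and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def is_blacklisted(content_type):
    """True if the response's Content-Type is one the spider does not want."""
    return media_type(content_type) in BLACKLIST
```

Note that this decision can only be made once the response headers have arrived, which is exactly why it does nothing to stop the server preparing and sending the body.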
Doesn’t work
I tried the Accept and Accept-Language header fields out on a few sites where I thought they might improve matters. I was hoping for an HTTP 406 response, but I never got one – every test resulted in an HTTP 200 reply! I was testing this by hand using cURL, so my test is limited to just a few dozen attempts. Most servers gave no indication that they had parsed the Accept header field in the request at all. The few that did responded by returning the unacceptable resource anyway, with alternatives specified in a ‘Content-Location’ header field.
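The three outcomes described above could be told apart with something like the following sketch. The `classify` function and its labels are made up for illustration, and it assumes the response header fields are already available as a dict.

```python
def classify(status, headers):
    """Describe how a server reacted to a request carrying Accept fields.

    `headers` is a dict of response header fields (names in any case).
    """
    names = {name.lower() for name in headers}
    if status == 406:
        return "refused"  # the response I was hoping to see
    if status == 200 and "content-location" in names:
        return "served anyway, alternative hinted"
    if status == 200:
        return "served, no sign of negotiation"
    return "other"
```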
I’m giving up on this one for a while. Next attempt will be to stop URLs entering the crawl queue based on file suffix. I’m in no hurry to implement this – it reminds me of MS-DOS. There are reasonable ways of establishing a resource’s type. Preserving ancient kludges is not high on my agenda.
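For what it's worth, a suffix filter could be as small as the sketch below. The suffix list and the `should_queue` name are assumptions for the example, not something already in the spider.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical list of suffixes to keep out of the crawl queue.
SKIP_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip", ".exe"}


def should_queue(url):
    """Return False when the URL path ends in an unwanted file suffix."""
    path = urlsplit(url).path
    suffix = posixpath.splitext(path)[1].lower()
    return suffix not in SKIP_SUFFIXES
```

It also shows why this is a kludge: `should_queue("http://example.com/download?file=x.zip")` returns True, because the suffix only appears in the query string and says nothing reliable about what the server will send.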
You don’t want it? Eat it!
The servers that seem to be gaily ignoring the Accept fields are not doing anything particularly wrong, according to w3.org. Note that w3.org says ‘SHOULD’ about the HTTP 406 response. Furthermore, on the “Common HTTP Implementation Problems – when negotiation fails” page at w3.org, the advice is to provide default or fallback behaviour:
HTTP/1.1 servers are allowed to return responses which are not acceptable
which strikes me as a simple error in the use of the word ‘fails’. In my view, if a client says it will only accept English, and the server only has a French resource, it should say (apologies for my schoolboy French) “je le n’ai pas” and not “vous prendrez ce que je donne”. The former is only the prelude to negotiation; the latter has failed. In my view, the HTTP 406 response would allow the client to attempt a compromise or give up.
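To make the difference concrete, here is a rough sketch of a server-side language choice with a `strict` switch. Everything in it is hypothetical, and it is deliberately simplified: languages are matched exactly, so ‘en’ does not match ‘en-GB’.

```python
def parse_accept_language(value):
    """Return {language: q} for an Accept-Language value; q defaults to 1.0."""
    prefs = {}
    for item in value.split(","):
        parts = item.strip().split(";")
        lang = parts[0].strip().lower()
        if not lang:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, val = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not wanted"
        prefs[lang] = q
    return prefs


def negotiate(accept_language, available, strict):
    """Return (status, language); `available` lists the server's languages, default first."""
    prefs = parse_accept_language(accept_language)

    def q_for(lang):
        return prefs.get(lang.lower(), prefs.get("*", 0.0))

    acceptable = [lang for lang in available if q_for(lang) > 0]
    if acceptable:
        return 200, max(acceptable, key=q_for)
    if strict:
        return 406, None  # "je le n'ai pas"
    return 200, available[0]  # "vous prendrez ce que je donne"
```

With `strict=True`, a client asking for ‘en’ from a server that only has ‘fr’ gets a 406 and can decide what to do next; with `strict=False` it gets the French page with a 200, which is the behaviour I kept seeing.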
Maybe I’m being pedantic, but as far as I can tell, Accept and Accept-Language DO NOT WORK in the way robotstxt.org suggests they should. I don’t think it’s a problem with robotstxt.org’s interpretation; I think w3.org got the definition right and the implementation guidelines wrong. Of course, that’s the robot-writer’s view. A stricter application of the definition would probably result in lots of websites being rendered incorrectly in browsers.
As ever, I’m interested in your comments, and particularly in the URL of a server that will return 406 responses. I would just like to see one!