LA Times reports Michael Jackson dead, HEAD missing
June 26th, 2009 | by Sean
I was manually adding some reports of Michael Jackson’s death to the crawl queue at spider.my this morning when I noticed that one of the indexing machines had choked on a page. Not long ago I added some code to detect the <meta> elements that pages use to specify their character set. The regular expression I’d come up with to extract the content-type from the http-equiv <meta> element just seemed to be looping, using 100% CPU.
I wrote some code that was a little more pedestrian, which I hoped would be more robust. It failed to find the content-type that I could see in the page source, but at least it didn’t hang! It took me a little while to work out that the failure was caused by the page having no <head> element. I’m relatively new to developing for the Web, so it’s sometimes surprising that it all works so well when you see the quality of the data!
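For anyone curious what a "pedestrian" approach might look like, here is a minimal sketch in Python. It is not the code running at spider.my; the names MetaCharsetFinder and find_meta_charset are made up for illustration. It assumes the page has already been decoded to text well enough to read the tags. It uses the standard library's html.parser, which reports every start tag as it sees it, so it finds a <meta> element whether or not the page has a <head>.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Hypothetical helper: remembers the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Called for every start tag, wherever it sits in the document,
        # so a missing <head> makes no difference.
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attr["charset"].strip().lower()
        elif attr.get("http-equiv", "").lower() == "content-type":
            # Long form: content="text/html; charset=ISO-8859-1"
            for part in attr.get("content", "").split(";"):
                key, _, value = part.partition("=")
                if key.strip().lower() == "charset" and value.strip():
                    self.charset = value.strip().strip("\"'").lower()


def find_meta_charset(html_text):
    """Return the declared charset in lower case, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


# A page with no <head>, loosely modelled on the situation described above.
page = (
    "<html>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">\n'
    "<body>hello</body></html>"
)
print(find_meta_charset(page))
```

Splitting the content value on ";" and "=" avoids a regular expression altogether, so there is nothing to backtrack on.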
Just for completeness, here’s the cached copy of the LA Times page from spider.my’s page cache. Have a look at the page source. The LA Times page begins about 6 lines down with an HTML element. See? No HEAD.