{"id":391,"date":"2009-06-26T13:34:28","date_gmt":"2009-06-26T05:34:28","guid":{"rendered":"http:\/\/blog.lolyco.com\/sean\/?p=391"},"modified":"2009-06-26T17:37:24","modified_gmt":"2009-06-26T09:37:24","slug":"la-times-reports-michael-jackson-dead-head-missing","status":"publish","type":"post","link":"https:\/\/blog.lolyco.com\/sean\/2009\/06\/26\/la-times-reports-michael-jackson-dead-head-missing\/","title":{"rendered":"LA Times reports Michael Jackson dead, HEAD missing"},"content":{"rendered":"<div id=\"attachment_392\" style=\"width: 157px\" class=\"wp-caption alignright\"><a href=\"http:\/\/en.wikipedia.org\/wiki\/Michael_jackson\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-392\" class=\"size-full wp-image-392\" title=\"Michael Jackson - from Wikipedia\" src=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2009\/06\/mj1984.jpeg\" alt=\"Michael Jackson - from Wikipedia\" width=\"147\" height=\"224\" \/><\/a><p id=\"caption-attachment-392\" class=\"wp-caption-text\">Michael Jackson - from Wikipedia<\/p><\/div>\n<p>I was manually adding some <a title=\"Michael Jackson dead - LA Times\" href=\"http:\/\/latimesblogs.latimes.com\/lanow\/2009\/06\/pop-star-michael-jackson-was-rushed-to-a-hospital-this-afternoon-by-los-angeles-fire-department-paramedics--capt-steve-ruda.html\">reports of Michael Jackson&#8217;s death<\/a> to the crawl queue at <a href=\"http:\/\/spider.my\/\">spider.my<\/a> this morning, when I noticed that one of the machines doing indexing had choked on a page. It wasn&#8217;t long ago that I added some code to detect <a href=\"http:\/\/en.wikipedia.org\/wiki\/Meta_element\">&lt;meta&gt; elements<\/a> in pages being used to specify the character set for the page. 
The regular expression that I&#8217;d come up with to extract the content-type from the http-equiv header field just seemed to be looping, using 100% CPU.<\/p>\n<p>I wrote some more pedestrian code that I hoped would be more robust. It failed to find the content-type that I could see in the page source, but at least it didn&#8217;t hang! It took me a little while to work out that the failure was caused by the page having no <a title=\"HEAD element at Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/HTML_element#Document_Structure_Elements\">&lt;head&gt; element<\/a>. I&#8217;m relatively new to developing for the Web, so it&#8217;s sometimes surprising that it all works so well when you see the quality of the data!<\/p>\n<div id=\"attachment_397\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2009\/06\/latimes.jpeg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-397\" class=\"size-medium wp-image-397\" title=\"LA Times missing HEAD element\" src=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2009\/06\/latimes-300x98.jpg\" alt=\"LA Times missing HEAD element\" width=\"300\" height=\"98\" srcset=\"https:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2009\/06\/latimes-300x98.jpg 300w, https:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2009\/06\/latimes.jpeg 484w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-397\" class=\"wp-caption-text\">LA Times missing HEAD element<\/p><\/div>\n<p>Just for completeness, <a title=\"LA Times' Michael Jackson death report. HEAD missing\" href=\"http:\/\/spider.my\/Cache?urid=5454246\">here&#8217;s the cached copy of the LA Times page from spider.my&#8217;s page cache<\/a>. Have a look at the page source. The LA Times page begins about 6 lines down with an HTML element. See? 
No HEAD.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I was manually adding some reports of Michael Jackson&#8217;s death to the crawl queue at spider.my this morning, when I noticed that one of the machines doing indexing had choked on a page. It wasn&#8217;t long ago that I added some code to detect &lt;meta&gt; elements in pages being used to specify the character set [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[30,14,40,35],"tags":[23,107,34,112,105],"class_list":["post-391","post","type-post","status-publish","format-standard","hentry","category-breaktime","category-broken","category-life","category-spidermy","tag-blogging","tag-broken","tag-network","tag-search","tag-software"],"_links":{"self":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/391","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/comments?post=391"}],"version-history":[{"count":10,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/391\/revisions"}],"predecessor-version":[{"id":408,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/391\/revisions\/408"}],"wp:attachment":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/media?parent=391"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/categories?post=391"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/tags?post=391"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated
":true}]}}