{"id":1453,"date":"2012-10-26T07:21:40","date_gmt":"2012-10-25T23:21:40","guid":{"rendered":"http:\/\/blog.lolyco.com\/sean\/?p=1453"},"modified":"2012-10-26T08:06:52","modified_gmt":"2012-10-26T00:06:52","slug":"getting-a-multi-language-page-on-a-chinese-search-engine","status":"publish","type":"post","link":"https:\/\/blog.lolyco.com\/sean\/2012\/10\/26\/getting-a-multi-language-page-on-a-chinese-search-engine\/","title":{"rendered":"Getting a multi-language page on a Chinese search engine"},"content":{"rendered":"<p>I made a multi-language page for <a href=\"http:\/\/lolyco.com\/\">Lolyco.com<\/a> recently that switches between languages based on the <a href=\"http:\/\/www.w3.org\/Protocols\/rfc2616\/rfc2616-sec14.html#sec14.4\">Accept-Language<\/a> header sent by the User-Agent or by setting a session variable. There&#8217;s an <a href=\"http:\/\/www.lolyco.com\/home.html?content-language=en\">English version<\/a> (the default), a <a href=\"http:\/\/www.lolyco.com\/home.html?content-language=ms\">Malay version<\/a> and a <a href=\"http:\/\/www.lolyco.com\/home.html?content-language=zh\">Chinese version<\/a>. I&#8217;ve had sites online before that are aggressively crawled by (for example) the <a href=\"http:\/\/www.baidu.com\/search\/spider.html\">BaiDuSpider<\/a> but never seem to appear in search results, even when they contain Chinese content.<\/p>\n<p><a href=\"http:\/\/sogou.com\/\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright size-full wp-image-1454\" title=\"Sogou.com logo\" src=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2012\/10\/sogou_logo_l.gif\" alt=\"Sogou.com logo\" width=\"248\" height=\"64\" \/><\/a>The Lolyco page not only changes its content but also sets its <a href=\"http:\/\/www.w3.org\/Protocols\/rfc2616\/rfc2616-sec14.html#sec14.12\">Content-Language<\/a> and <a href=\"http:\/\/www.w3.org\/Protocols\/rfc2616\/rfc2616-sec14.html#sec14.44\">Vary<\/a> response headers. 
Returning different content on the same URI is not something I&#8217;m entirely happy with, but returning content tailored to User-Agents is not uncommon.<\/p>\n<p>The page has so far been crawled by the obviously Chinese crawlers BaiDuSpider, <a href=\"http:\/\/www.youdao.com\/help\/webmaster\/spider\/\">YodaoBot<\/a> and <a href=\"http:\/\/www.sogou.com\/docs\/help\/webmasters.htm#07\">Sogou Spider<\/a>. In place of a logo, &#8216;Lolyco&#8217; is an h1 element whose content is changed for Chinese-requesting user agents to &#8216;\u7edc\u7acb\u79d1&#8217; (in pinyin &#8220;lu\u00f2 l\u00ec k\u0113&#8221;: no meaning, just sounds like &#8220;lo ly co&#8221;). After a few days, I can see <a href=\"http:\/\/www.sogou.com\/\">Sogou<\/a> is returning a <a href=\"http:\/\/www.sogou.com\/web?query=%22%C2%E7%C1%A2%BF%C6%22&amp;_asf=www.sogou.com&amp;_ast=1351162616&amp;w=01019900&amp;p=40040100&amp;sut=10083&amp;sst0=1351162616470\">result for lolyco.com<\/a> and has a <a href=\"http:\/\/www.sogou.com\/websnapshot?&amp;url=http%3A%2F%2Flolyco.com%2F&amp;did=62fca460d71fa81e-120b4a7ff749015e-65243f46082d93983d44c84cd9358f78&amp;k=043d71111b8620ffa4bc860e5ceff340&amp;encodedQuery=%22%C2%E7%C1%A2%BF%C6%22&amp;query=%22%C2%E7%C1%A2%BF%C6%22&amp;&amp;p=40040100&amp;dp=1&amp;w=01020400&amp;m=0&amp;st=0\">cached copy of the Chinese page<\/a>. <a href=\"http:\/\/www.soso.com\/\">Soso<\/a> also has a <a href=\"http:\/\/www.soso.com\/q?pid=s.idx&amp;cid=s.idx.se&amp;w=%C2%E7%C1%A2%BF%C6\">result for the new Lolyco page<\/a>, though no sign of the <a href=\"http:\/\/help.soso.com\/webspider.htm\">Soso spider<\/a> in the log. 
Nothing yet at <a href=\"http:\/\/www.baidu.com\/\">Baidu<\/a> or <a href=\"http:\/\/www.youdao.com\/\">Youdao<\/a>.<\/p>\n<p><a href=\"http:\/\/soso.com\/\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright  wp-image-1470\" title=\"Soso.com Chinese search\" src=\"http:\/\/blog.lolyco.com\/sean\/wp-content\/uploads\/2012\/10\/soso-logo_index.png\" alt=\"Soso.com Chinese search\" width=\"280\" height=\"70\" \/><\/a>A problem with doing it this way is that <a href=\"http:\/\/www.google.com\/\">Google<\/a> won&#8217;t crawl the non-English pages: for one thing it does not send an Accept-Language header, and for another it does not use cookies. Those are sensible crawler design choices. The Lolyco page defaults to English in the absence of an Accept-Language header. What Google needs (and I&#8217;m not a great fan of web standards being imposed by Google&#8217;s business requirements) is one URL per indexable entity.<\/p>\n<p>The same issue exists with non-English search engines. A little twiddle of my <a href=\"http:\/\/projects.apache.org\/projects\/http_server.html\">Apache<\/a> (I use it as a proxy) <a href=\"http:\/\/httpd.apache.org\/docs\/2.2\/mod\/mod_log_config.html#logformat\">LogFormat<\/a> to add the Accept-Language header value to my log lines tells me that Sogou Spider requests &#8220;<a href=\"http:\/\/www.w3.org\/International\/questions\/qa-choosing-language-tags#langsubtag\">zh-cn<\/a>&#8221; content. Sogou has the Chinese content in its results, but will never see the English or Malay content.<\/p>\n<p>I reluctantly added a list of anchors to the Lolyco page that override the Accept-Language header with a &#8220;content-language&#8221; query parameter. 
The list is made invisible to human visitors with a <a href=\"http:\/\/www.w3.org\/wiki\/CSS\/Properties\/display\">CSS display:none<\/a>, but should allow crawlers to fetch the other language versions of the page.<\/p>\n<p>I&#8217;m not entirely confident I&#8217;ll be able to get search results from all search engines in all languages. Some of the crawlers seem to be restrictive: the zh-cn Accept-Language header on some Chinese crawlers makes me think they are only looking for Chinese content. Google doesn&#8217;t specify the header so will probably index the 3 languages indifferently. If you believe the hype then Baidu is the major Chinese search engine, which makes me wonder why its developers (if there is more than one) have chosen to send an <strong>Accept-Language header of &#8216;en-US&#8217;<\/strong>.\u00a0 \u4e3a\u4ec0\u4e48\u767e\u5ea6\uff1f<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I made a multi-language page for Lolyco.com recently that switches between languages based on the Accept-Language header sent by the User-Agent or by setting a session variable. There&#8217;s an English version (the default), a Malay version and a Chinese version. 
I&#8217;ve had sites online before that are aggressively crawled by (for example) the BaiDuSpider but [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11,36,3],"tags":[24,41,4,34,18,57],"class_list":["post-1453","post","type-post","status-publish","format-standard","hentry","category-google","category-search","category-software","tag-apache","tag-http","tag-lolycocom","tag-network","tag-server","tag-web"],"_links":{"self":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/1453","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/comments?post=1453"}],"version-history":[{"count":8,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/1453\/revisions"}],"predecessor-version":[{"id":1469,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/posts\/1453\/revisions\/1469"}],"wp:attachment":[{"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/media?parent=1453"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/categories?post=1453"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.lolyco.com\/sean\/wp-json\/wp\/v2\/tags?post=1453"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}