Getting a multi-language page on a Chinese search engine

October 26th, 2012 | by Sean |

I made a multi-language page for Lolyco.com recently that switches between languages based on the Accept-Language header sent by the User-Agent, or on a session variable if one has been set. There’s an English version (the default), a Malay version and a Chinese version. I’ve had sites online before that were aggressively crawled by (for example) the BaiDuSpider but never seemed to appear in search results, even when they contained Chinese content.
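The switching logic could be sketched in Python roughly as follows. This is not the code behind the Lolyco page, only an illustration: the names `SUPPORTED`, `parse_accept_language` and `choose_language` are made up, and the language codes `en`, `ms` and `zh` are assumed for the English, Malay and Chinese versions.

```python
SUPPORTED = ("en", "ms", "zh")  # assumed codes: English (default), Malay, Chinese
DEFAULT = "en"


def parse_accept_language(header):
    """Return the language ranges in an Accept-Language value, most preferred first."""
    ranges = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        quality = 1.0  # a range without a q parameter has quality 1
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by order of appearance.
            ranges.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(ranges)]


def choose_language(accept_language=None, session_language=None):
    """Pick a page language: session variable first, then the header, then English."""
    if session_language in SUPPORTED:
        return session_language
    for tag in parse_accept_language(accept_language or ""):
        primary = tag.split("-")[0]  # "zh-cn" -> "zh"
        if primary in SUPPORTED:
            return primary
    return DEFAULT
```

With this sketch, `choose_language("zh-cn")` picks the Chinese version, and a request with no header at all falls back to English.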

The Lolyco page not only changes its content but also sets its Content-Language and Vary response headers. Returning different content on the same URI is not something I’m entirely happy with, but returning content tailored to User-Agents is not uncommon.
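For illustration, the two response headers might be built like this in Python. The function name `language_headers` is hypothetical, and the Vary value of `Accept-Language` is an assumption about what a page like this would send; a page that also switches on a session cookie might reasonably add `Cookie` to it.

```python
def language_headers(language):
    """Response headers for a page whose body depends on the requested language."""
    return [
        ("Content-Type", "text/html; charset=utf-8"),
        # Declares the language of the body actually returned.
        ("Content-Language", language),
        # Assumed value: tells caches the response varies with this request header.
        ("Vary", "Accept-Language"),
    ]
```

The Vary header is what stops a shared cache from handing the Chinese page to a visitor who asked for English on the same URI.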

The page has so far been crawled by three obviously Chinese crawlers: BaiDuSpider, YodaoBot and Sogou Spider. In place of a logo, ‘Lolyco’ is an h1 element whose content is changed for Chinese-requesting user agents to ‘络立科’ (in pinyin “luò lì kē”: no meaning, it just sounds like “lo ly co”). After a few days, I can see Sogou is returning a result for lolyco.com and has a cached copy of the Chinese page. Soso also has a result for the new Lolyco page, though there is no sign of the Soso spider in the log. Nothing yet at Baidu or Youdao.

A problem with doing it this way is that Google won’t crawl the non-English pages: for one thing it does not send an Accept-Language header, and for another it does not use cookies. Those are sensible crawler design choices. The Lolyco page defaults to English in the absence of an Accept-Language header. What Google needs (and I’m not a great fan of web standards being imposed by Google’s business requirements) is one URL per indexable entity.

The same issue exists with non-English search engines. A little twiddle of the LogFormat in my Apache (I use it as a proxy), appending the Accept-Language header value to each log line, tells me that Sogou Spider requests “zh-cn” content. Sogou has the Chinese content in its results, but will never see the English or Malay content.
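Apache’s LogFormat directive can log any request header with `%{Header-Name}i`, so appending `"%{Accept-Language}i"` to a format is one way to get the value into each line. The Python sketch below assumes that field was added as the last quoted field after the referer and user agent of the combined format; the exact format used on the Lolyco server is not shown here, and the sample lines are invented for illustration, not taken from the real log.

```python
import re
from collections import defaultdict

# Matches the last three quoted fields of a line: referer, user agent, Accept-Language.
TAIL = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" "(?P<lang>[^"]*)"\s*$')


def languages_by_agent(lines):
    """Map each User-Agent string to the set of Accept-Language values it sent."""
    seen = defaultdict(set)
    for line in lines:
        match = TAIL.search(line)
        if match:
            seen[match.group("agent")].add(match.group("lang"))
    return dict(seen)


# Invented sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [26/Oct/2012:10:00:00 +0800] "GET / HTTP/1.1" 200 5120 "-" "ExampleSpider/1.0" "zh-cn"',
    '192.0.2.2 - - [26/Oct/2012:10:05:00 +0800] "GET / HTTP/1.1" 200 4096 "-" "OtherBot/2.0" "-"',
]

for agent, languages in languages_by_agent(SAMPLE).items():
    print(agent, sorted(languages))
```

Apache writes `-` when a request carries no such header, which is how a crawler that sends no Accept-Language at all would show up.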

I reluctantly added a list of anchors to the Lolyco page that override the Accept-Language header with a “content-language” query parameter. The list is made invisible to human visitors with a CSS display:none, but should allow crawlers to fetch the other language versions of the page.
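A rough Python sketch of the override and the anchor list follows. The parameter name “content-language” is the one described above; everything else is assumed for illustration, including the parameter values (`en`, `ms`, `zh`), the link labels, the function names and the `languages` class that a stylesheet rule such as `display:none` would target.

```python
from urllib.parse import parse_qs, urlencode

# Assumed language codes and link labels.
SUPPORTED = {"en": "English", "ms": "Bahasa Melayu", "zh": "中文"}


def language_from_query(query_string):
    """Return the language named by ?content-language=..., or None if absent or unsupported."""
    values = parse_qs(query_string).get("content-language", [])
    for value in values:
        code = value.strip().lower().split("-")[0]  # "zh-CN" -> "zh"
        if code in SUPPORTED:
            return code
    return None


def language_links(path="/"):
    """Build the list of anchors, one per language version of the page."""
    items = []
    for code, label in SUPPORTED.items():
        href = path + "?" + urlencode({"content-language": code})
        items.append('<li><a href="%s" hreflang="%s">%s</a></li>' % (href, code, label))
    return '<ul class="languages">\n' + "\n".join(items) + "\n</ul>"
```

When `language_from_query` returns a language, the page would use it in preference to the Accept-Language header; when it returns `None`, the header decides as before.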

I’m not entirely confident I’ll be able to get search results from all search engines in all languages. Some of the crawlers seem to be restrictive: the zh-cn Accept-Language header on some Chinese crawlers makes me think they are only looking for Chinese content. Google doesn’t specify the header, so it will probably index the three languages indifferently. If you believe the hype then Baidu is the major Chinese search engine, which makes me wonder why its developers (if there is more than one) have chosen to send an Accept-Language header of ‘en-US’. 为什么百度? (Why, Baidu?)
