Proposal for a New Type of Robots.txt
In response to recent discussion on web crawling, Pete Warden put up a great post that discusses what rules should govern web crawlers. I think the main thrust of the post is summarized by:
Robots.txt needs to communicate the owner's intent more clearly, with new directives similar to 'no-archive' that lay out acceptable usage in much more detail.
I agree with this. Currently, robots.txt only provides guidance on who can crawl what content. Pete is right to point out that for most webmasters, robots.txt is just a way to tell Google what to do. I'd like to take that line of thinking further: most webmasters are only interested in being crawled by Google, and this is damaging to the data industry as a whole.
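To make the "who can crawl what" limitation concrete, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content, the ExampleShoppingBot name, and the example.com URL are made up for illustration; the point is only that the one question today's format can answer is whether a given agent may fetch a given URL, not what it may do with the data.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that only welcomes Google's crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Answer the only question today's robots.txt supports: may this agent fetch this URL?"""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/products"  # illustrative URL
    print("Googlebot:", can_crawl("Googlebot", url))
    print("ExampleShoppingBot:", can_crawl("ExampleShoppingBot", url))
```

Nothing in this check says anything about indexing, archiving, or redisplaying the content once it has been fetched.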
At 80legs, several webmasters have told us they couldn't care less about web crawlers other than Google's. Why? Because they understand the benefit that Google provides them (page views, ad revenue, etc.). They don't see the benefit provided by other web crawlers.
Your immediate response might be "Are there other benefits?" Yes, there are. Here are a couple of obvious examples:
- If you're an online store, you may want crawlers run by shopping aggregators to collect data from your site, so that the aggregator can become another customer channel for you.
- If you run a blog or some other sort of content site, you want crawlers run by ad networks to find your site so they can develop a more targeted ad channel for you.
There are also some forward-thinking ways of looking at this issue. The large-scale use of web data provides an overall richer experience for end-users. While the use-cases may not be immediately apparent, it's important to not unnecessarily impede the development of new technologies that require web data.
Pete's suggestion that robots.txt be more oriented toward the use of data is a great one. I envision a new robots.txt specification that looks something like this:
    User-agent: Google-bot
    Allow: index // allows web pages to be included in Google's index
    Allow: archive // web page content can be archived for up to XX number of days
    Disallow: display-user-content // disallows displaying user-generated content such as reviews, personal information
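As a rough sketch of how a crawler might read such a file, the Python below parses the directives above into a per-agent table of allowed and disallowed uses. None of this is an existing standard: the directive names and the `//` comment style come from my example, and the function name parse_usage_policy is made up for illustration.

```python
from collections import defaultdict

# The proposed, purely hypothetical usage-oriented format (comments shortened).
PROPOSED_ROBOTS_TXT = """\
User-agent: Google-bot
Allow: index // pages may be included in the index
Allow: archive // page content may be archived
Disallow: display-user-content // no displaying user-generated content
"""


def parse_usage_policy(text):
    """Return {user_agent: {usage: True if allowed, False if disallowed}}."""
    policy = defaultdict(dict)
    agent = None
    for raw in text.splitlines():
        # Assumption: "//" starts a trailing comment, as in the example above.
        line = raw.split("//", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            agent = value
        elif field in ("allow", "disallow") and agent is not None:
            policy[agent][value] = field == "allow"
    return dict(policy)


if __name__ == "__main__":
    for agent, uses in parse_usage_policy(PROPOSED_ROBOTS_TXT).items():
        print(agent, uses)
```

A real specification would also have to say what happens for uses that are not listed at all; the sketch simply leaves them out of the table.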
This would require each web page to be tagged with the type of content it contains. While this might seem like a hassle, webmasters are increasingly getting used to tagging content with microdata.
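Here is one way the tagging side could work, sketched in Python with the standard html.parser module. Everything specific in it is an assumption on my part: the meta tag name robots-content is invented (a real design might use microdata instead), and so is the rule that a use is formed by joining an action such as "display" with a content tag such as "user-content".

```python
from html.parser import HTMLParser


class ContentTagParser(HTMLParser):
    """Collects values from a hypothetical <meta name="robots-content" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "robots-content":
            content = attributes.get("content") or ""
            # Allow several comma-separated content types on one page.
            self.tags.extend(t.strip() for t in content.split(",") if t.strip())


def content_tags(html):
    """Return the list of content-type tags declared by the page."""
    parser = ContentTagParser()
    parser.feed(html)
    return parser.tags


def may_use(disallowed, action, html):
    """True unless "<action>-<tag>" is disallowed for some tag on the page."""
    return all(f"{action}-{tag}" not in disallowed for tag in content_tags(html))


# Illustrative page and policy; both are made up for this sketch.
PAGE = (
    '<html><head><meta name="robots-content" content="user-content"></head>'
    "<body>Great product!</body></html>"
)
DISALLOWED = {"display-user-content"}

if __name__ == "__main__":
    print(content_tags(PAGE))
    print(may_use(DISALLOWED, "display", PAGE))
    print(may_use(DISALLOWED, "index", PAGE))
```

With a scheme like this, the crawler decides per page and per intended use, rather than per URL path alone.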
Of course, this suggestion is far from perfect, but I think it's worth developing, especially as the use of web data becomes more prevalent.
