Is It Time For a Web Crawling Code of Conduct? (expanded)

We've got a guest post up over at ReadWriteWeb entitled "Is it Time for a Web Crawling Code of Conduct?"  The RWW post provides a summary of how web crawling can be beneficial.  Here are some more specifics on the items mentioned:

Listening to Customers Better:
We have several customers at 80legs that use our service to collect customer reviews from various shopping websites.  This data is aggregated and scored by our clients to provide services such as media monitoring and one-stop shopping portals.

How it helps individuals:  Companies need aggregate data from the web to learn what people think about their products.  Companies that can listen better can meet the needs of their customers better.  One-stop shopping portals provide individuals with an easy way to compare prices and save money on what they’re buying.

Delivering Better Ads:
An interesting use-case for web crawling is discovering and analyzing potential ad channels.  Ad networks crawl millions of web pages to find content relevant to their ad inventory.

How it helps individuals:  I’ll admit this is somewhat derivative, but I think everyone would prefer relevant ads over irrelevant ones on a web page, given the choice.  It should also be noted that web crawling by ad networks means even tiny blogs by individuals can get better ads with higher click-through rates (CTRs).

Building Better Data Sets:
Companies like Infochimps and Factual use web crawling to build better, more structured data sets from information scattered around the web.  This can be anything from property data to sports data.  Rather than having this data scattered around the web, it’s now centralized for easy consumption and analysis.

How it helps individuals:  Again, the benefit is not immediate, but it’s there.  You’ll see Factual datasets being used inline with the content of various websites, enhancing your information experience.  As Infochimps grows their dataset store, you’ll have a great resource for dataset searching.

These are just three examples.  At 80legs, we have dozens of customer verticals, and all of them contain customers that are building fascinating applications on top of web crawling.

Comments (7)

Mar 29, 2011
_DxoxE_ said...
A much more urgent issue is: How can a website owner earn money when their content is used by a corporation and leads to a higher profit for it?

Therefore I think crawling the web shouldn't be free for web crawlers. Especially not for those crawlers that profit from using free content that has been crawled. Why? -- Because the owner of an excellent blog article or a worldshaking idea wants to decide who will use it.
It's a shame when crawlers just collect free information to make it un-free and sell it to any third party. Why don't they want to give back for free what they got for free?

Mar 29, 2011
Shion Deysarkar said...
Letting content creators decide how content is spread leads to restrictive deals that impede technological (and even societal) progress.

I'd be open to looking at options for paying for crawled content, but I wouldn't be open to arrangements that excessively limit the use of data. We don't want something like we've seen in music and film where ground-breaking services like Pandora and Netflix are hampered by outdated licensing arrangements.

Mar 30, 2011
_DxoxE_ said...
> Letting content creators decide how content is spread [...]

But that's what advertisers already do. They decide what ads I get according to the profile they built regarding my personal buying behaviour. I cannot decide whether I'd like to get ads, from whom I'd like to get them, or what kind of ads. It's not even possible to stop them collecting data about my web surfing or purchasing behaviour.

A blog or content writer has to have the same right as an advertiser or a corporation. I want to decide myself who is trustworthy enough to receive my content (or parts of it), rather than leaving that to the receiving party. And I want to decide if my content is free or not and for whom it will be free.
That's like consumer targeted advertising. But in this case the consumer is you. You are interested in my content.

Even if you are not satisfied with my point of view and even if you don't support that, I have the final say. I can ban any webspider from accessing my content. And I'm not the only one who is doing exactly that with some web spiders.

> [...] leads to restrictive deals that impede technological (and even societal) progress.

According to my statement you're questioning yourself with that (false) assertion.

By the way, while you're talking about licensing -- copying, using and/or selling content that has been downloaded from the web is a kind of copyright infringement, even when the content is accessible without restriction. No matter if it's an mp3, a video or a few words -- it's all just data or digitized content. That's why a blog owner's content is copyrighted by law as soon as it is posted.
And as long as there is a DMCA which leads to punishment when downloading some 'illegal' music or movies, why shouldn't it be possible to sue those who 'illegally' download, use and/or sell blog articles from the web?

I'm really not a friend of copyrights, licences, patents, acts and laws because they impede innovation (why invest in a new product when a patent can be renewed by just changing a component in the existing product? Why develop new software when it's easier to collect recurring fees for renewing a time-limited licence?). But now we have them, and therefore not just the average Joe has to accept that but also the big companies.

Mar 30, 2011
Shion Deysarkar said...
I think you may be assuming that the original content is being used without any modification, analysis or work derived from it. That is almost never the case. The reason our users get data (they see content as data) is typically to perform aggregated analysis on it. It's not to reproduce it somewhere.

Derived works, or works that provide additional value, are not protected by copyright, if I remember correctly (IANAL), so I don't think that argument holds.

