Houston Code Camp: Converting the Internet into a Single Database

Houston's first-ever Code Camp will be upon us in August.  We're pretty excited about it at 80legs, since our CEO has been calling for a stronger hacker culture in Houston.  Hopefully this is the first of many quality hacker- and developer-oriented events in Houston.

We've submitted a session idea entitled "Converting the Internet into a Single Database: Technologies Used & Lessons Learned" and thought it would be a good idea to provide some more details here on what this session will be about.

The Internet as a Database: What's that Mean?

Consider what happens when you run a Google search for, say, "houston restaurants".  You, as an individual, are trying to find a single data point, most likely advice on where to eat in the next few hours.  Google is very good at delivering an answer from the Internet to individuals, but it's not as good at delivering answers to commercial organizations or at handling more complex queries.

Let's say what you really want is "all Houston restaurants that may need menu consultation" (e.g., if you are a kitchen consultant).  You might want to run a query like "Find all Houston businesses that are restaurants where overall rating is < 3.0 out of 5 stars and reviews contain complaints about menu items".  This is a much more complicated query, but the data is available out there.  We just need a way of structuring and querying it.
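To make that concrete, here is a minimal sketch of the restaurant query run over data that has already been collected and structured.  The `Business` record, its field names, and the keyword lists are all assumptions made up for illustration; they are not the actual 80legs data format, and real complaint detection would need far more than keyword matching.

```python
from dataclasses import dataclass, field


@dataclass
class Business:
    # Hypothetical structured record; field names are illustrative only.
    name: str
    city: str
    category: str
    rating: float  # average stars out of 5
    reviews: list[str] = field(default_factory=list)


# Hypothetical keyword lists standing in for real complaint detection.
MENU_TERMS = ("menu", "dish", "entree")
NEGATIVE_TERMS = ("bland", "limited", "boring", "overpriced")


def has_menu_complaint(review: str) -> bool:
    """True if the review mentions a menu term and a negative term."""
    text = review.lower()
    return any(t in text for t in MENU_TERMS) and any(
        t in text for t in NEGATIVE_TERMS
    )


def needs_menu_consultation(b: Business) -> bool:
    """The query from the text: Houston restaurant, rating < 3.0, menu complaints."""
    return (
        b.city.lower() == "houston"
        and b.category == "restaurant"
        and b.rating < 3.0
        and any(has_menu_complaint(r) for r in b.reviews)
    )


records = [
    Business("Cafe A", "Houston", "restaurant", 2.5, ["The menu is limited and bland."]),
    Business("Cafe B", "Houston", "restaurant", 4.5, ["Great menu!"]),
    Business("Shop C", "Houston", "retail", 2.0, ["Overpriced."]),
]

matches = [b.name for b in records if needs_menu_consultation(b)]
print(matches)
```

The filter itself is trivial once the records exist.  The hard part is getting billions of pages into a shape where a filter like this can run at all.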

Enter the Platform

Let's break down how we would build a platform that could serve our restaurant query and many more like it.  Here's what we'd need:

  1. The ability to collect all relevant data on the web (quickly and at-scale)
  2. A standard format for structuring data from different sources
  3. A storage system for all the data (which will probably be several billion records)
  4. A query language for retrieving data from storage
  5. A processing layer for running the query

If you look at these steps, you can start to conceptualize how a technology stack for "the Internet as a database" might look.  During our talk, we'll cover how we addressed and implemented each part of our stack, with a focus on the following questions:

  1. Should we choose to build this component in-house or use an open-source tool?
  2. How did we evaluate open-source tools for our use-case?
  3. How did we keep development of the platform on a rapid iteration cycle?
  4. What did we learn about our technology and business during the development of each component?

We hope folks will come away with a better understanding of how to break down large technology goals into smaller, more manageable components as well as how to evaluate different technologies as they relate to the goal (business or otherwise) at hand.

Hopefully this provides more insight into our proposed talk!  If you have any feedback, please let us know :)  If you'd like to see our talk at Code Camp, please vote for it!

 

Contest Time: IndexTank / 80legs Crawlathon!

We're partnering up with IndexTank to offer a contest that pits you and other hackers against each other to see who can make the best use of 80legs and IndexTank together.  Here are the contest details from IndexTank:

We like contests. A little competition is good. Brains get used, geekiness ensues, good stuff happens. Prizes, who doesn’t love prizes? I’m delighted to announce our third developer contest. Seriously, I’m reveling in paroxystic ecstasy because this one brings me back to the early days of the web. Why? Because you’ll have a chance to create your own little web search engine. No, you don’t need a team of engineers, a ton of dedicated servers and a chef.

IndexTank is teaming up with 80legs so that you can pick a chunk of the web, crawl it and index it. Finally you can create the search engine for Thundercats collectibles that you always wanted! (ok, that was just me).

Get Started:

  • Go to 80legs and sign up. You must use the referral code “contest” to get a contest account.
  • Come to IndexTank and sign up for our special contest account.
  • Create a front-end for your app (web, mobile). We recommend Heroku.
  • Read the contest rules (legalese, yada yada, sleeping aid).

Oh yeah, the prizes:

1) A shiny 11″ MacBook Air. We like them.

2) A Rovio WowWee robot AND Arduino pack from Adafruit.

3) The Art of Computer Programming, including the new volume 4A.

Also, the best Heroku-hosted app wins $100 worth of Heroku credit or a $100 Gift certificate for Amazon (even if it’s not one of the all-around top 3).

Got Questions? Contact us at any time if you have questions through our live chat on our site, #indextank on Freenode (irc) or email us at support@indextank.com.

Important Dates

  • Contest begins: June 15th at 12:01 am, Pacific Daylight Saving Time.
  • Contest ends: June 30th, 2011 at noon, PDT
  • Notification of winners: July 4th, 2011 (fireworks!)

How to Submit Your Winning App

Your application must be live and accessible to our judges by the end of the contest, and you must have completed the contest submission form (link will be posted here and tweeted by @indextank).

Your app will be judged based on:

  • Usefulness, creativity, elegance, efficiency.
  • The extent to which it takes advantage of IndexTank’s features and 80legs’ crawled data.
  • Extra points for making the source code publicly available (GitHub, Google Code, etc.) within 24 hours of the contest deadline.
  • Extra points for using indextank-jquery for your UI.

By Our Expert Judges:

  • Diego Basch, CEO IndexTank (@dbasch)
  • Shion Deysarkar, CEO 80legs (@shiondev)
  • James Lindenbaum, Co-Founder, Heroku
  • Othman Laraki, Twitter / GeoAPI (@othman)


Proposal for a New Type of Robots.txt

In response to recent discussion on web crawling, Pete Warden put up a great post that discusses what rules should govern web crawlers.  I think the main thrust of the post is summarized by:

Robots.txt needs to communicate the owner's intent more clearly, with new directives similar to 'no-archive' that lay out acceptable usage in much more detail.

I agree with this.  Currently, robots.txt only provides guidance on who can crawl what content.  Pete is right to point out that for most webmasters, robots.txt is just a way to tell Google what to do.  I'd like to take that line of thinking further and say that most webmasters are only interested in Google crawling them, and furthermore, this is damaging to the data industry as a whole.

At 80legs, we've seen several webmasters tell us they couldn't care less about other web crawlers besides Google.  Why?  Because they understand the benefit that Google provides them (page views, ad revenue, etc.).  They don't see the benefit provided by other web crawlers.

Your immediate response might be "Are there other benefits?"  Yes, there are.  Here are a couple of obvious examples:

  1. If you're an online store, you may want crawlers run by shopping aggregators to collect data from your site so the aggregators can become another customer channel for you.
  2. If you're a blog or some other content site, you want crawlers run by ad networks to find your site so they can develop a more targeted ad channel for you.

There are also some forward-thinking ways of looking at this issue.  The large-scale use of web data provides an overall richer experience for end-users.  While the use-cases may not be immediately apparent, it's important to not unnecessarily impede the development of new technologies that require web data.

Pete's suggestion that robots.txt be more oriented toward the use of data is a great one.  I envision a new robots.txt specification that looks something like this:

User-agent: Googlebot
Allow: index # allows web pages to be included in Google's index
Allow: archive # web page content can be archived for up to XX number of days
Disallow: display-user-content # disallows displaying user-generated content such as reviews, personal information

This would require each web page to be tagged with the type of content it contains.  While this might seem like a hassle, webmasters are increasingly getting used to tagging content with microdata.
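To show how a crawler might consume such a file, here is a rough sketch of a parser for the usage-oriented directives proposed above.  None of this is an existing standard: the function names, the `ExampleBot` agent, and the choice to deny any usage that isn't explicitly allowed are assumptions for illustration.

```python
def parse_usage_rules(text: str) -> dict[str, dict[str, set[str]]]:
    """Parse the proposed directives into per-agent allow/disallow sets."""
    rules: dict[str, dict[str, set[str]]] = {}
    agent = None
    for raw in text.splitlines():
        # Drop trailing comments in either "//" or "#" style.  This is safe
        # here only because directive values are usage names, not URLs.
        line = raw.split("//", 1)[0].split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            agent = value
            rules.setdefault(agent, {"allow": set(), "disallow": set()})
        elif key in ("allow", "disallow") and agent is not None:
            rules[agent][key].add(value)
    return rules


def may_use(rules: dict, agent: str, usage: str) -> bool:
    """Assumed policy: a usage must be explicitly allowed and not disallowed."""
    entry = rules.get(agent)
    if entry is None:
        return False
    return usage in entry["allow"] and usage not in entry["disallow"]


example = """\
User-agent: ExampleBot
Allow: index // pages may be included in a search index
Allow: archive
Disallow: display-user-content
"""

rules = parse_usage_rules(example)
print(may_use(rules, "ExampleBot", "index"))
print(may_use(rules, "ExampleBot", "display-user-content"))
```

Whether unspecified usages should default to allowed or denied is exactly the kind of question a real specification would have to settle.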

Of course, this suggestion is far from perfect, but I think it's worth developing, especially as the use of web data becomes more prevalent.

Is It Time For a Web Crawling Code of Conduct? (expanded)

We've got a guest post up over at ReadWriteWeb entitled "Is it Time for a Web Crawling Code of Conduct?"  The RWW post provides a summary of how web crawling can be beneficial.  Here are some more specifics on the items mentioned:

Listening to Customers Better:
We have several customers at 80legs that use our service to collect customer reviews from various shopping websites.  This data is aggregated and scored by our clients to provide services such as media monitoring and one-stop shopping portals.

How it helps individuals:  Companies need aggregate data from the web to learn what people think about their products.  Companies that can listen better can meet the needs of their customers better.  One-stop shopping portals provide individuals with an easy way to compare prices and save money on what they’re buying.

Delivering Better Ads:
An interesting use-case for web crawling is discovering and analyzing potential ad channels.  Ad networks crawl millions of web pages to find content relevant to their ad inventory.

How it helps individuals:  I’ll admit this is somewhat derivative, but I think everyone would prefer relevant ads over irrelevant ones on a web page, given the choice.  It should also be noted that web crawling by ad networks means even tiny blogs by individuals can get better ads with higher CTRs.

Building Better Data Sets:
Companies like Infochimps and Factual use web crawling to build better, more structured data sets from information scattered around the web.  This can be anything from property data to sports data.  Rather than being scattered around the web, this data is now centralized for easy consumption and analysis.

How it helps individuals:  Again, the benefit is not immediate, but it’s there.  You’ll see Factual datasets being used inline with the content of various websites, enhancing your information experience.  As Infochimps grows their dataset store, you’ll have a great resource for dataset searching.

These are just three examples.  At 80legs, we have dozens of customer verticals, and all of them contain customers that are building fascinating applications on top of web crawling.

The Grey Market for Data

Jud Valeski at O'Reilly Radar recently posted a great piece on "The black market for data".  The part we feel is really worth highlighting has to do with keeping data available:

Despite black markets and TOS violations, it's important for publishers to continue to make their data widely available. Publishers get the public benefit of being labeled as open, as opposed to proprietary. They also effectively outsource many of the hard technical challenges and business models to developers who want to build products based on their data.

It's important to realize that data has value beyond the original intention of its creators, and we should encourage the creation of that value.  We should do what we can to prevent the misuse of the data by some data consumers, but we should make sure that data consumers that are using data to create additional value are encouraged to do so.

Case Study: Discovering Ad Channels by Adify

Good content can be hard to find. As individuals, we spend several hours each week browsing through web pages and trying to find content that's interesting. Advertising networks know that, and they want to make sure the ads they deliver are displayed next to the best and most interesting content. To do this, they crawl the web, trying to identify websites with interesting content. The more websites they know about, the more potential ad channels they have.

Adify is one of the top advertising networks in the country, and they use 80legs to help power their web crawling and analysis of interesting web content as a component of their market mapping™ methodology. To do this, Adify has created their own custom 80legs code to process the content of a web page and determine whether or not the web page or domain provides interesting and relevant content. Over time, they have built several applications on the 80legs platform to tell them whether or not a domain fits potential advertisers' needs. With the scale and customization provided by 80legs, Adify can do this quickly, easily and cost-effectively.

Adify has crawled over 50 million targeted websites with 80legs. When coupled with Adify's proprietary insights and other industry-leading sources of analytics, these crawls help expand Adify's extensive database of websites and create a comprehensive map of potential advertising channels on the web. By mapping the Internet in this manner and creating market maps™, Adify is able to provide their customers strategic guidance on content monetization.

Case Study: Sentiment Analysis by Lingway

Sentiment analysis is in big demand these days. Lingway uses natural language processing (NLP) to understand how people feel about various brands. Lingway specializes in processing text data, but they rely on the specialty of 80legs to gather that data from the Web.

Here's how Lingway's workflow handles data extraction and collection:

  • Search engines are used to generate a list of URLs related to given keywords about a brand.
  • The URL list is uploaded to 80legs as a seed list, and a web crawl is started from this seed list.
  • During the web crawl, a custom data extractor (aka "80app") is used to process and clean up the text content of a web page.
  • The results generated by the 80legs web crawl are then fed into Lingway's NLP tools, which determine sentiment.

The 80legs API and 80app framework, along with the raw bandwidth and web crawling speed provided by 80legs, let Lingway crawl the web in a very short time for any given topic. 80legs helps Lingway with massive distributed data cleanup and enhances the performance of its own product.
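For a rough idea of what the text cleanup step in that workflow involves, here is a small sketch that strips markup, scripts and styles from a page and collapses whitespace, leaving plain text an NLP tool could consume.  It uses only Python's standard library and is not the 80app API or Lingway's actual extractor; the class and function names are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the chunks and collapse runs of whitespace into single spaces.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def clean_page(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>I love   this <b>brand</b>!</p>"
    "<script>var x=1;</script></body></html>"
)
cleaned = clean_page(sample)
print(cleaned)
```

Running this kind of cleanup during the crawl, rather than afterwards, means the downstream sentiment tools only ever see text.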

New 80app: Link Mapper - Generate Sitemaps for any Website

We've just released a new 80app called Link Mapper.  Link Mapper can be used to automatically generate sitemaps for any website.

Link Mapper is available with Plus or Premium plans.  To use it, we recommend the following 80legs job settings:

  1. Seed list: the domain for which you want to generate a sitemap
  2. Outgoing links to crawl: restrict the crawl to only the domain(s) in your seed list
  3. Analysis to run: select the Link Mapper 80app

When the crawl is complete, you'll get output that looks something like this for each URL:

http://www.80legs.com
parent link: null
outgoing links:
http://www.80legs.com/_css/styles.css
http://www.80legs.com/index.html
http://www.80legs.com/_images/logo.gif
https://portal.80legs.com/portal
https://portal.80legs.com/portal/register
http://www.80legs.com/docs.html
http://www.80legs.com/contact.html
http://www.80legs.com/tour.html
http://www.80legs.com/plans.html
http://www.80legs.com/who-uses-80legs.html
http://www.80legs.com/services.html
http://www.80legs.com/about-us.html
http://blog.80legs.com
http://vimeo.com/moogaloop.swf?clip_id=11065804&server=vimeo.com&show_title=0&show_byline=0&show_portrait=0&color=&fullscreen=0&autoplay=1
http://www.80legs.com/_images/front_page_custom_crawling.gif
http://www.80legs.com/_images/front_page_crawl_package.gif
http://www.80legs.com/_images/front_page_support.gif
http://www.80legs.com/_images/front_page_view_all_pricing.gif
http://blog.80legs.com/2010/08/10/case-study-monotype-imaging
http://blog.80legs.com/2010/08/05/crawling-and-the-programmable-web
http://blog.80legs.com/2010/04/20/crawl-packages-aggregate-website-data-in-a-few-clicks
http://www.80legs.com/_images/front_page_see_whos_using.gif
http://www.leadforce1.com
http://www.leadforce1.com/bf/bf.js
http://olark.com/about

This data can then be used to generate a sitemap :)
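As a sketch of that last step, the snippet below turns Link Mapper-style records into a basic XML sitemap.  It assumes each record arrives as its own block of text in the layout shown above (page URL, parent link, then outgoing links) and that only links on the same host as the page belong in the sitemap.  The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def parse_record(text: str) -> tuple[str, list[str]]:
    """Split one record into (page URL, outgoing links)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    page = lines[0]
    marker = lines.index("outgoing links:")
    return page, lines[marker + 1:]


def build_sitemap(records: list[str]) -> str:
    """Build a sitemap from the pages and their same-host outgoing links."""
    urls = set()
    for rec in records:
        page, links = parse_record(rec)
        host = urlparse(page).netloc
        urls.add(page)
        # Keep only links on the same host as the crawled page.
        urls.update(link for link in links if urlparse(link).netloc == host)

    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in sorted(urls):
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


record = """\
http://www.80legs.com
parent link: null
outgoing links:
http://www.80legs.com/docs.html
http://www.80legs.com/plans.html
http://blog.80legs.com
"""

sitemap = build_sitemap([record])
print(sitemap)
```

A real sitemap would probably also filter out stylesheets and images, which show up among the outgoing links in the sample output.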

 

Customer Discovery with 80legs

One of our users wrote up a quick little blog post about how to use web crawling to find customers.  Mark's company develops an A/B testing tool for websites.  To identify potential customers, he wanted to find websites that currently use his competitor's products.

In order to find this information, he set up a crawl on 80legs that flagged crawled web pages that had code from one of his competitors.  Mark's crawl was a very simple way to get some quick customer data.
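The flagging logic for a crawl like this can be very small.  Here is a sketch of the idea: check each page's HTML for signatures of competing products.  The signature patterns and example domains are hypothetical placeholders, and this is not Mark's actual code or the 80app API.

```python
import re

# Hypothetical signatures.  Real ones would be the script hosts or snippets
# that each competing product asks sites to embed.
COMPETITOR_SIGNATURES = {
    "competitor-a": re.compile(r"cdn\.competitor-a\.example/[\w.-]+\.js", re.I),
    "competitor-b": re.compile(r"_competitorB\.push\(", re.I),
}


def detect_competitors(html: str) -> list[str]:
    """Return the names of competitors whose code appears in the page."""
    return sorted(
        name for name, pattern in COMPETITOR_SIGNATURES.items() if pattern.search(html)
    )


pages = {
    "http://shop.example/": '<script src="//cdn.competitor-a.example/ab.js"></script>',
    "http://blog.example/": "<p>No testing tools here.</p>",
}

# Keep only the pages where at least one signature matched.
flagged = {
    url: hits for url, html in pages.items() if (hits := detect_competitors(html))
}
print(flagged)
```

Each flagged URL is a site already paying for (or at least trying) a competing tool, which makes it a warm lead.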

To read the full post, click here.

Why a DIY Big Data Stack Is a Better Option

I've got an opinion piece up at GigaOm on the topic of why building your own big data stack can be a better option than using an "off-the-shelf" system.  Here's an excerpt from the piece:

Today, many conversations within the big data community are centered around the rise of the standard, big data stack, which includes utilities like HDFS, HBase, and other increasingly-popular applications. While settling on a standard big data stack is deeply important to the big data industry as a whole, I’m nonetheless questioning the operational and competitive consequences for companies who choose to buy into this standard without first considering the value of building their own proprietary solution.

Read the rest of the piece to find out what competitive and operational advantages (as well as disadvantages) our own stack offers us.