The 80legs Blog http://blog.80legs.com Most recent posts at The 80legs Blog posterous.com Sat, 18 Jun 2011 08:56:00 -0700 Houston Code Camp: Converting the Internet into a Single Database http://blog.80legs.com/houston-code-camp-converting-the-internet-int http://blog.80legs.com/houston-code-camp-converting-the-internet-int

Houston's first ever Code Camp will be upon us in August.  We're pretty excited about it at 80legs, since our CEO has been calling for a stronger hacker culture in Houston.  Hopefully this is the first of many quality hacker and developer-oriented events in Houston.

We've submitted a session idea entitled "Converting the Internet into a Single Database: Technologies Used & Lessons Learned" and thought it would be a good idea to provide some more details here on what this session will be about.

The Internet as a Database: What's that Mean?

Consider what's happening when you run a Google search for, say, "houston restaurants".  You, as an individual, are trying to find a single data point, most likely advice on where to eat in the next few hours.  Google is very good at delivering an answer from the Internet to individuals, but it's not good at delivering answers to commercial organizations or at handling more complex queries.

Let's say what you really want is "all houston restaurants that may need menu consultation" (e.g., if you are a kitchen consultant).  You might want to run a query like "Find all Houston businesses that are restaurants where overall rating is < 3.0 out of 5 stars and reviews contain complaints about menu items".  This is a much more complicated query, but the data is available out there.  We just need a way of structuring and querying it.
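To make the restaurant query concrete, here is a minimal Python sketch of how it might look once the data has been structured.  The `Business` record, its field names, and the keyword lists are all assumptions made for illustration; they are not our actual schema or query language, and keyword matching is only a crude stand-in for real review analysis.

```python
from dataclasses import dataclass, field


@dataclass
class Business:
    """Hypothetical structured record; the field names are assumptions."""
    name: str
    city: str
    category: str
    rating: float  # overall rating out of 5 stars
    reviews: list[str] = field(default_factory=list)


# Crude stand-in for "complaints about menu items". A real system would
# need proper text analysis rather than keyword matching.
MENU_TERMS = ("menu", "dish", "entree")
NEGATIVE_TERMS = ("bad", "bland", "limited", "disappointing", "overpriced")


def has_menu_complaint(review: str) -> bool:
    """True if the review mentions a menu term and a negative term."""
    text = review.lower()
    return (any(term in text for term in MENU_TERMS)
            and any(term in text for term in NEGATIVE_TERMS))


def needs_menu_consultation(businesses: list[Business]) -> list[Business]:
    """Houston restaurants rated below 3.0 with a menu complaint."""
    return [
        b for b in businesses
        if b.city == "Houston"
        and b.category == "restaurant"
        and b.rating < 3.0
        and any(has_menu_complaint(r) for r in b.reviews)
    ]
```

The filter itself is trivial once records look like this; the hard parts are collecting the data and getting it into one consistent shape.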

Enter the Platform

Let's break down how we would build a platform that could serve our restaurant query and many more like it.  Here's what we'd need:

  1. The ability to collect all relevant data on the web (quickly and at scale)
  2. A standard format for structuring data from different sources
  3. A storage system for all the data (which will probably be several billion records)
  4. A query language for retrieving data from storage
  5. A processing layer for running the query

If you look at these steps, you can start to conceptualize how a technology stack for "the Internet as a database" might look.  During our talk, we'll cover how we addressed and implemented each part of our stack, with a focus on the following questions:

  1. Should we choose to build this component in-house or use an open-source tool?
  2. How did we evaluate open-source tools for our use-case?
  3. How did we keep development of the platform on a rapid iteration cycle?
  4. What did we learn about our technology and business during the development of each component?

We hope folks will come away with a better understanding of how to break down large technology goals into smaller, more manageable components as well as how to evaluate different technologies as they relate to the goal (business or otherwise) at hand.

Hopefully this provides more insight into our proposed talk!  If you have any feedback, please let us know :)  If you'd like to see our talk at Code Camp, please vote for it!

 

Permalink | Leave a comment  »

]]>
http://files.posterous.com/user_profile_pics/727286/spider_-_blue.jpg http://posterous.com/users/4xlpt9H0BRFD 80legs 80legs
Wed, 15 Jun 2011 11:10:00 -0700 Contest Time: IndexTank / 80legs Crawlathon! http://blog.80legs.com/contest-time-indextank-80legs-crawlathon http://blog.80legs.com/contest-time-indextank-80legs-crawlathon

We're partnering up with IndexTank to offer a contest that pits you and other hackers against each other to see who can make the best use of 80legs and IndexTank together.  Here are the contest details from IndexTank:

We like contests. A little competition is good. Brains get used, geekiness ensues, good stuff happens. Prizes, who doesn’t love prizes? I’m delighted to announce our third developer contest. Seriously, I’m reveling in paroxystic ecstasy because this one brings me back to the early days of the web. Why? Because you’ll have a chance to create your own little web search engine. No, you don’t need a team of engineers, a ton of dedicated servers and a chef.

IndexTank is teaming up with 80legs so that you can pick a chunk of the web, crawl it and index it. Finally you can create the search engine for Thundercats collectibles that you always wanted! (ok, that was just me).

Get Started:

  • Go to 80legs and sign up. You must use the referral code “contest” to get a contest account.
  • Come to IndexTank and sign up for our special contest account.
  • Create a front-end for your app (web, mobile). We recommend Heroku.
  • Read the contest rules (legalese, yada yada, sleeping aid).

Oh yeah, the prizes:

1) A shiny 11″ MacBook Air. We like them.

2) A Rovio WowWee robot AND Arduino pack from Adafruit.

3) The Art of Computer Programming, including the new volume 4A.

Also, the best Heroku-hosted app wins $100 worth of Heroku credit or a $100 Gift certificate for Amazon (even if it’s not one of the all-around top 3).

Got Questions? Contact us at any time if you have questions through our live chat on our site, #indextank on Freenode (irc) or email us at support@indextank.com.

Important Dates

  • Contest begins: June 15th at 12:01 am, Pacific Daylight Saving Time.
  • Contest ends: June 30th, 2011 at noon, PDT
  • Notification of winners: July 4th, 2011 (fireworks!)

How to Submit Your Winning App

Your application must be live and accessible to our judges by the end of the contest, and you must have completed the contest submission form (link will be posted here and tweeted by @indextank).

Your app will be judged based on:

  • Usefulness, creativity, elegance, efficiency.
  • The extent to which it takes advantage of IndexTank’s features and 80legs’ crawled data.
  • Extra points for making the source code publicly available (GitHub, Google Code, etc.) within 24 hours of the contest deadline.
  • Extra points for using indextank-jquery for your UI.

By Our Expert Judges:

  • Diego Basch, CEO IndexTank (@dbasch)
  • Shion Deysarkar, CEO 80legs (@shiondev)
  • James Lindenbaum, Co-Founder, Heroku
  • Othman Laraki, Twitter / GeoAPI (@othman)

Discuss on Hacker News

Sun, 17 Oct 2010 09:09:00 -0700 Proposal for a New Type of Robots.txt http://blog.80legs.com/proposal-for-a-new-type-of-robotstxt http://blog.80legs.com/proposal-for-a-new-type-of-robotstxt

In response to recent discussion on web crawling, Pete Warden put up a great post that discusses what rules should govern web crawlers.  I think the main thrust of the post is summarized by:

Robots.txt needs to communicate the owner's intent more clearly, with new directives similar to 'no-archive' that lay out acceptable usage in much more detail.

I agree with this.  Currently, robots.txt only provides guidance on who can crawl what content.  Pete is right to point out that for most webmasters, robots.txt is just a way to tell Google what to do.  I'd like to take that line of thinking further and say that most webmasters are only interested in Google crawling them, and furthermore, this is damaging to the data industry as a whole.

At 80legs, we've seen several webmasters tell us they couldn't care less about other web crawlers besides Google.  Why?  Because they understand the benefit that Google provides them (page views, ad revenue, etc.).  They don't see the benefit provided by other web crawlers.

Your immediate response might be "Are there other benefits?"  Yes, there are.  Here are a couple of obvious ones:

  1. If you're an online store, you may want shopping aggregators' web crawlers to get data from your site so the aggregators can build another customer channel for you.
  2. If you're a blog or some sort of content site, you want ad networks' web crawlers to find your site so they can develop a more targeted ad channel for you.

There are also some forward-thinking ways of looking at this issue.  The large-scale use of web data provides an overall richer experience for end-users.  While the use-cases may not be immediately apparent, it's important to not unnecessarily impede the development of new technologies that require web data.

Pete's suggestion that robots.txt be more oriented toward the use of data is a great one.  I envision a new robots.txt specification that looks something like this:

User-agent: Google-bot
Allow: index // allows web pages to be included in Google's index
Allow: archive // web page content can be archived for up to XX number of days
Disallow: display-user-content // disallows displaying user-generated content such as reviews, personal information

This would require each web page to be tagged with the type of content it is.  While this might seem like a hassle, webmasters are increasingly getting used to tagging content with microdata.
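To show how a crawler might consume such a file, here is a minimal Python sketch that parses the format proposed above.  The usage names (`index`, `archive`, `display-user-content`) and the `//` comment syntax come from this proposal only and are not part of any existing robots.txt standard; the function names and the deny-by-default behavior are assumptions of the sketch.

```python
def parse_usage_rules(text: str) -> dict[str, dict[str, bool]]:
    """Parse the hypothetical usage-oriented robots.txt proposed above.

    Returns {user_agent: {usage: allowed}}.
    """
    rules: dict[str, dict[str, bool]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop the trailing comment
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            current = rules.setdefault(value, {})
        elif key in ("allow", "disallow") and current is not None:
            current[value] = (key == "allow")
    return rules


def may_use(rules: dict[str, dict[str, bool]], user_agent: str, usage: str) -> bool:
    # Unknown agents or usages are treated as "not allowed" here; the
    # proposal does not say what the default should be.
    return rules.get(user_agent, {}).get(usage, False)
```

A crawler would call `may_use(rules, "Google-bot", "archive")` before storing a page, in the same way it checks a path against today's robots.txt before fetching.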

Of course, this suggestion is far from perfect, but I think it's worth developing, especially as the use of web data becomes more prevalent.

Fri, 15 Oct 2010 12:16:00 -0700 Is It Time For a Web Crawling Code of Conduct? (expanded) http://blog.80legs.com/is-it-time-for-a-web-crawling-code-of-conduct http://blog.80legs.com/is-it-time-for-a-web-crawling-code-of-conduct

We've got a guest post up over at ReadWriteWeb entitled "Is it Time for a Web Crawling Code of Conduct?"  The RWW post provides a summary of how web crawling can be beneficial.  Here are some more specifics on the items mentioned:

Listening to Customers Better:
We have several customers at 80legs that use our service to collect customer reviews from various shopping websites.  This data is aggregated and scored by our clients to provide services such as media monitoring and one-stop shopping portals.

How it helps individuals:  Companies need aggregate data from the web to learn what people think about their products.  Companies that can listen better can meet the needs of their customers better.  One-stop shopping portals provide individuals with an easy way to compare prices and save money on what they’re buying.

Delivering Better Ads:
An interesting use-case for web crawling is discovering and analyzing potential ad channels.  Ad networks crawl millions of web pages to find content relevant to their ad inventory.

How it helps individuals:  I’ll admit this is somewhat derivative, but I think everyone would prefer relevant ads over irrelevant ones on a web page, given that choice.  It should also be noted that web crawling by ad networks means even tiny blogs by individuals can get better ads with higher CTRs.

Building Better Data Sets:
Companies like Infochimps and Factual use web crawling to build better, more structured data sets from information scattered around the web.  This can be anything from property data to sports data.  Rather than having this data scattered around the web, it’s now centralized for easy consumption and analysis.

How it helps individuals:  Again, the benefit is not immediate, but it’s there.  You’ll see Factual datasets being used inline with the content of various websites, enhancing your information experience.  As Infochimps grows their dataset store, you’ll have a great resource for dataset searching.

These are just three examples.  At 80legs, we have dozens of customer verticals, and all of them contain customers that are building fascinating applications on top of web crawling.

Wed, 13 Oct 2010 13:13:00 -0700 The Grey Market for Data http://blog.80legs.com/the-grey-market-for-data http://blog.80legs.com/the-grey-market-for-data

Jud Valeski at O'Reilly Radar recently posted a great piece on "The black market for data".  The part we feel is really worth highlighting has to do with keeping data available:

Despite black markets and TOS violations, it's important for publishers to continue to make their data widely available. Publishers get the public benefit of being labeled as open, as opposed to proprietary. They also effectively outsource many of the hard technical challenges and business models to developers who want to build products based on their data.

It's important to realize that data has value beyond the original intention of its creators, and we should encourage the creation of that value.  We should do what we can to prevent the misuse of the data by some data consumers, but we should make sure that data consumers that are using data to create additional value are encouraged to do so.

Mon, 11 Oct 2010 14:18:00 -0700 Case Study: Discovering Ad Channels by Adify http://blog.80legs.com/case-study-discovering-ad-channels-by-adify http://blog.80legs.com/case-study-discovering-ad-channels-by-adify

Good content can be hard to find. As individuals, we spend several hours each week browsing through web pages and trying to find content that's interesting. Advertising networks know that, and they want to make sure the ads they deliver are displayed next to the best and most interesting content. To do this, they crawl the web, trying to identify websites with interesting content. The more websites they know about, the more potential ad channels they have.

Adify is one of the top advertising networks in the country, and they use 80legs to help power their web crawling and analysis of interesting web content as a component of their market mapping™ methodology. To do this, Adify has created their own custom 80legs code to process the content of a web page and determine whether or not the web page or domain provides interesting and relevant content. Over time, they have built several applications on the 80legs platform to tell them whether or not a domain fits potential advertisers' needs. With the scale and customization provided by 80legs, Adify can do this quickly, easily and cost-effectively.

Adify has crawled over 50 million targeted websites with 80legs. When coupled with Adify's proprietary insights and other industry-leading sources of analytics, these crawls help expand Adify's extensive database of websites and create a comprehensive map of potential advertising channels on the web. By mapping the Internet in this manner and creating market maps™, Adify is able to provide their customers strategic guidance on content monetization.

Tue, 28 Sep 2010 19:36:00 -0700 Case Study: Sentiment Analysis by Lingway http://blog.80legs.com/case-study-sentiment-analysis-by-lingway http://blog.80legs.com/case-study-sentiment-analysis-by-lingway

Sentiment analysis is in big demand these days. Lingway uses natural language processing (NLP) to understand how people feel about various brands. Lingway specializes in processing text data, but they rely on the specialty of 80legs to gather that data from the Web.

Here's how Lingway's workflow handles data extraction and collection:

  • Search engines are used to generate a list of URLs related to given keywords about a brand.
  • The URL list is uploaded to 80legs as a seed list, and a web crawl is started from this seed list.
  • During the web crawl, a custom data extractor (aka "80app") is used to process and clean up the text content of a web page.
  • The results generated by the 80legs web crawl are then fed into Lingway's NLP tools, which determine sentiment.
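
As a rough illustration of the clean-up step in this workflow, the Python sketch below strips markup, scripts, and styles from a page and collapses whitespace, leaving plain text that an NLP tool could consume.  It is not Lingway's extractor and it does not use the 80app framework; the class and function names are made up for this example, and it relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def clean_page(html: str) -> str:
    """Return the visible text of an HTML page as one cleaned string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```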

The 80legs API and 80app framework, along with the raw bandwidth and web crawling speed provided by 80legs, let Lingway crawl the web in a very short time for any given topic. 80legs helps Lingway with massive distributed data cleanup and enhances the performance of its own product.

Thu, 16 Sep 2010 14:36:00 -0700 New 80app: Link Mapper - Generate Sitemaps for any Website http://blog.80legs.com/new-80app-link-mapper-generate-sitemaps-for-a http://blog.80legs.com/new-80app-link-mapper-generate-sitemaps-for-a

We've just released a new 80app called Link Mapper.  Link Mapper can be used to automatically generate sitemaps for any website.

Link Mapper is available with Plus or Premium plans.  To use it, we recommend the following 80legs job settings:

  1. Seed list: the domain for which you want to generate a sitemap
  2. Outgoing links to crawl: restrict the crawl to only the domain(s) in your seed list
  3. Analysis to run: select the Link Mapper 80app

When the crawl is complete, you'll get output that looks something like this for each URL:

http://www.80legs.com
parent link: null
outgoing links:
http://www.80legs.com/_css/styles.css
http://www.80legs.com/index.html
http://www.80legs.com/_images/logo.gif
https://portal.80legs.com/portal
https://portal.80legs.com/portal/register
http://www.80legs.com/docs.html
http://www.80legs.com/contact.html
http://www.80legs.com/tour.html
http://www.80legs.com/plans.html
http://www.80legs.com/who-uses-80legs.html
http://www.80legs.com/services.html
http://www.80legs.com/about-us.html
http://blog.80legs.com
http://vimeo.com/moogaloop.swf?clip_id=11065804&server=vimeo.com&show_title=0&show_byline=0&show_portrait=0&color=&fullscreen=0&autoplay=1
http://www.80legs.com/_images/front_page_custom_crawling.gif
http://www.80legs.com/_images/front_page_crawl_package.gif
http://www.80legs.com/_images/front_page_support.gif
http://www.80legs.com/_images/front_page_view_all_pricing.gif
http://blog.80legs.com/2010/08/10/case-study-monotype-imaging
http://blog.80legs.com/2010/08/05/crawling-and-the-programmable-web
http://blog.80legs.com/2010/04/20/crawl-packages-aggregate-website-data-in-a-few-clicks
http://www.80legs.com/_images/front_page_see_whos_using.gif
http://www.leadforce1.com
http://www.leadforce1.com/bf/bf.js
http://olark.com/about

This data can then be used to generate a sitemap :)
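
For example, here is a minimal Python sketch of that last step.  It assumes the Link Mapper output has already been parsed into a dictionary mapping each crawled URL to its outgoing links; that intermediate shape, the function name, and the list of asset extensions to skip are assumptions of this sketch, not part of the 80app.  The XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed list of non-page resources to leave out of the sitemap.
ASSET_EXTENSIONS = (".css", ".js", ".gif", ".png", ".jpg", ".swf")


def build_sitemap(links_by_page: dict[str, list[str]], host: str) -> str:
    """Build sitemap XML from {page_url: [outgoing_urls]} for one host."""
    pages = set()
    for page, outgoing in links_by_page.items():
        for url in [page, *outgoing]:
            parsed = urlparse(url)
            if parsed.netloc != host:
                continue  # skip links that leave the site
            if parsed.path.lower().endswith(ASSET_EXTENSIONS):
                continue  # skip stylesheets, images, scripts
            pages.add(url)

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(pages):
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Calling `build_sitemap(links, "www.80legs.com")` on the output above would keep pages such as `http://www.80legs.com/plans.html` and drop the stylesheet, the images, and links to other hosts.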

 

Mon, 06 Sep 2010 08:00:00 -0700 Customer Discovery with 80legs http://blog.80legs.com/customer-discovery-with-80legs http://blog.80legs.com/customer-discovery-with-80legs

One of our users wrote up a quick little blog post about how to use web crawling to find customers.  Mark's company develops an A/B testing tool for websites.  To identify potential customers, he wanted to find websites that currently use his competitor's products.

In order to find this information, he set up a crawl on 80legs that flagged crawled web pages that had code from one of his competitors.  Mark's crawl was a very simple way to get some quick customer data.
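
The core of a crawl like this can be very small.  The Python sketch below flags a page when its HTML references a known script host.  Mark's post does not say which signatures he matched on, so the vendor names and URL patterns here are placeholders invented for illustration, and this is not the code he ran on 80legs.

```python
import re

# Hypothetical signatures: the vendors and script URLs are placeholders,
# not the competitors' real embed code.
COMPETITOR_SIGNATURES = {
    "ExampleTestA": re.compile(r"cdn\.example-test-a\.com/[\w./-]+\.js", re.I),
    "ExampleTestB": re.compile(r"example-test-b\.net/embed", re.I),
}


def detect_competitors(html: str) -> list[str]:
    """Return the names of vendors whose code appears in the page."""
    return sorted(name for name, pattern in COMPETITOR_SIGNATURES.items()
                  if pattern.search(html))
```

Every page with a non-empty result belongs to a site that is already paying for a comparable tool, which is what makes it a sales lead.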

To read the full post, click here.

Sun, 05 Sep 2010 08:48:00 -0700 Why a DIY Big Data Stack Is a Better Option http://blog.80legs.com/why-a-diy-big-data-stack-is-a-better-option http://blog.80legs.com/why-a-diy-big-data-stack-is-a-better-option

I've got an opinion piece up at GigaOm on the topic of why building your own big data stack can be a better option than using an "off-the-shelf" system.  Here's an excerpt from the piece:

Today, many conversations within the big data community are centered around the rise of the standard, big data stack, which includes utilities like HDFS, HBase, and other increasingly-popular applications. While settling on a standard big data stack is deeply important to the big data industry as a whole, I’m nonetheless questioning the operational and competitive consequences for companies who choose to buy into this standard without first considering the value of building their own proprietary solution.

Read the rest of the piece to find out what competitive and operational advantages (as well as disadvantages) our own stack offers us.

Tue, 17 Aug 2010 16:29:00 -0700 SXSWi Panel Picker is now open http://blog.80legs.com/2010/08/17/sxswi-panel-picker-is-now-open http://blog.80legs.com/2010/08/17/sxswi-panel-picker-is-now-open

80legs is teaming up with InfoChimps, host of last year’s inaugural Data Cluster at SXSWi. Our panel, "Data Nerds, Is Big Data Crushing the Web?", questions the future of big data and its impact on the future of tech. Here’s an excerpt from our proposal:

Web data is growing at a record pace – and data junkies will soon rule the tech world. 50 million tweets per day. 1.2 million photos served per second. 50 million websites added annually. The question is, how are we expected to build the next generation of technological innovations on top of this ever-growing Everest of data? To be honest, it can be daunting. In this panel, we’ll discuss how big data on the web changes the game for everyone. Is Hadoop good enough to manage this data explosion? Is massive web crawling dead? Is it even feasible to make such vast amounts of data open to everyone, and how do people even tap into it? Should the average Joe even care?

We’re excited to see several other panel proposals that also address the issue of making sense out of ever-more-massive amounts of data. While you’re voting us up, give these folks some thumbs as well!

Big Data for Everyone (No Data Scientists Required)

The collateral that is presently available is largely from the social media giants that tout solutions built using 10,000 node clusters that process petabytes of data a day. The reality? The average person just cannot relate or intuitively draw parallels to their own business problems. While Big Data solutions are worthwhile far before you reach petabyte scale data, just getting started can be a challenge in itself.

Data Overload: Probabilistic Computing For Breakthrough Data Analytics

With probabilistic computing, you can interpret and act on all kinds of data using statistical inference – starting with some background assumptions, you can propose possible configurations of the world that explain how that data came about. You can use probabilistic computing to trace effects back to their probable causes. For instance, what do web surfing and purchases tell us about the consumers? How can site usage patterns inform user interface design? And what are the best ways to target ads and offers at specific users?

Beautiful Data: Interactive Visualization of Social Media

Visualizing social data teaches us about people's behavior, cultural norms, relationships and much more. The panelists are interactive visualization gurus from groups who are all trying to make sense of data - Stamen Design, IBM Research, Microsoft, New York Times and Google.

Making Sense of Social Media Data

This session presents notes from the road gathered over the last 4+ years while building Scout Labs (by Lithium Technologies). It includes discovery and acquisition of data, and the amount available. We also cover the general messiness and lack of structure of the data, and challenges in building systems to analyze it.

Tue, 10 Aug 2010 22:38:00 -0700 Case Study: Monotype Imaging http://blog.80legs.com/2010/08/10/case-study-monotype-imaging http://blog.80legs.com/2010/08/10/case-study-monotype-imaging

As millions of web pages are created every day, IP protection is an ever-growing concern for content creators. While most folks associate IP protection with things like music and movies, these are not the only types of content that need to be protected.  Monotype Imaging uses IP protection services to track the usage of font types across the web.

In order to assist its IP protection services, Monotype uses 80legs to run incredibly large scans of the web. These scans crawl across tens of thousands of popular domains and identify the location of fonts on the web pages of these domains. 80legs uses a proprietary algorithm, provided by Monotype and converted to an 80app, to check these files and extract metadata from them. Using this information, Monotype can essentially run a gigantic data collection survey of how and where particular fonts are used on the web.

The web crawl run by 80legs processes 80 million URLs in about 2 days and updates its findings on a monthly basis, though it could update more frequently if necessary. This kind of powerful web crawling enables Monotype to stay up to date and gives them unsurpassed competitive and customer intelligence.

For more information on Monotype Imaging, be sure to check out their website. If you're interested in similar services from 80legs or would like to be featured in a future newsletter, please contact us.

Fri, 06 Aug 2010 20:10:00 -0700 Reddit Ad by the Numbers for B2B Services http://blog.80legs.com/2010/08/06/reddit-ad-by-the-numbers-for-b2b-services http://blog.80legs.com/2010/08/06/reddit-ad-by-the-numbers-for-b2b-services

We’ve started experimenting with various advertising strategies here at 80legs.  So far we’ve mostly focused on Google AdWords, but we’re now looking at other channels as well.

We just wrapped up our first experiment with Reddit Ads.  I recently read both Gabriel Weinberg’s and Jason Wilk’s posts on using Reddit Ads.  Here’s a summary of the results they got with their ads:

DuckDuckGo (Gabriel):
  • Duration: 13 days
  • Cost: $650 ($50 per day)
  • Impressions: 1,288,378 (282,732 uniques)
  • Clicks: 20,700 (18,420 uniques)
  • CTR: 1.61% (6.49% unique)
Whiteyboard (Jason):
  • Duration: 2 days
  • Cost: $700 ($350 per day)
  • Impressions: 299,784 (63,000 uniques)
  • Clicks: 4,226 (4,197 uniques)
  • CTR: 1.41% (6.68% unique)
Here are the numbers for 80legs (note: only ran in Technology section):
  • Duration: 6 days
  • Cost: $120 ($20 per day)
  • Impressions: 402,537 (86,174 uniques)
  • Clicks: 1,327 (1,276 uniques)
  • CTR: 0.33% (1.48% unique)
And here’s some additional data from Google Analytics:
  • Pages/Visit: 2.41
  • Avg. Time on Site: 01:11
  • Bounce Rate: 50.25%
While our numbers seem quite pitiful compared to DDG and Whiteyboard, I’m not exactly disappointed.  Both DDG and Whiteyboard are consumer products/services.  We’re a B2B service, which means our target market is much smaller in terms of # of individuals.  There are going to be far fewer people interested in web crawling than using a search engine or a whiteboard.

The most important factor is ROI.  Our ad cost $120.  I have at least 1-2 people that contacted us expressing interest in purchasing plans or custom services.  Our plus plan is just $99/month, so I’m fairly confident the ad was a “win” for us.

We also had 116 signups during the course of the Reddit ad.  A typical 6-day period will have 80 – 100 signups.  So that’s also good, but not astoundingly awesome.
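
For anyone who wants to run the same arithmetic on their own campaign, here is a short Python sketch using the 80legs figures above.  CTR is clicks divided by impressions; cost per click and the incremental-signup range are derived numbers that were not reported by Reddit, and the variable names are just for illustration.

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate as a percentage."""
    return 100.0 * clicks / impressions


# 80legs Reddit ad figures from this post.
cost = 120.00
impressions, unique_impressions = 402_537, 86_174
clicks, unique_clicks = 1_327, 1_276

overall_ctr = ctr(clicks, impressions)              # about 0.33%
unique_ctr = ctr(unique_clicks, unique_impressions)  # about 1.48%
cost_per_click = cost / clicks                       # about $0.09

# Signups during the ad versus a typical 6-day period (80 to 100).
signups = 116
baseline_low, baseline_high = 80, 100
extra_low = signups - baseline_high   # 16 extra signups at minimum
extra_high = signups - baseline_low   # 36 extra signups at most
```

At roughly nine cents per click and somewhere between 16 and 36 signups above the usual range, the ad would only need to produce a single $99/month Plus customer to come close to covering its cost.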

Overall, I would say the ad worked well for us.  We’re going to try some variations on the ad (targeting specific crawl packages, etc.) and see if that works better.

B2C ads are going to perform better than B2B ads on pretty much any ad channel.  With a positive ROI, the ad was worth the expense and at the very least warrants additional experimentation.

Thu, 05 Aug 2010 16:19:00 -0700 Crawling and The Programmable Web http://blog.80legs.com/2010/08/05/crawling-and-the-programmable-web http://blog.80legs.com/2010/08/05/crawling-and-the-programmable-web

Today, applications increasingly depend on a rich ecosystem of APIs. Thousands of different services are variously tethered together to form new software offerings and enhance existing ones. The idea of a programmable web is finally coming true.

While this is not trivial, I am nonetheless beginning to question the long-term effects of an API-centric worldview, a sort of blind faith in the almighty API, which has at best a difficult relationship with open data and big data concepts.

How do we access data today?

There are two core ways to access data today – via a publisher or via a crawler. Each has a different role.

Publishers have data and choose to make it publicly available through an API so that developers can easily design products powered by a given service.

Crawlers on the other hand are used to proactively go out and grab data by yourself -- scraping web pages for whatever it is you’re looking for, data that can then be used to build products, and inform better product and marketing decisions.

There is something of a third option: data aggregators like Factual, Infochimps, and Hoovers. I’m not going to treat them much as part of this post because they gain access to data like the rest of us – via APIs and/or crawlers. They facilitate the distribution of that data as part of their core business (most often using a marketplace concept or subscription), but the input mechanisms are no different.

And there is potentially even a fourth option – human curation of the kind that Factual and WolframAlpha and CrowdFlower employ to acquire new data altogether. But all of these providers offer API access to their data, so I’m still going to bucket them as such.

APIs, at least as we think of them today, have many disadvantages. And before you grab your shovels and organize a mob to come after me, please understand that I’m not calling for the discontinuation of APIs.

At 80legs, we ourselves offer a popular API, which takes a particularly hybrid approach – providing programmatic access to the data acquired via crawls.

What I really want is a natural stratification based on who is good at what, essentially. Right now, we’re asking APIs to do too much.

APIs are great for the real-time web, for example – they’re great for staying up to speed, whether that means trending search data or retweet velocity. APIs are great for enhancing functionality – whether that’s a Klout score or geolocation. APIs are great for integrating certain pieces of non-strategic infrastructure like invite codes (Prefinery) if you’re a startup in beta, or Freshbooks, if you’re an accountant. They’re also great for app-level integrations, like adding Facebook accounts to Tweetdeck, or sucking down content from Netflix.

But at a higher level, as all applications and services become more and more data-driven, it’s important to understand the differences between these different methods for extracting data, regardless of where you net out philosophically.

This is a debate that needs to take place.

Control, Control, Control

Control and flexibility are the two most important elements to look at when it comes to the difference between an API and a crawl. I also spend some time at the end of this post talking about security and privacy, because I think there are big impacts there for APIs and crawlers alike.

Cost might be a fourth facet to look at, but that’s grounds for a different post because pricing varies so widely.

Let’s start with control.

When using an API, publishers – companies like Amazon and LinkedIn, for example – control the entire process. Publishers provide you with an API account, which allows you a certain number of calls, or requests for data, per day. They also determine what kinds of content are made available, and in what context.

Publishers offer an API for many reasons. It’s financially in their best interest to have products built on top of their data, both to increase developer loyalty and to create a kind of dependency on their content. It’s also useful as a way to accurately measure server usage and overall engagement, even if there’s no money involved.

APIs can go down and become unavailable, they can go from free to paid, and their publishers can be acquired by larger companies that make all manner of changes. There’s a lot of uncertainty in APIs, and many devs have learned this fact the hard way. Think back to Gnip rethinking their entire business model due to the relicensing of certain APIs.

But like moths, we so often head right back to the flame.

Crawlers act very differently. They allow much more control over the data acquisition process. This has many advantages.

For starters, the format in which content is delivered can be a lifesaver if formatted properly, or prompt hours of additional work if not.

APIs supply content in one format – the format chosen by the publisher.

Say you need an XML file but the company only delivers JSON through their API. You’re either stuck or left spending hours re-formatting.

Crawlers let the choice-driven developer have his cake and eat it too. Formats are just another choice to make beforehand, instead of a hindrance.
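To give a feel for the kind of re-formatting work mentioned above, here is a minimal sketch in Python that turns a JSON response into XML using only the standard library. The payload, its field names, and the `business` root tag are all made up for illustration, and the sketch assumes every JSON key happens to be a valid XML tag name, which real API responses do not guarantee.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(value, tag):
    """Recursively turn a parsed JSON value into an XML element named `tag`."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        # Each key becomes a child element (assumes keys are valid XML names).
        for key, child in value.items():
            element.append(json_to_xml(child, key))
    elif isinstance(value, list):
        # JSON arrays have no element names, so each entry becomes <item>.
        for child in value:
            element.append(json_to_xml(child, "item"))
    else:
        element.text = "" if value is None else str(value)
    return element


# Hypothetical API response; the field names are invented for this example.
payload = '{"name": "Example Cafe", "rating": 2.5, "tags": ["brunch", "coffee"]}'
root = json_to_xml(json.loads(payload), "business")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Even this toy version has to make decisions the publisher never documented (what to call array entries, how to represent nulls), which is where the hours tend to go.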

Granted, standardization can be great in some cases – for example with sites like MySpace where each profile is customized and therefore rendered in HTML differently. MySpace APIs format the content to make it uniform, meaning that what was once difficult to work with as a developer (i.e. large discrepancies in the data), is now standardized and simple to use.

But the “one size fits all” mentality fails more often than you might think, especially once you step outside of the web’s largest sites – one size fits all rarely fits anyone well.

And it’s not just format – crawling offers much more control when it comes to time and timing, scope, and cost, too.

Flexibility and Availability

Data access choices are an important component of building any web product, especially when it comes to flexibility and availability. Specs change, needs change -- heck, markets change. Especially if you’re a lean startup, out early + iterate often is a way of life.

APIs only deliver content from the publisher’s site. You’re locked into a single interface’s content sources and structure, without flexibility by definition, which can be very limiting. You’re left with acquiring stand-alone datasets to supplement your evolving needs, or mashing up with another API to fill in holes.

Now, the very best API providers are great at adapting to developers’ needs and evolving alongside them. Companies like Yolink, for whom their API is their bread and butter, are particularly responsive. But too often an API is left unattended, having been a mere box to check, instead of a strategic commitment.

Unimaginative APIs can also limit use cases unwittingly, because some of the furthest-flung (if more promising) applications just aren’t supported in the calls or code. There’s a huge difference between an API that wants to be heightened and explored, and an API whose scope, if anything, constrains original thinking.

Crawlers on the other hand aren’t specific to any one site’s data, meaning that they can access content from any number of sources and compile it in one place, mixing and matching, comparing and contrasting to your heart’s delight.

Crawls can be more open-ended and investigative as well, whereas an API is more about putting a square peg in a square hole. APIs also don’t offer competitive advantage – everyone has access to the same stuff. A clever crawl can help build a moat.

Finally, crawlers can reach far beyond the capabilities of an API. Millions of pieces of data are publicly available on the web, and only a very small percentage of it is available via an API. At a certain point it’s purely an issue of volume. Much of the web is instantly crawl-able, and the amount of data available freely on the web is growing more quickly than the number of APIs by an order of magnitude. The caveat – you just have to know where to look.

The Elephant in the Room -- Security and Privacy

Let’s talk about privacy and data, because how the world evolves in this respect could have huge implications for APIs and crawlers alike.

As the recent Facebook data privacy concerns highlight, the security of people’s data is a high priority, regardless of how it may or may not be acquired or sold. Further, users expect publishers to protect their data aggressively (whether they do is another matter).

And this is a PR/perception issue as much as it is anything else.

Users worry that their data might get into the hands of people who will use it for malicious purposes, whether via an API or a crawler.  I would argue that this is not always the case, because responsible crawling companies, at least, have strict licensing agreements with their clients to ensure data is used lawfully.

But the reality is that public policy issues increasingly give publishers an incentive to constrain API access. And the world’s biggest crawler, Google, is starting to look evil, with the ominous question “what exactly does Google know about me?” popping up at family dinners around the country.

Some are even arguing that Facebook is bound to be federally regulated sooner or later because of its profligacy when it comes to data, and that would certainly have broad impacts. APIs are not inherently more or less secure than crawlers, but in the current climate, especially with regards to privacy, we can expect companies large and small to make less and less data open and available (something that the linked data community has been ruing as well).

Security right now is a big X factor that is going to take some time to play out.

The nice thing about crawlers (depending on your perspective) is that they are harder to control, at least for now. But it is a reasonable thing to say that data responsibility and privacy issues are going to shape and reshape this conversation big time.

Conclusion

Today’s web is full of data that, if kept within an API-driven paradigm, suffers from less creative use, less flexibility, and less control (from a developer standpoint).

An endlessly crawl-able web was in many ways what Tim Berners-Lee and the W3C intended for the web all along. Content creators like publishers and social networks can create sites as they've always done, while data aggregators can access data in whatever format they like.

In fact, in an older but still applicable interview with Berners-Lee, he talks about why an open, linked data web is by far preferable to APIs for data access.

There is a foundational, DNA-level need to share data. Without openness, you lose the full value and impede any future innovation in the process.

APIs absolutely have benefits – but only when we are not beholden to them – when we can use them rationally, strategically, and carefully. And when data isn’t at the crux of your site, service or application.

“We have an open API” is an overused phrase, especially as APIs are not by definition open or closed.

If you need certain attributes, like real-time/speed, certain capabilities, or certain pieces of infrastructure, there are thousands of amazing APIs out there. But if your business runs on data – crawling is the only way to go.

Tue, 20 Apr 2010 15:00:00 -0700 Crawl Packages: Aggregate Website Data in a Few Clicks http://blog.80legs.com/2010/04/20/crawl-packages-aggregate-website-data-in-a-few-clicks

We're excited to announce a new service at 80legs: Crawl Packages.

What crawl packages are:

Crawl packages are pre-configured crawls that you can access and run in just a few clicks.

For a specific website or group of websites, we've designed and set up an 80legs crawl, along with custom data extractors, to crawl that site and extract all the interesting information from it.  These are crawls you could have set up yourself, but we've gone ahead and done all the work for you.

Types of crawl packages available:

We're currently offering crawl packages for social networks, retail/shopping sites and business directories.  We'll be expanding our offerings to include other websites as well.  Initial plans include crawling blogs (and their comments), semantic annotation feeds of various websites, and so on.

Results & Pricing:

Most crawl packages will cost $350 per month and produce 10 - 20 million records per month.  The type of records produced depends on the crawl package.  Social network packages produce publicly-available profiles, retail packages produce product listings, etc.
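For a rough sense of what that pricing works out to per record, here is a small back-of-the-envelope calculation in Python. It uses only the $350 price and the 10 - 20 million record range quoted above; actual volumes will vary by package, so treat the result as an estimate.

```python
MONTHLY_PRICE_USD = 350                        # quoted price for most crawl packages
RECORDS_PER_MONTH = (10_000_000, 20_000_000)   # quoted range of records per month


def cost_per_million(price_usd, records):
    """Return the price in dollars for each million records produced."""
    return price_usd / (records / 1_000_000)


low_volume, high_volume = RECORDS_PER_MONTH
worst_case = cost_per_million(MONTHLY_PRICE_USD, low_volume)
best_case = cost_per_million(MONTHLY_PRICE_USD, high_volume)
print(f"${best_case:.2f} to ${worst_case:.2f} per million records")
```

In other words, somewhere between $17.50 and $35 per million records, depending on where in the quoted range a given package lands.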

Open Data:

We realize that the availability of crawl packages will raise some concerns over what data should and shouldn't be crawled.  We only crawl publicly-available Web data.  We don't crawl private data and have no interest in that.

What we are interested in is what our users can do with Web data that is more accessible.  Since our launch, we've seen many startups come to us asking for large amounts of Web data so that they can create additional value on top of that data.  They want to do interesting things like provide new insight into how people connect with one another, create CPIs of online product inventory, and more.  We want to make that possible, and crawl packages are a step in that direction.

Tue, 13 Apr 2010 01:06:00 -0700 Cake of the Month! http://blog.80legs.com/2010/04/12/cake-of-the-month

Lately I've been interested in doing some odd and quirky things around the office.  As I was thinking about what I could do about this, it struck me that one of the folks on our team, Jenn, is really awesome at baking.  So I asked her if she'd be interested in a Cake of the Month.  Each month, she'd make some funky cake that we'd all enjoy.  It's just a little thing, but something to make the work day a little more fun.  Anyway, here's what she came up with for the very first Cake of the Month!

[Two images: the first Cake of the Month]

I'd say it's a great start to a recurring tradition.

Tue, 13 Apr 2010 00:56:53 -0700 Python API Released http://blog.80legs.com/2010/04/12/python-api-released

The 80legs Python API is now available for use.  To learn how to access and use it, visit the 80legs Python API documentation.

Tue, 19 Jan 2010 16:49:00 -0800 New feature: 80app packs! http://blog.80legs.com/2010/01/19/new-feature-80app-packs

We've just deployed a new version of 80legs that adds an exciting new feature: 80app Packs!

Plus and Premium subscribers will now have access to a growing set of useful, pre-built 80apps.  The following 80apps are currently available or will be available soon:

Plus:

  • Return Page Content
  • Regex Text Matcher
  • Regex Source Matcher
  • Image Resizer

Premium:

  • All Plus 80apps
  • Social Network Scrapers
  • E-commerce Site Scrapers

80legs users will be able to select these apps and get the information they want from crawls with zero programming.  Everything will be pre-built and ready to go.  We want to make things as easy as possible for our users.

We plan to keep on adding more and more 80apps to Plus and Premium Plans.  If you have an idea for 80apps you'd like to see, just let us know!

Thu, 31 Dec 2009 21:00:00 -0800 Our predictions for 2010 http://blog.80legs.com/2009/12/31/our-predictions-for-2010

I put up a post on Silicon Angle regarding my opinions on some potential trends for 2010.  While I'm no Nostradamus, what I've posted there is based on some of the things we've been seeing through our experience working on 80legs and my own experience as I get more involved in the national tech startup culture.  Take a look and let us know what you think!

Mon, 21 Dec 2009 14:01:00 -0800 80legs Subscription Plans and Free Web-Crawling http://blog.80legs.com/2009/12/21/80legs-subscription-plans-and-free-web-crawling

We have just updated 80legs with some exciting new changes.  Starting today, the 80legs service will be divided into 3 tiers: Basic, Plus and Premium.  Since we launched, we've noticed that our customer base can be classified into 3 major groups - light, medium and heavy users.  Each plan is targeted at one of these groups and designed to fulfill its specific needs.

Here are details on each plan:

Basic Plan:

  • Free to use
  • Normal crawling speed (up to 1 request/second/domain)
  • Access to 80legs Web Portal
  • 1 job running at a time
  • Up to 100K crawled pages per job
  • Low priority in 80legs job queue
  • No recurring jobs allowed

Plus Plan:

  • $99/month + crawling fees
  • Fast crawling speed (up to 5 requests/second/domain)
  • Access to 80legs Web Portal and API
  • Up to 3 jobs running at a time
  • Up to 1M crawled pages per job
  • Normal priority in 80legs job queue
  • Recurring jobs allowed

Premium Plan:

  • $299/month + crawling fees
  • Ultra-fast crawling speed (up to 10 requests/second/domain)
  • Access to 80legs Web Portal and API
  • Up to 5 jobs running at a time
  • Up to 10M crawled pages per job
  • Preferred priority in 80legs job queue
  • Recurring jobs allowed
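To put the speed and page limits above in perspective, here is a quick sketch in Python that estimates the shortest possible duration of a maxed-out job on each plan. It assumes the worst case where every page sits on a single domain and the per-domain request cap is the only bottleneck; crawls spread across many domains can finish faster, and queue priority is ignored. The `Plan` class and function name are ours, purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    requests_per_second_per_domain: int
    max_pages_per_job: int


# Figures copied from the plan lists above.
PLANS = [
    Plan("Basic", 1, 100_000),
    Plan("Plus", 5, 1_000_000),
    Plan("Premium", 10, 10_000_000),
]


def min_hours_single_domain(plan):
    """Lower bound on job duration if every page lives on one domain."""
    seconds = plan.max_pages_per_job / plan.requests_per_second_per_domain
    return seconds / 3600


for plan in PLANS:
    print(f"{plan.name}: at least {min_hours_single_domain(plan):.1f} hours")
```

Under those assumptions, a full 100K-page Basic job against one site takes a bit over a day, while a full 10M-page Premium job against one site would run for well over a week.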

Existing users can sign up for a plan by going to the new Subscription section in the 80legs Web Portal, where there are complete details and instructions on signing up for a plan.

We're really excited about these changes.  Of course, the Basic Plan now enables completely free web-crawling, which until today has been completely unheard of.  The Plus and Premium Plans give heavier users the ability to set up and run more intensive crawls.

If any of our users have questions about the changes, please contact us or submit a ticket.  We're always happy to hear from you!
