0.9 released! 10M page crawls, API, easy-to-use interface and more!

We just pushed out version 0.9, which is a big, big update to the system.  This release includes several upgrades to our back-end architecture (allowing larger jobs), a Java API (allowing programmatic access), an easy-to-use job form (allowing easier access), and a bunch of other cool things!

Here's a list of the specific features:

  • Large crawls are now supported.  Crawl up to 10 million pages per job!
  • The API is officially released.  Submit jobs, download results and much more using Java.
  • A much easier-to-use job form.  We realized the old job form was a bit clunky.  The new one is much easier to understand.
  • To go along with the new job form, we've updated the entire portal to be easier to navigate and use.
  • You can now load external JARs into your 80Apps.  This lets developers use third-party code more easily.
  • Several improvements to the crawler, including:
    • Options to select your type of crawl.  Choose among fast, comprehensive, and breadth-first.
    • Crawler now crawls https:// pages.
    • Crawler tries to fetch a page more than once before giving up.
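
To make the retry behavior concrete, here is a minimal sketch of a "fetch more than once before giving up" loop. It is not the crawler's actual code: the class name, the `fetchWithRetries` method, the attempt limit, and the idea of passing the fetch step in as a function are all assumptions made for illustration.

```java
import java.util.Optional;
import java.util.function.Function;

public class RetryingFetcher {

    /**
     * Calls the fetcher up to maxAttempts times and returns the first
     * successful result. The fetcher signals failure by throwing; an empty
     * Optional means every attempt failed. (Hypothetical helper, not an
     * 80legs API.)
     */
    public static Optional<String> fetchWithRetries(
            String url, int maxAttempts, Function<String, String> fetcher) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Optional.of(fetcher.apply(url));
            } catch (RuntimeException e) {
                // This attempt failed; fall through and try again.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated fetcher that fails twice before succeeding.
        Optional<String> body = fetchWithRetries("https://example.com/", 3, u -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated failure");
            }
            return "<html>ok</html>";
        });
        System.out.println(body.orElse("gave up") + " after " + calls[0] + " attempts");
    }
}
```

A real crawler would probably also wait between attempts and only retry on transient errors; the sketch leaves both out to keep the loop itself visible.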

Since we just released 0.9, I suppose that technically makes us 0.1 from a beta exit!  Some of the upcoming features are:

  • Finalizing the payment system in preparation for beta exit and charging actual money.
  • Providing useful default 80Apps for all users (this is also in preparation for the app store model we'll be pursuing).

See full release log details at http://80legs.pbworks.com/Release-Log.

Released 0.83 - performance improvements and large seed lists

We pushed out 0.83 today.  This release was mostly done to push out some improvements in our crawling and back-end data store, which should help the overall performance of 80legs. We also took the opportunity to push out some new functionality, including allowing users to upload very large seed lists (up to 1 GB!).  To upload these seed lists, you'll need to go to the new "Seed Lists" section in the portal.  The interface is still a bit on the "raw" side, so let us know if you encounter any problems. You can see the full list of changes at http://80legs.pbworks.com/Release-Log#Release0838July2009.

Released 0.82 - the improvements keep coming!

We've just pushed out 0.82.  Improvements and changes include:
  • Smarter URL selection for larger crawls
  • Sandbox jobs run automatically and the user gets access to stdout from their 80App
  • Domain throttling information in the portal
  • Time estimates shown in the portal
  • Additions to the crawled result files:
    • page size
    • parse time in milliseconds
    • process time in milliseconds
    • compute timeouts get COMPUTE_TIMEOUT_GOOD or COMPUTE_TIMEOUT_BAD
  • Several improvements for large job performance
  • Users can supply data with the JAR upload, which gets passed into initialize() during the validation test
  • Fixed problem with multiple Loading Code errors
  • Improved default link parsing
  • Better web portal login behavior
As usual, we've started working on the next release already, which will have things like:
  • Allowing larger crawls
  • Allowing larger seed lists
  • Creating result files on the fly
Check out http://80legs.pbworks.com/Release-Log for all the details!

You can now run custom code on 80legs - version 0.8 released!

We're very excited to announce that you can now run custom code on 80legs.  We have just released version 0.8, which gives users the ability to write their own content analysis logic using processDocument() and their own link extraction logic using parseLinks().  For more information on how to write and run code on 80legs, please visit http://80legs.pbworks.com/Custom-Code. The full list of changes in this release includes:
  • Custom code initial release (first IWebAnalysisConnector release with parseLinks() and processDocument())
  • Option to analyze specific MIME types
  • Option to preserve query strings when crawling
  • Resulting crawl list shows status codes and other reasons for failing to crawl (e.g. robots.txt, DNS, etc.)
  • Better handling of failed URLs
  • Sandbox server for testing custom code on your own machine using the 80legs framework.
  • Stop problem jobs automatically
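
To give a feel for the two hooks named above, here is a rough sketch of what a small piece of custom code might look like. Only the method names parseLinks() and processDocument() come from this release; the `SketchConnector` interface, its parameter and return types, and the `KeywordCounter` class are hypothetical stand-ins, and the real IWebAnalysisConnector signatures may well differ. See the Custom-Code page for the actual interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordCounterApp {

    // Hypothetical stand-in for the 80legs connector interface.
    // The real IWebAnalysisConnector signatures may differ.
    public interface SketchConnector {
        List<String> parseLinks(String html);
        String processDocument(String url, String html);
    }

    public static class KeywordCounter implements SketchConnector {
        // Matches double-quoted href attribute values, ignoring case.
        private static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

        private final String keyword;

        public KeywordCounter(String keyword) {
            if (keyword == null || keyword.isEmpty()) {
                throw new IllegalArgumentException("keyword must not be empty");
            }
            this.keyword = keyword.toLowerCase();
        }

        // Link extraction: return every href value found on the page, in order.
        @Override
        public List<String> parseLinks(String html) {
            List<String> links = new ArrayList<>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }

        // Content analysis: count case-insensitive occurrences of the keyword
        // and return a tab-separated "url, count" record.
        @Override
        public String processDocument(String url, String html) {
            String text = html.toLowerCase();
            int count = 0;
            int i = text.indexOf(keyword);
            while (i >= 0) {
                count++;
                i = text.indexOf(keyword, i + keyword.length());
            }
            return url + "\t" + count;
        }
    }
}
```

The regex-based link extraction is only good enough for a demo; it misses single-quoted and unquoted attributes, which is one reason you might want to write your own parseLinks() in the first place.
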
We've also granted access to several more users on our private beta list.  If you haven't received access yet, but would really like to get access soon, please let us know, and we'll try to include you in the next set of beta users. We're already working on the new features, such as:
  • A web service for programmatically submitting and managing jobs
  • An "app store" that will allow users to run pre-built applications developed by trusted third parties
  • Our payment system, which will be released first as a "demo", allowing users to get used to the system before actually requiring payment

New Beta Release 0.76

Please see the website for the complete list of features and improvements (http://80legs.com/using.html#releases). We have bumped up the maximum number of pages to crawl in a single crawl to 1,000,000 (still free for our current beta users). For a very broad crawl, you should expect 1M pages to take about 10-20 minutes. If your crawl is restricted or is not very broad, it can take much longer than that because of the way we throttle ourselves to prevent hitting single domains and servers too hard. We are expecting this to be our last release before we push the first beta version containing our processDocument() functionality in 0.8.
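
To show why a narrow crawl takes longer than a broad one, here is a minimal sketch of per-domain throttling. The actual 80legs throttling policy is not described in this post, so the `DomainThrottle` class, the `reserve` method, and the fixed one-second delay in the example are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class DomainThrottle {
    private final long minDelayMillis;
    // Earliest time (in ms) at which each host may be fetched again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    public DomainThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Returns the time (in ms) at which the URL may be fetched and reserves
     * that slot, pushing the host's next slot back by the minimum delay.
     */
    public long reserve(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        long start = Math.max(nowMillis, nextAllowed.getOrDefault(host, 0L));
        nextAllowed.put(host, start + minDelayMillis);
        return start;
    }

    public static void main(String[] args) {
        DomainThrottle throttle = new DomainThrottle(1000); // assumed 1 s delay
        // Three pages on one host are spaced one second apart...
        System.out.println(throttle.reserve("http://a.example/1", 0));
        System.out.println(throttle.reserve("http://a.example/2", 0));
        System.out.println(throttle.reserve("http://a.example/3", 0));
        // ...while a page on a different host can start immediately.
        System.out.println(throttle.reserve("http://b.example/1", 0));
    }
}
```

With a scheme like this, pages on many different hosts can be fetched in parallel, but pages on a single host queue up behind each other, so total crawl time grows with the number of pages per domain rather than the number of pages overall.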

Release Schedule

A lot of you may be wondering what the release schedule for 80legs looks like.  We thought it would be best to put up a post detailing it as best we could.  This schedule is subject to change, but we plan on moving as quickly as possible through it.  Also, please note that because 80legs is a web-scale platform, it can be difficult to predict potential technical challenges when it comes to scaling and other issues.  Because of this, there may be times during the beta that the platform is down.  Any downtime just means we are working as furiously as possible to improve the back-end.
  • Week of March 30th: Launch the beta.  The initial release will allow crawling and content matching.  All interaction will be done through a web portal.  During the first 1-2 weeks of the beta, 80legs will be completely free to use, but the maximum size of crawls will be limited while we work out any bugs we didn't catch during internal testing.
  • Mid-April: Begin charging for use of 80legs.  Substantially increase the limit on crawl size, eventually removing any limitations.  Customers will be charged $2.00 per million pages crawled (MPC) and $0.03 per CPU-hr used for analysis.  Crawling will most likely be charged first, while charges for analysis will be introduced shortly afterward.
  • Late April/Early May: Allow custom analysis via the processPage() function (as described on the website).
  • May: Begin implementing additional functionality for analysis, such as posting data to web pages, accessing flash movies, etc.
As the schedule changes, we'll put up new posts showing the revised timeline.  Hope this helps!
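
For a rough sense of what the planned mid-April pricing means for a job, here is a small estimate based on the two rates quoted above ($2.00 per million pages crawled and $0.03 per CPU-hr). It assumes charges are prorated linearly with no minimums or rounding, which the schedule does not specify; the class and method names are made up for the example.

```java
public class CrawlCostEstimate {
    // Rates quoted in the release schedule.
    static final double DOLLARS_PER_MILLION_PAGES = 2.00;
    static final double DOLLARS_PER_CPU_HOUR = 0.03;

    /**
     * Estimated charge in dollars, assuming simple linear proration
     * (an assumption; the actual billing rules are not stated).
     */
    public static double estimate(long pagesCrawled, double cpuHours) {
        double crawlCost = pagesCrawled / 1_000_000.0 * DOLLARS_PER_MILLION_PAGES;
        double analysisCost = cpuHours * DOLLARS_PER_CPU_HOUR;
        return crawlCost + analysisCost;
    }

    public static void main(String[] args) {
        // 5 million pages plus 100 CPU-hours of analysis:
        // 5 * $2.00 + 100 * $0.03 = $13.00
        System.out.printf("$%.2f%n", estimate(5_000_000L, 100.0));
    }
}
```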