cloud-crawler: an open source Ruby DSL and distributed processing framework for crawling the web using AWS

cloud-crawler-0.1

For the past few weeks, I have taken some time off from pure math to work on an open source platform for crawling the web. I am happy to announce version 0.1 of the cloud-crawler open source project.

The cloud-crawler is a distributed Ruby DSL for crawling the web using Amazon EC2 micro instances. The goal is to create an end-to-end framework for crawling the web, eventually including the ability to crawl even dynamic JavaScript pages, and to run from a pool of spot instances.

This initial version is built using Qless, a Redis-based queue; a Redis-based Bloom filter; and a re-implementation and extension of the Anemone DSL. It also includes Chef recipes for spooling up nodes on the Amazon cloud, and a Sinatra app, cloud-monitor, to monitor the queue. The basic layout, taken from the SlideShare presentation, is shown below:

cloud-crawler architecture
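
As a rough illustration of the queueing layer, the sketch below shows the bare Qless mechanics the framework builds on: a master process puts crawl jobs onto a Redis-backed queue, and worker nodes pop them off and perform them. The queue name and job class here are illustrative assumptions, not cloud-crawler's actual internals.

  require 'qless'

  # connect to the local Redis instance backing the queue
  client = Qless::Client.new
  queue  = client.queues["crawl-jobs"]    # hypothetical queue name

  # worker-side job class: Qless invokes self.perform(job) on a worker node
  class CrawlJob
    def self.perform(job)
      job.data['urls'].each { |url| puts "would crawl #{url}" }
    end
  end

  # master-side: enqueue a batch of urls for the workers to pick up
  queue.put(CrawlJob, { 'urls' => ['http://www.crossfit.com'] })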

Here we show an example crawl, which finds links to all Level 1 Certs on the CrossFit main page:

urls = ["http://www.crossfit.com"]
CloudCrawler::crawl(urls, opts) do |cc|
  cc.focus_crawl do |page|
    page.links.keep_if do |lnk|
      text_for(lnk) =~ /Level 1/i
    end
  end

  cc.on_every_page do |page|
    puts page.url.to_s
  end
end
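
The opts hash in the example above is left undefined. Since the DSL re-implements and extends Anemone, the snippet below is a rough, Anemone-style sketch of what such an options hash might look like; the exact keys cloud-crawler supports are an assumption here, so check the source for the real names.

  # illustrative, Anemone-style options; the actual cloud-crawler keys may differ
  opts = {
    :depth_limit     => 2,                    # how many links deep to follow
    :obey_robots_txt => true,                 # be polite to the sites we crawl
    :user_agent      => "cloud-crawler/0.1"   # identify the crawler to servers
  }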

This is a very early pre-release, and we are actively looking for contributors interested in getting involved. (Also, the web documentation is still in progress.)

Rather than go into details, here we show how to install the crawler and get a test crawl up and running.  

To install on a local machine

(i.e. Mac or Linux; Ruby does not play well with Windows)

I. Dependencies

Ruby 1.9.3 with Bundler   http://gembundler.com

Redis 2.6.x  (stable)     http://redis.io/download

It is suggested to use RVM to install Ruby: https://rvm.io

and to use Git to obtain the source: http://git-scm.com

II.  Installation Steps

II.0  install ruby 1.9.3 and redis 2.6.x
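
for example, with RVM and Homebrew already set up on a Mac, something like the following should work (adjust for your platform and package manager; these exact commands are an assumption, and the downloads linked above work just as well)

  rvm install 1.9.3
  rvm use 1.9.3
  brew install redis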

II.1  install bundler

  gem install bundler

II.2 clone the git source

  git clone git://github.com/CalculatedContent/cloud-crawler.git

II.3  install the required gems and sources

change to the directory where the Gemfile.lock file is located

  cd cloud-crawler/cloud-crawler

then install the gems and required sources, and build the gem

   bundle install

to create a complete sandbox, you can say

  bundle install --path vendor/bundle

this will install the cloud-crawler gem and its dependencies in a local bundle gem repository

we use Bundler locally because we also use it on Amazon AWS / EC2 machines

III. Testing the Install

III.1  start the redis server

  redis-server &

III.2  run rake

  bundle exec rake

III.3  run a test crawl

  bundle exec ./test/test_crawl.rb

IV.  Try a real crawl using the DSL

flush the redis database

  redis-cli flushdb

load the first job into redis

  bundle exec ./examples/crossfit_crawl.rb
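
The crossfit_crawl.rb example is essentially the DSL snippet from above wrapped in an executable script that enqueues the first job. A minimal sketch of such a script might look like the following, where the require path and the :name option are assumptions made to match the -n crossfit-crawl flag used in the next step:

  #!/usr/bin/env ruby
  require 'cloud-crawler'   # assumed require path

  urls = ["http://www.crossfit.com"]

  # :name is an illustrative option tying this crawl to the worker's -n flag
  CloudCrawler::crawl(urls, :name => "crossfit-crawl") do |cc|
    cc.focus_crawl do |page|
      page.links.keep_if { |lnk| text_for(lnk) =~ /Level 1/i }
    end

    cc.on_every_page { |page| puts page.url.to_s }
  end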

run the worker job

  bundle exec ./bin/run_worker.rb -n crossfit-crawl

V.  To view the queue monitor in a browser

   bundle exec qless-web

this should launch a tab in your web browser; if it fails, the monitor may still be running and may be visible in your browser at

   localhost:5678

And that's it: you have a DSL for crawling running locally.

VI.  To run the crawler on AWS / EC2, you will need to set up an Amazon account, install chef-solo, and create some security groups and S3 buckets.

Stay tuned for extended documentation and examples, including seeing the crawler in action on EC2. Feel free to email me to ask questions or to express interest in getting involved.

3 Comments

    1. The framework is designed to run Anemone (or eventually any DSL, such as Capybara)
       on worker nodes in the Amazon cloud from a master queue, in batch jobs,
       and to save the results to S3 automatically. I can also provide
       Chef scripts to auto-boot the master and worker nodes.

       The semantics of the Anemone DSL here are slightly different:

       it does not crawl any links;
       it also lets you select links by the link text instead of just by a regex;
       it provides access to a local cache that lets you, say, maintain a counter.

       Also, remember that because the DSL code is running on a worker node, the DSL does
       not 'pick up' the Ruby context the way it would if it were running locally.

       Feel free to ping me here or even on Skype if you want to try it out.

