cloud-crawler-0.1
For the past few weeks, I have taken some time off from pure math to work on an open source platform for crawling the web. I am happy to announce version 0.1 of the cloud-crawler open source project.
The cloud-crawler is a distributed Ruby DSL for crawling the web using Amazon EC2 micro instances. The goal is to create an end-to-end framework for crawling the web, eventually including the ability to crawl even dynamic JavaScript, and to do so from a pool of spot instances.
This initial version is built using Qless, a Redis-based queue; a Redis-based Bloom filter; and a re-implementation and extension of the anemone DSL. It also includes Chef recipes for spooling up nodes on the Amazon cloud, and a Sinatra app, cloud-monitor, to monitor the queue. The basic layout is shown below, from the Slideshare presentation.
Here, we show an example crawl, which finds links to all Level 1 Certs on the Crossfit main page:
urls = ["http://www.crossfit.com"]

CloudCrawler::crawl(urls, opts) do |cc|
  cc.focus_crawl do |page|
    page.links.keep_if do |lnk|
      text_for(lnk) =~ /Level 1/i
    end
  end

  cc.on_every_page do |page|
    puts page.url.to_s
  end
end
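The focus_crawl block selects which of each page's links get followed, here keeping only links whose anchor text matches /Level 1/i, while the on_every_page block runs for every crawled page, here simply printing its URL. (opts is a hash of crawl options; its keys are covered by the in-progress documentation.)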
This is very much a pre-release, and we are actively looking for contributors interested in getting involved. (Also, the web documentation is still in progress.)
Rather than go into details here, we show how to install the crawler and get a test crawl up and running.
To install on a local machine
(i.e. Mac or Linux; Ruby does not play well with Windows)
I. Dependencies
Ruby 1.9.3 with Bundler http://gembundler.com
Redis 2.6.x (stable) http://redis.io/download
we suggest using RVM to install Ruby: https://rvm.io
and Git to obtain the source: http://git-scm.com
II. Installation Steps
II.0 install Ruby 1.9.3 and Redis 2.6.x
II.1 install bundler
gem install bundler
II.2 clone the git source
git clone git://github.com/CalculatedContent/cloud-crawler.git
II.3 install the required gems and sources
change directories to where the Gemfile.lock file is located
cd cloud-crawler/cloud-crawler
install the required gems and sources, and build the gem
bundle install
to create a complete sandbox, you can say
bundle install --path vendor/bundle
this will install the cloud_crawler gem in a local bundle repository
we use Bundler locally because we also use it on Amazon AWS / EC2 machines
III. Testing the Install
III.1 start the redis server
redis-server &
III.2 run rake
bundle exec rake
III.3 run a test crawl
bundle exec ./test/test_crawl.rb
IV. Try a real crawl using the DSL
flush the redis database
redis-cli flushdb
load the first job into redis
bundle exec ./examples/crossfit_crawl.rb
run the worker job
bundle exec ./bin/run_worker.rb -n crossfit-crawl
V. To view the queue monitor in a browser
bundle exec qless-web
this should launch a tab in your web browser. If it fails, the monitor may still be running, and may be visible in your browser at
localhost:5678
and that's it: you have a DSL for crawling running locally.
VI. To run the crawler on AWS and EC2, you will need to set up an Amazon account, install chef-solo, and create some security groups and S3 buckets.
Stay tuned for extended documentation and examples, including seeing the crawler in action on EC2. Feel free to email me with questions or to express interest in getting involved.
Hi Charles, how is this different than anemone? Thanks!
The framework is designed to run anemone (or eventually any DSL, such as Capybara)
on worker nodes in the Amazon cloud, fed from a master queue in batch jobs,
and to save the results to S3 automatically. I can also provide
Chef scripts to auto-boot the master and worker nodes.
The semantics of the anemone DSL here are slightly different:
it does not crawl any links by default;
it lets you select links by the text in the href tag instead of just by a regex;
and it provides access to a local cache that lets you, say, maintain a counter (see the sketch below).
Also, remember that because the DSL code is running on a worker node, it does
not ‘pick up’ the local Ruby context the way it would if it were running locally.
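As a minimal sketch of both points, here is a counter kept in the local cache; the cc.cache accessor and its hash-like interface are assumptions for illustration, since the real API may differ:

urls = ["http://www.crossfit.com"]

CloudCrawler::crawl(urls, opts) do |cc|
  # select links by their anchor text, not just by a regex on the URL
  cc.focus_crawl do |page|
    page.links.keep_if { |lnk| text_for(lnk) =~ /Level 1/i }
  end

  cc.on_every_page do |page|
    # cc.cache is a hypothetical accessor for the local cache
    cc.cache[:pages_seen] = (cc.cache[:pages_seen] || 0) + 1
    # note: this block runs on the worker, so it cannot read locals
    # from the process that enqueued the job; keep state in the cache
    puts "#{cc.cache[:pages_seen]}: #{page.url}"
  end
end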
feel free to ping me here or even on Skype if you want to try it out