Frequently in my work in big data and machine learning, I need to run large calculations in parallel. There are several great tools for this, including Hadoop, StarCluster, and gnu-parallel. The Ruby world lacks comparable tools, even though Ruby has had distributed computing for a long time.
Having learned Ruby while at Aardvark, but not being able to read Japanese, I decided to write my own simple tool: a Ruby DSL for distributed computing. Here I discuss the Ruby DSL design pattern and its first implementation, a distributed web crawler and scraper.
As usual, this post is motivated by a question on Quora: “How can we define our own query language?”
DSL Example
Here we show an example of the DSL pattern in action. Imagine we have a master node and several worker nodes, communicating by a queue:

We want to load a block of work onto the master, have it distribute the block to the workers, execute the block in parallel, and cache the results back to the master.
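Before getting to the DSL itself, here is a minimal sketch of that master/worker round trip, using Ruby's in-process, thread-safe `Queue` as a stand-in for a real Redis queue (the variable names and the sample work are illustrative, not part of the DSL):

```ruby
require 'json'

# Stand-in for the Redis work queue: an in-process, thread-safe queue.
work_q    = Queue.new
results_q = Queue.new

# Master: serialize a "block of work" (here just source text) and enqueue it.
source = '[1, 2, 3].map { |x| x * x }'
work_q << { 'doit_block' => source }.to_json

# Worker: dequeue the job, evaluate the shipped source, and push the
# result back toward the master.
job    = JSON.parse(work_q.pop)
result = eval(job['doit_block'])
results_q << result

puts results_q.pop.inspect  # the master sees the worker's result
```

In the real system the two queue ends live on different machines, but the flow — serialize on the master, evaluate on the worker, cache the result back — is the same.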
In other languages, like Java or Python, we would need a compiler-compiler such as ANTLR. In Ruby, we can use its magical meta-programming, plus the sourcify gem, to quickly build a DSL and execute it remotely. Let's take a look at a simple example of what our DSL should look like:
```ruby
#!/usr/bin/env ruby
$LOAD_PATH << "."
require 'dsl'
include Dsl

Dsl::load do |worker|
  worker.doit do
    p "I am doing some work in parallel on several workers"
  end
end
```
The doit &block is the basic method of our DSL. We want to invoke doit on the master node, but have it executed remotely on the worker node(s). We do this using the sourcify gem, which lets us convert the block to source (i.e. Proc#to_source) and ship the source to the worker (e.g. via a queue in Redis), where it is executed by instance_eval(source) or as a singleton method. Note: sourcify is a workaround until ruby-core officially supports Proc#to_source. It is a little buggy, so be careful.
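To see the idea without pulling in sourcify, here is a minimal sketch of the round trip: the block is kept as a plain source string (what Proc#to_source would produce), shipped as text, and rebuilt on the other side with instance_eval. The `Worker` class and its `run` method are illustrative names, not part of the DSL:

```ruby
# What sourcify's Proc#to_source would give us: the block as a string.
source = 'proc { "I am doing some work in parallel on several workers" }'

class Worker
  # Rebuild the Proc from its source and run it in this instance's context.
  def run(source_text)
    instance_eval(source_text).call
  end
end

puts Worker.new.run(source)
```

The string can travel over any transport (a Redis queue, a socket, a file); the receiving side only needs instance_eval to bring it back to life.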
Here is a complete example. The DSL is broken into three parts: Driver, Dsl::FrontEnd, and Dsl::Core. Dsl::FrontEnd plays the role of a traditional parser, and Dsl::Core is the remote interpreter. Driver and Dsl::FrontEnd are executed on the master, and Dsl::Core on the workers.
```ruby
require 'sourcify'

module Dsl
  module FrontEnd
    def self.included(base)
      base.send :include, InstanceMethods
    end

    module InstanceMethods
      def init(opts = {}, &block)
        @doit_block = nil
        yield self if block_given?
      end

      def doit(&block)
        @doit_block = block
        self
      end

      def block_sources
        blocks = {}
        blocks[:doit_block] = @doit_block.to_source
        return blocks
      end
    end
  end
end
```
Dsl::FrontEnd uses the Ruby module-mixin meta-programming pattern; when Driver includes it, the doit method is added to Driver instances. Dsl::load creates an instance of a worker in the do block, which is an instance of Driver and therefore now implements the doit method.
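The mixin hook itself, stripped to its essentials (`Greeter`, `MyDriver`, and `hello` are illustrative names, not part of the DSL):

```ruby
module Greeter
  # Ruby calls this hook when a class does `include Greeter`.
  def self.included(base)
    base.send :include, InstanceMethods
  end

  module InstanceMethods
    def hello
      "hello from #{self.class}"
    end
  end
end

class MyDriver
  include Greeter
end

puts MyDriver.new.hello  # instances now respond to hello
```

This is why Driver stays tiny: all of the DSL's vocabulary lives in the mixed-in module, and new methods can be added to the front end without touching Driver at all.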
Using this pattern, we can create a super simple, extensible DSL that can support a wide range of methods to run on the worker nodes.
```ruby
#!/usr/bin/env ruby
#
# Copyright (c) 2013 Charles H Martin, PhD
#
# Calculated Content
# http://calculatedcontent.com
# charles@calculatedcontent.com
#
require 'dsl/front_end'
require 'json'

module Dsl
  def Dsl.load(opts = {}, &block)
    Driver.load(opts, &block)
  end

  class Driver
    include FrontEnd

    def initialize(opts = {}, &block)
      init(opts) #_dsl_front_end
      yield self if block_given?
    end

    def load
      # store block_sources.to_json in redis
    end

    def self.load(opts = {}, &block)
      self.new(opts) do |core|
        yield core if block_given?
        core.load
      end
    end
  end
end
```
When doit is invoked, the &block is converted to source and stored in the cache (Redis). When the worker runs, it grabs the block source (out of Redis) and calls perform(source_text), which evaluates the code block in the context of the worker instance using instance_eval. (Note that in a production system, we may want to compress the source to save memory.)
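Putting the two halves together, here is a condensed sketch of the round trip, with sourcify's output faked as a literal string and `FakeWorker` standing in for a class that extends Dsl::Core:

```ruby
require 'json'

# Master side: what Driver#load would store in Redis.
blocks  = { :doit_block => 'proc { p "running on the worker" }' }
payload = blocks.to_json

# Worker side: what Dsl::Core's perform does with the payload.
class FakeWorker
  def self.perform(source_text)
    @doit_block = JSON.parse(source_text)['doit_block']
    # instance_eval rebuilds the Proc from source; call runs it here.
    instance_eval(@doit_block).call
  end
end

FakeWorker.perform(payload)
```

The JSON layer matters: it keeps the payload a plain string, so any cache or queue that stores strings (Redis, for instance) can carry the work.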
```ruby
require 'json'

module Dsl
  module Core
    def self.included(base)
      base.send :extend, ClassMethods
      # base.send :extend, InstanceMethods
    end

    module ClassMethods
      def perform(source_text)
        @doit_block = JSON.parse(source_text)['doit_block']
        doit
      end

      def doit
        instance_eval(@doit_block).call
      end
    end
  end
end
```
Also, to be a bit more efficient, we could run the worker blocks in batch mode and evaluate the DSL block as a singleton method:
```ruby
require 'json'

module Dsl
  module FastCore
    def self.included(base)
      base.send :extend, ClassMethods
    end

    module ClassMethods
      def init(source_text)
        block = JSON.parse(source_text)['doit_block']
        # eval rebuilds the Proc from source so it can be bound once,
        # up front, as a singleton method
        define_singleton_method(:doit, eval(block))
      end

      def perform(source_text)
        doit
      end
    end
  end
end
```
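A quick check of the singleton-method approach, with `BatchWorker` as an illustrative stand-in: eval turns the shipped source back into a Proc once, and define_singleton_method binds it as doit, so repeated calls skip re-parsing the source.

```ruby
require 'json'

class BatchWorker
  def init(source_text)
    block = JSON.parse(source_text)['doit_block']
    # Bind the rebuilt Proc as a real method on this instance.
    define_singleton_method(:doit, eval(block))
  end

  def perform
    doit
  end
end

w = BatchWorker.new
w.init({ 'doit_block' => 'proc { 6 * 7 }' }.to_json)
puts w.perform  # the bound singleton method can be called repeatedly
```

This is the batch-mode win: in the instance_eval version every perform re-evaluates the source string, while here the eval cost is paid once in init.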
Also, notice that because the DSL block is executed on the worker, it does not carry its local context from the master. Variables defined in the do block are not visible inside the doit worker method:
```ruby
Dsl::load do |worker|
  mvar = "I am a variable on the master"
  worker.doit do
    p "the worker can not see mvar"
  end
end
```
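A sketch of what actually happens: since only the source text travels, a reference to mvar raises a NameError when the block is evaluated on the worker.

```ruby
# On the master this sat next to: mvar = "I am a variable on the master".
# Only the source text below is shipped to the worker; mvar does not travel.
source = 'proc { mvar }'

begin
  Object.new.instance_eval(source).call
rescue NameError => e
  puts "worker raised: #{e.class}: #{e.message}"
end
```

Any data the block needs has to be shipped explicitly, for example by serializing it into the payload alongside the block source.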
And there we have it: a simple, lightweight way to create a Ruby DSL for distributed computing.
If you would like to see the design pattern in action, you can clone and run the open source Cloud-Crawler.