Protocol Buffers with Riak for Node.js

I've been playing around with Riak a bit lately. It's a simple key/value store with S3-style buckets and one-way links between keys. It also has clustering built in, and lets you run map/reduce against a set of data pretty easily. All this, over a simple HTTP API.

The HTTP API is a great way to start playing with Riak, but I found it to be pretty slow. With Riak, there are two more options: use the Erlang client, or write a Protocol Buffer adapter. I'd never done anything with Protocol Buffers, so I figured this was a good opportunity.

Riak PBC Client

Armed with Node.js Protocol Buffer serializing and parsing abilities, I took a look at the Riak PBC API. The wire format is very simple:

00 00 00 07 09 0A 01 62 12 01 6B
|----Len---|MC|----Message-----|

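Each message starts with 4 bytes for the message length, a single byte for the message code, and then the message. The length counts the message code plus the message body, so the 6-byte body above gives a length of 7.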

The example above is how a simple request for a key might look.

// the Protocol Buffer schema.
message RpbGetReq {
    required bytes bucket = 1;
    required bytes key = 2;
    optional uint32 r = 3;
}

A Riak request looks something like this:

fs     = require 'fs'
Schema = require('protobuf_for_node').Schema
schema = new Schema(fs.readFileSync('./riak.desc'))
GetReq = schema["RpbGetReq"]

# <Buffer 0a 01 62 12 01 6b>
data = GetReq.serialize bucket: 'b', key: 'k'
len  = data.length + 1 # account for riak code too
req  = new Buffer(len + 4) # 4 byte message length
req[0] = len >>>  24
req[1] = len >>>  16
req[2] = len >>>   8
req[3] = len &   255
req[4] = 9
data.copy req, 5, 0 # copy serialized data to the buffer

# req is now
# <Buffer 00 00 00 07 09 0a 01 62 12 01 6b>

That assembles the message. Now, we just create a tcp connection to send it to Riak:

net  = require 'net'
conn = net.createConnection 8087, '127.0.0.1'

conn.on 'connect', ->
  conn.write req

Finally, something needs to listen for the data event for a response. This handler assumes the whole response arrives in a single chunk:

conn.on 'data', (chunk) ->
  len = (chunk[0] << 24) + 
        (chunk[1] << 16) +
        (chunk[2] <<  8) +
         chunk[3]  -  1 # subtract 1 for the message code
  type = lookup_type_from_code(chunk[4])
  msg  = new Buffer(len)
  chunk.copy msg, 0, 5
  data = type.parse msg
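
TCP does not promise that one data event carries exactly one message, so a sturdier reader buffers chunks until a full frame is available. Here is a minimal sketch of that idea in plain JavaScript. The names createFrameParser and encodeFrame are made up for illustration and are not part of riak-js or protobuf_for_node.

```javascript
// Hypothetical helper: collects TCP chunks and calls onFrame(code, body)
// once for every complete frame.
function createFrameParser(onFrame) {
  let pending = Buffer.alloc(0);
  return function feed(chunk) {
    pending = Buffer.concat([pending, chunk]);
    // A frame is 4 length bytes followed by `length` bytes (code + body).
    while (pending.length >= 4) {
      const length = pending.readUInt32BE(0);
      if (length < 1) throw new Error('frame is missing its message code');
      if (pending.length < 4 + length) break; // wait for more data
      const code = pending[4];
      const body = pending.subarray(5, 4 + length);
      pending = pending.subarray(4 + length);
      onFrame(code, body);
    }
  };
}

// Hypothetical helper: builds one frame from a message code and a body.
function encodeFrame(code, body) {
  const frame = Buffer.alloc(5 + body.length);
  frame.writeUInt32BE(body.length + 1, 0); // length includes the code byte
  frame[4] = code;
  body.copy(frame, 5);
  return frame;
}
```

The parser would be hooked up with something like conn.on('data', feed), and the body handed to whatever type matches the message code.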

Pooling Connections

My initial example started off pretty basic, but it quickly grew out of control. Since each socket carries one request/response conversation at a time, I needed to implement a connection pool so a Node.js process could have simultaneous conversations with Riak. A basic example looks like this:

riak = new (require './protobuf')()
server = http.createServer (request, response) ->
  # get a fresh connection off the pool
  riak.start (conn) ->
    # send a request, call the given callback when the response returns.
    conn.send('PingReq') (data) ->
      response.writeHead 200, 'Content-Type': 'text/plain'
      response.end sys.inspect(data)
      conn.finish() # release the connection back to the pool

# SHORTCUT
server = http.createServer (request, response) ->
  # automatically gets a fresh connection, sends a request, and releases
  # it back to the pool when done.
  riak.send('PingReq') (data) ->
    response.writeHead 200, 'Content-Type': 'text/plain'
    response.end sys.inspect(data)
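
For the curious, the pooling idea itself fits in a few lines. This is only a sketch in plain JavaScript, not the code from my fork: the Pool class, its create factory, and the max limit are assumptions for illustration, with start and finish named to mirror the example above.

```javascript
// Hypothetical pool: `create` is any function that returns a new connection.
class Pool {
  constructor(create, max) {
    this.create = create;
    this.max = max;
    this.idle = [];    // connections ready for reuse
    this.waiting = []; // callbacks waiting for a connection
    this.size = 0;     // connections created so far
  }

  // Hands a connection to `callback`, either now or once one is released.
  start(callback) {
    if (this.idle.length > 0) {
      callback(this.idle.pop());
    } else if (this.size < this.max) {
      this.size += 1;
      callback(this.create());
    } else {
      this.waiting.push(callback);
    }
  }

  // Releases a connection; the oldest waiter, if any, gets it next.
  finish(conn) {
    const next = this.waiting.shift();
    if (next) {
      next(conn);
    } else {
      this.idle.push(conn);
    }
  }
}
```

Capping the pool keeps a burst of HTTP requests from opening an unbounded number of sockets to Riak; extra callers just wait their turn.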

nori + riak-js

Right now, this isn't in any released version of nori or riak-js. The rough Protocol Buffers client is available in the coffee branch of my riak-js fork.

When Frank released the sweet Riak-JS site, I took a hard look at what purpose nori was serving:

  • I wanted to learn more about Riak (accomplished).
  • I wanted to experiment with a new API style (very similar to Riak-js)
  • I wanted a higher level Riak lib, more like an ORM.

The goals aligned pretty closely with riak-js, so there seemed no good reason to double our efforts. I've decided to discontinue nori for the time being, and focus my Riak efforts on a refactoring of riak-js. We want to have a single lib that lets you access Riak from jQuery (maybe), as well as Node.js over the HTTP and PBC APIs.

So, what is the current progress of all this? Here are some quick benchmarks from my iMac i7:

# riak-js http API 
# ab -n 5000 -c 20 
# 734.31 req/sec
sys  = require 'sys'
http = require 'http'
db   = require('riak-js').getClient()

server = http.createServer (req, resp) ->
  db.get('airlines', 'KLM') (flight, meta) ->
    resp.writeHead 200, 'Content-Type': 'text/plain'
    resp.end sys.inspect(flight)

server.listen 8124

# riak-js PBC API
# ab -n 5000 -c 20
# 1682.01 req/sec
sys  = require 'sys'
http = require 'http'
riak = new (require './protobuf')()

server = http.createServer (req, resp) ->
  riak.send('GetReq', bucket: 'airlines', key: 'KLM') (flight) ->
    resp.writeHead 200, 'Content-Type': 'text/plain'
    resp.end sys.inspect(flight)

server.listen 8124

That's over a 2x speedup, not bad.

posted 2010 Aug 03

In-Process Node.js Queues

Node.js is great at handling lots of asynchronous connections, but sometimes I'd like to limit how many are in use. One real world example is some kind of spider or feed reader. If you have a list of 500 addresses to fetch, you don't want to fetch them all at once. Maybe they're all on one server, or the requests return large files that need some post processing.

A simple queue like Resque is great for this, but I wanted something even simpler. Something that lived in the Node.js process, and could exit cleanly without any of that persistent mess left over.

Chain Gang is the result of my experimentation. My idea is to use the Node.js event system for pub/sub.

First, I specify my unit of work. In this case, I'm fetching a web address, and calling worker.finish() after that's done.

sys    = require 'sys'
http   = require 'http'
client = http.createClient 8080, 'localhost'
# start an active chain gang queue with 3 workers by default.
chain  = require('chain-gang').create()

# downloads a web page, runs the callback when it's done.
get_path = (path, cb) ->
  req = client.request('GET', path, {host: 'localhost'})
  req.end()
  req.addListener 'response', (resp) ->
    resp.addListener 'data', (chunk) ->
      sys.puts chunk
    resp.addListener 'end', cb

# returns a chain gang job that downloads a web page and finishes the worker.
job = (timeout, name) ->
  (worker) ->
    get_path "/#{timeout}/#{name}", ->
      worker.finish()

Now, I can add the callback, and queue the unit of work:

# queues the job
chain.add job(1, 'foo')

# queues the job with the unique name "foo_request"
chain.add job(1, 'foo'), 'foo_request'

Assuming the chain gang queue is active, it should start executing the jobs immediately.

There are two interesting behaviors that are possible now: Duplicate jobs are not run, and only a fixed number of jobs can run at any given time. To highlight them, I have some sample files:

  • webserver.coffee is a silly web server that waits for a specified amount of time before returning a request. A URL like "/3/foo" will return in 3 seconds, for example.
  • chain-with-dupes.coffee shows what happens when multiple jobs with the same name are queued. In this contrived example, only the first, longer one is completed. The rest are ignored.
  • chain-with-uniques.coffee shows how Chain Gang handles more jobs than workers. They just sit in an array until a free worker can take them.
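
To make those two behaviors concrete, here is a rough sketch of the idea in plain JavaScript. It is not Chain Gang's actual code: createQueue and its internals are hypothetical, and only add and worker.finish() are borrowed from the examples above.

```javascript
// Hypothetical in-process queue with a fixed number of workers.
function createQueue(workers) {
  const pending = [];      // jobs waiting for a free worker
  const names = new Set(); // names of jobs that are queued or running
  let active = 0;          // jobs currently running

  function next() {
    while (active < workers && pending.length > 0) {
      const { job, name } = pending.shift();
      active += 1;
      let done = false;
      job({
        finish() {
          if (done) return; // ignore repeated finish() calls
          done = true;
          active -= 1;
          if (name !== undefined) names.delete(name);
          next(); // a worker is free, so pull the next job
        }
      });
    }
  }

  return {
    // Queues a job; returns false if a job with that name is already known.
    add(job, name) {
      if (name !== undefined) {
        if (names.has(name)) return false;
        names.add(name);
      }
      pending.push({ job, name });
      next();
      return true;
    }
  };
}
```

A named job is only remembered while it is queued or running, so the same name can be queued again once the first run finishes.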

On a side note, this is my first lib using npm (Node.js package manager). Type npm install chain-gang to get rockin'.

posted 2010 Jul 13

Geek Talk Interview

So, I was interviewed by The Geek Talk recently. Read on to learn the awful truth behind my early programming days :)

Also, I'm moving to San Francisco this weekend. I'm really looking forward to working side by side with my fellow GitHubbers. Portland's a great place, and I have a feeling I'll be back.

posted 2010 Jul 07

Tee and Child Processes

My first node.js project at GitHub is a replacement download server. I wanted to remove the extra moving pieces required to get it to work. One of the steps involves writing a file from the output of git archive. My initial attempt looked like this:

fs     = require('fs')
child  = require('child_process')

git    = child.spawn 'git', ['archive', 'other options']
stream = fs.createWriteStream outputFilename

# writes the file from git archive to the file stream
git.stdout.addListener 'data', (data) ->
  # if the file stream isn't flushed, pause git's stdout
  if !stream.write(data)
    git.stdout.pause()

# once the file stream is flushed, resume git's stdout
stream.addListener 'drain', ->
  git.stdout.resume()

git.addListener 'exit', (code) ->
  stream.end()

However, git archive's tar format does not come compressed. That means I have to pipe the output to another ChildProcess object. How do I do that without a lot of code duplication? I put the common callbacks into named functions:

fs     = require('fs')
child  = require('child_process')

# writes data to the local file system.
streamer = (data) ->
  if !stream.write(data)
    input.pause()

# pipes the data to the gzip process.
gzipper = (data) ->
  if !gzip.stdin.write(data)
    git.stdout.pause()

# closes the written file stream.
closer = (code) ->
  stream.end()

git    = child.spawn 'git', ['archive', 'other options']
stream = fs.createWriteStream outputFilename

stream.addListener 'drain', ->
  input.resume()

# if this is a tarball, pipe `git archive` through `gzip -n`
if outputFilename.match(/\.tar\.gz$/)
  gzip  = child.spawn 'gzip', ['-n', '-c']
  input = gzip.stdout
  gzip.stdout.addListener 'data', streamer
  gzip.addListener        'exit', closer
  gzip.stdin.addListener  'drain', ->
    git.stdout.resume()
  git.addListener 'exit', (code) ->
    gzip.stdin.end()
else
  input = git.stdout
  git.addListener 'exit', closer

git.stdout.addListener 'data', (if gzip then gzipper else streamer)

That's the code to write either git archive --format=zip or git archive --format=tar | gzip to a file. It works, but the code is more complicated than I'd like.

Ryan suggested I use tee for outputting the file, and /bin/sh to assemble the pipes.

Now, the code is even simpler than my first attempt:

child = require('child_process')

cmd = 'git archive ...'

if outputFilename.match(/\.tar\.gz$/)
  cmd += ' | gzip -n -c'

arch = child.spawn '/bin/sh', ['-c', "#{cmd} | tee #{outputFilename}"]
arch.addListener 'exit', (code) ->
  # do something

posted 2010 Jun 28

You can let go now

If you're reading this, I've completed migrating my blog from Mephisto to Jekyll.

I had fun working on Mephisto, but clearly I failed to foster a good community around it. It was an unconventional Rails app in a time before things like Rack, Sinatra, MongoDB, restful controllers, etc. It's nice to see similar ideas in newer projects though.

Thanks to everyone who helped work on it, especially Justin for the partnership on the design. Mephisto was where our working relationship started, which eventually led to Lighthouse and ENTP...

posted 2010 Jun 23

Railsconf: Building APIs

I'm going to be taking Chris' place in the Building an API panel at Railsconf in June. I'll be speaking about the GitHub API (of course), as well as touching on my experiences building APIs for Lighthouse and Tender.

Don't despair, Chris will still be doing his Redis, Rails, and Resque talk.

posted 2010 May 17

Nori: Node.js Riak wrapper

Map Reduce?

I took the Riak Fast Track and really liked messing around with map reduce functions. So, I wrote nori, a node.js client.

Riak is a key/value store inspired by the Dynamo whitepaper. It has buckets, which contain resources identified by keys, with a REST API. Therefore, it feels a lot like S3, with added map reduce and link walking powers.

Riak is written in Erlang, but Basho decided to also support javascript for map reduce. This makes node.js a natural fit for Riak. Node.js is of course great at handling non-blocking HTTP requests, and function.toString() lets us pass javascript functions through Nori. This means it would be trivial to write local tests of your map reduce functions with local data (without having to go through Riak). Look at how closely my implementation matches the sample functions in the fast track.
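
Here is a small sketch of the function.toString() idea in plain JavaScript. It does not use Nori's API. The mapKeys function, the sample objects, and the request body layout are assumptions for illustration (the layout is how I understand Riak's JavaScript map phase to be described over HTTP).

```javascript
// A map function in the shape of the Fast Track examples: it receives a
// stored object and returns an array of results.
function mapKeys(value) {
  return [value.key];
}

// function.toString() yields source text that can travel in a request.
const source = mapKeys.toString();

// Assumed request body layout for a JavaScript map phase.
const job = JSON.stringify({
  inputs: 'airlines',
  query: [{ map: { language: 'javascript', source: source } }]
});

// Local test without Riak: rebuild the function from its source and run it
// over sample objects (hypothetical data).
const rebuilt = new Function('return (' + source + ')')();
const sample = [{ key: 'KLM' }, { key: 'AA' }];
const results = sample.reduce((all, value) => all.concat(rebuilt(value)), []);
console.log(results);
```

Because the same source string runs locally and inside Riak, a map function can be unit tested against plain objects before it ever touches a cluster.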

Overall, the Fast Track was pretty good. I would have liked some coverage of link walking, but at some point you have to cut the "fast track" short. It was short enough to digest in a sitting (though, it did turn a chillaxin' Sunday afternoon into an epic node.js hackfest).

posted 2010 May 10

No, I did not create a mobile phone framework too

Where we're going, we don't need Rhodes

Oh man, @xinuc is breaking my heart over here. The name Rhodes is taken by some javascript mobile phone framework. Now, I need a new name. I'm leaning towards Noh-Varr if no one else has any suggestions.

Update: Okay, Jed rocks, the new project's name will be nori.

posted 2010 May 10

Escaping your test suite with your life

Arnold in Running Man

One testing feature that I really miss in test/unit is rspec's before(:all) blocks. These are similar to your test/unit setup methods, yet they only run once for the whole test suite.

I was able to implement before(:all) callbacks in context, but the code for that is a little gnarly. Basically, I had to hack the Test::Unit::TestSuite#run method to run some callbacks before and after the test run. Downside: it doesn't work in minitest (it could, but the code isn't very inviting).

So, what if I could do something like this?

class MyTest < ActiveSupport::TestCase
  block = RunningMan::Block.new do
    # something expensive
  end

  setup { block.run(self) }
end

Running Man gives you before(:all) callbacks in test/unit and minitest (ruby 1.9). Basically, you create a RunningMan::Block and call #run in a normal setup call in your tests. The RunningMan::Block calls the block just once, and fills your test case instances with the right instance variables on each test. No muss, no ugly test/unit hacking. (Ok, before you call me a liar, that hack was to allow special teardown blocks on the last test. That feature doesn't work in ruby 1.9, unfortunately.)

This was extracted from GitHub and rewritten when I wanted to use it in a small, non-Rails project.

class MyTest < ActiveSupport::TestCase
  fixtures do
    @post = Post.make # <3 Machinist
  end

  test "check something on post" do
    assert_equal 'foo', @post.title
  end

  test "delete post" do
    @post.destroy
  end
end

posted 2010 May 04

Will the iPad kill comic books?

The Marvel App

Marvel has been dabbling in online comics for a while. They've had digital comic previews in an awkward Flash interface for years, which has recently evolved into a monthly online service. Lately, they've started producing Motion Comics. So, it's not surprising that they're the first big comic company with a serious iPad offering.

The Good

Spider-Man

Marvel was smart to partner with Comixology. Comics look great. I'd recommend picking up the free Spider-Man issues if you want to check it out. You can read a whole page, or go panel to panel if you like. There are a lot of advantages when you compare the iPad to a single comic. You can't tear anything, you can read in the dark, you can zoom in, etc. Also, the iPad holds all of your purchases (or at least, your most recent ones). You won't be stuck with a closet full of long boxes after several years. Also, I hoped that this would be better for the environment, but it's not quite as simple as that.

Navigating the store and buying issues is easy enough. You can search by Series, Genre, Creator, and Storylines. If you find something you like, you can have the app send you notifications when new titles are added. Instant gratification comic book Wednesdays!

The Bad

There's the obvious downside: These comics are protected and unshareable. In fact, it's a little worse than other iTunes content because the files are hidden from you. There's no way to export these comics for backup purposes, like you can with music or video. However, I got a rapid response from their support center confirming that you are able to re-download anything you've purchased.

If this is a deal killer for you, hope is not lost on the iPad. I've found the Comic Zeal app to be decent for reading CBR files. CBR files are basically scans of comics traded on Bit Torrent. The quality varies, and the app isn't quite as smooth as Comixology's. Personally, I only download what I already own, or seek out the graphic novels if I'm trying something new.

The Loser

Unfortunately, if digital comics become popular, then the comic shops will suffer. I like visiting Floating World Comics every week or so. Jason gives me personal recommendations (Paul Pope), advanced screenings to movies, etc. Floating World sometimes holds special First Thursday events, such as the one where I picked up my original David Mack page.

However, iTunes and Amazon have managed to kill off most of the indie book and music stores. I fear the same fate awaits a lot of comic shops too. What can I say though? Comics on the iPad are cheaper, and more convenient to access and store digitally. The same goes for books, movies, and music, even though I live just a few blocks from a Powell's bookstore.

You know what I'd love? Hardcover graphic novels with DVDs full of the digital content.

The Verdict

Civil War

Comixology and Marvel have won me over. The iPad will likely become my preferred medium for reading comics. I'm hoping that they increase the catalog significantly, or at least start offering new titles. For instance, they're relaunching their flagship Avengers titles in May, and it'd be nice to pick those up on the iPad.

If you're interested in checking some comics out, I'd recommend:

  • Iron Man #1-6
  • Amazing Spider-Man #546-548
  • Astonishing X-Men
  • New Avengers
  • Wolverine #20-31 (Enemy of the State, Agent of SHIELD)
  • Captain America (Secret agent/espionage stuff)
  • Planet Hulk (Hulk plays gladiator while stranded on an alien planet)

Check out the Comixology app if superheroes aren't your thing. I've been enjoying Madman so far. Though, I'd wait a bit for Vertigo to come out with a competing service.

I'm curious how others are enjoying the comic book experience, whether you're a comic nut like me, a former reader, or you've only seen the movies.

posted 2010 Apr 05

First day at GitHub

Today's my first day at GitHub. Leaving ENTP and ActiveReload was a hard decision, but I really felt like it was time to move on to something new. I was very humbled by Chris' offer to join their growing team. Apparently they need a drill sergeant to get the app ready for Rails 3. My other recent interests (node.js, mongoDB, bourbon) also fit in well with the company.

I really just got to Portland (2 years ago), but this is a good chance to make the trek to San Francisco. I grew up in a small Kansas town, and have basically been skirting around the idea of going to SF for years. I'll likely stay in Portland for most of the summer, and move in the fall. I don't know yet -- I was actually going to do this last year and chickened out :)

I feel very fortunate to be able to move from one kick ass company to another. I'm really jazzed about what this year is going to be like at GitHub.

posted 2010 Apr 05

My Lesson in Bootstrapping

John Nunemaker wrote up his experiences in bootstrapping Harmony with Steve Smith. He makes a lot of great points that have been working out for them. I just have one quick thing to add to it all:

Embrace Open Source

I helped bootstrap both Lighthouse and Tender Support, and noticed big gains after supporting open source projects. Obviously, these products are more developer focused than most, but we also take every chance to sponsor non profit organizations, educational use, etc.

When Lighthouse launched, I was apprehensive about supporting open source projects. I worried that we'd get some project like Rails or Wordpress on there, melting my poor servers. Really, we were extremely fortunate to get some high profile OSS projects. Our servers didn't melt (yes, we love Engine Yard, our hosting partner), and those projects put our product in front of a lot more eyeballs.

Write these opportunities off as advertising if it helps.

posted 2010 Mar 26

Dealing with $LOAD_PATH properly

I recently went through a process of consolidating a few backend miniapps that power some boring parts of Lighthouse and Tender. I upgraded one app to Sinatra 1.0, and converted another from Rails to Sinatra. The goal was to mount them in the same rack process, therefore simplifying the deployment process all around. Doing this reinforced Ryan's sage advice about not requiring rubygems in your libraries.

With libraries, it's cake. Your gem requirements are light. No one is deploying your libraries as-is, so you can assume that any configuration is handled in their applications. I'm still struggling with tests a bit, however.

  1. Do not require rubygems (or rip, bundler, etc) in any files in lib or test.
  2. Do not mess with the load path either.

Applications are a different matter. These typically will be deployed, so some kind of configuration file is essential. I try to provide examples so coworkers can get up and running really quickly. My config files typically look something like this:

# config.rb
$LOAD_PATH << ... # for setting up the Sinatra app's `lib` path and any 
                  # vendored libraries

require 'rubygems' # you can replace this with Bundler, Rip, etc
gem 'sinatra', '~> 1.0.0'
gem ...

require 'my-sinatra-app'

Now, when I re-package these in a different setting (such as when I mash two Sinatra apps into the same Rack process), I have full control over the $LOAD_PATH and the loaded gem versions.

One pattern I've adopted for apps using Sequel is some kind of #load method. I had problems where my code was loading Sequel::Model instances before the database configuration was set up. Requiring these files first would access the non-existent database configuration and blow up.

# OLD
require 'my-app' # requires 'sequel' and 'my-app/foo_model'
Sequel.db = '...'

# NEW
require 'my-app'
MyApp.load do
  Sequel.db = '...'
end

# implementation
def self.load
  require 'sequel'
  yield
  require 'my-app/foo_model'
end

For what it's worth, I've started using autoload more lately. That negates the problem completely.

# my-app.rb
require 'sequel'
module MyApp
  autoload :FooModel, 'my-app/foo_model'
end

# config.rb
require 'my-app'
Sequel.db = '...'
MyApp::FooModel.do_something

This is Sinatra-specific, but always subclass from Sinatra::Base. I opt for the classic Sinatra style a lot because it's so convenient. But once I have something running and tested, I make it a full class.

  1. Using the classic style adds a lot of crufty methods to every object. This can cause problems in mid to large projects.
  2. You can easily isolate and test these Sinatra classes with Rack Test. Resque's Server provides a good sample implementation with tests.

class MyAppTest < Test::Unit::TestCase
  include Rack::Test::Methods

  def app
    MyApp::Api # subclasses Sinatra::Base
  end
end

This problem also extends to libraries using Sinatra. At first, I couldn't figure out why one of my older Sinatra apps still used the classic Sinatra DSL. I got my answer when I converted it: ClassyResources was including itself into main. I was not too pleased.

I'm assuming this code pre-dated Sinatra's excellent extension API, so I spent an hour registering the modules as proper Sinatra extensions. I was glad I could focus my programmer rage into a good learning process.

My TwitterServer library serves as a good example of a well-tested Sinatra extension.

Following these guidelines, I was able to load both of the Sinatra apps together with a simple 3-line rackup file.

require 'config'
use MiniApp1
run MiniApp2

# thin -R config.ru start

My only suggestion if you come across crappy libraries that muck with your $LOAD_PATH is to fork away and push any patches upstream. Sorry in advance if it's one of mine :)

What good libraries out there handle this poorly? Which ones are shining examples? How do you handle similar issues?

posted 2010 Mar 26

Ruby 1.9 on Heroku

Heroku just pushed their new deployment stacks to public beta. You can now run your Heroku apps on Ruby 1.8.7 and 1.9 (as opposed to the old standard: 1.8.6).

To test this out, I whipped up Ultraviolence, a quick wrapper around the Ultraviolet gem. Ultraviolet will syntax highlight text, using parsed Textmate bundle files. Any language that Textmate supports will work, using any theme that Textmate supports.

There's also a web api if that's how you want to roll...

I chose Ultraviolet because its installation in ruby 1.8.x was always a tricky issue. Ruby needs the Oniguruma regex library to parse the Textmate bundles. This means you have to install some software from a Japanese geocities page and compile the oniguruma gem. Ruby 1.9 uses this regex library by default, so Ultraviolet is a snap to set up.

Thanks to Heroku and RVM, it was pretty easy to get a ruby 1.9 app developed and deployed.

posted 2010 Mar 19

My talk about Twitter-Node at PDXJS

I was recently invited to talk about my Twitter Node project at last night's PDX Javascript Admirers meeting. I was really nervous about giving my first talk in several years, but I did alright. My slides are up on Heroku.

The big win of the talk, however, was Scott's showoff app for composing presentations.

Showoff is a sinatra app that builds a presentation from subdirectories of markdown files. In the past, I'd spend a lot of time trying to make Keynote presentations look pretty, or fiddling with HTML for raw slides. You know, I just don't care to do a lot of that stuff. Showoff lets me focus on the content.

One technique I really enjoyed was showing a block of code in multiple slides with different comments. The comments 'animate', pointing at whatever it is I was talking about.

Also, the rapid Heroku deployment saved us a bit of hassle. I didn't have the right micro adapter for my laptop, so I was able to run the presentation just fine from another laptop.

posted 2010 Feb 25