Thursday, June 19, 2008

DataMapper and CouchDB

If you're a regular reader of this hapless place I call a blog, then it's likely that you either:
a) have heard me spew off at the mouth about how incredibly awesome couchdb is
or
b) have not, and are ready to completely disagree with the entire philosophy behind it and call me an idiot. Which is fine!

Anyway, to quote the wonderful couchdb page:


Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.


So rather than having blobs of tables prone to locking, or even rows that are prone to locking, you have documents. With tables and rows, no matter which RDBMS you use, you're likely to be trapped into the same issues.

A quick rant - it's not the language, it's the I/O. Be it database, disk (likely from a database) or the network; your language doesn't matter. An incredibly well-crafted application will still fall to pieces because of one of those three issues.

Back to it - with couchdb, you store documents (think of them as rows, but they can be 3-dimensional). You can store a standard text field, arrays, hashes, even a marshaled and base64-encoded object; you name it, you can put it in there. And it's all linear and, most importantly, it's based on map/reduce.

So in the old country where you'd key in on those ever-so important relationships and eager-loading to save on the n+1 problem, you don't necessarily run into that with couchdb. Why? Because your map function will return all of the documents that are of interest to one another for your view. You're simply storing crap in all of the files, the more de-normalized the better! You can think of it sort of like this: a row could be considered a document. But within your document, you store all of the pertinent information instead of just storing an _id field with the appropriate row's id as a value. It works to get rid of those horrible join queries by de-normalizing the data.
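To make the row-versus-document idea concrete, here is a minimal Ruby sketch. Every field name and value in it is made up for illustration; the only thing taken from CouchDB itself is that documents are JSON. It shows the author and comments living inside the post document rather than behind an `author_id`.

```ruby
require 'json'

# Relational style: the post row only points at another row by id,
# so rendering the post means a second lookup (or a join).
normalized_post = { 'id' => 1, 'title' => 'DataMapper and CouchDB', 'author_id' => 7 }

# Document style: everything needed to render the post is in one document.
# These keys are hypothetical, not a schema CouchDB requires.
post_document = {
  '_id'      => 'post-1',
  'type'     => 'post',
  'title'    => 'DataMapper and CouchDB',
  'author'   => { 'name' => 'Bradford' },
  'tags'     => %w[couchdb datamapper],
  'comments' => [
    { 'author' => 'Reader One', 'body' => 'Nice write-up.' }
  ]
}

# Round-trip through JSON, which is how the document would travel over HTTP.
restored = JSON.parse(JSON.generate(post_document))

puts normalized_post.key?('author')   # the row has no author data, only an id
puts restored['author']['name']       # the document carries it directly
puts restored['comments'].length
```

The trade-off is the usual one with de-normalized data: reads get simpler, but a change to the author's name has to be written to every document that embeds it.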

It's really rather difficult for me to talk about as I'm still wading my way through it. I'm really not confident in my explanation above, but again, I'm figuring this out as I go along as well. So feel free to post any "fixes" in the comments as you see fit.

Enter the DataMapper.

As of datamapper 0.9.2 they have added an adapter for couchdb. Some people would really like to just put some javascript together and link directly to couch to save some layers, but the fundamental problem with that (for now) is that you give up your database's security. A HUGE thing, I know. Even for people using it with ORMs in front of it, it's still problematic. You're prone to attacks because it's purely HTTP-based. And spoofing over HTTP is a joke. If you want to know why or how I know these things, just ask.

Anyways, the benefit of using an ORM in front of the database is twofold for me. First, I don't have to worry about some script-kiddie coming across, scraping my site and yanking my application out from under me. Yea, maybe that's selfish of me being an open source developer, but hey, I use open source to make, guess what? MONEY! Uh, where the hell was I? Oh yes. So the second thing, aside from walling off the database's security-less protocol, is simply the ease of use when tying it into the actual metal of my application. Validations, simple object creation, abstraction: it's all there and makes my application a wonderful thing to work in.

Now, for every good there is certainly a bad. Currently DM's CouchDB adapter only works with version 0.7.2. There are a whole host of people trying to write ruby-based ORMs that work exclusively with CouchDB. That's great! Really, it is. But there is a long, long road to tread when you undertake such projects. Which is why I have chosen to stick with DataMapper and just alter the adapter to suit my needs and then, when the time comes, release it back into the open. It's not a huge endeavor to alter the adapter as it stands now. Having hacked through the guts of it for the last week, I have come to greatly respect all of the hard, intelligent work that goes into DataMapper and Merb respectively. When comparing it to some of the internals of Rails, with all of the "magic", super calls, alias_method_chains and what not, one can truly appreciate readable, sensible code that you can literally walk through without having to do a stack trace! Excellent job, men!

Finally, this wouldn't be complete without me sharing a little nugget of information. This may not be the best way to go about handling this certain "problem" but it works well for me and maybe it will work well for you too (if you decide to drink the CouchDB punch)!

Let's talk about paging for a moment. How do you typically do it in applications like Rails, PHP, .NET, anything really? Well, you typically need to perform two queries, right? One that gets the total count of records that you'll be acting upon and another to pull in that data a little at a time (via a limit and offset). Luckily for us, CouchDB actually hands back the total number of rows in the view you queried along with the results, thus eliminating the count(*) query. So what I have done is modify the CouchDB adapter just slightly to expose an attr_reader which holds the total number of documents in your view. See the following code samples:
The first is the override on how DataMapper's Collection works. My first thought was to just include the total rows with each object in the Collection. But that's sloppy, eats up memory and you wouldn't have access to it without instantiating one of the objects in the collection (redundant, yea?). So, let's go up a level and just attach it to the Collection itself. Again, maybe there's a better way but hey, this is what I came up with.


module DataMapper
  class Collection < LazyArray
    # Total number of rows in the CouchDB view, as reported by the adapter.
    attr_reader :_total_rows

    private

    def initialize(query, options = {}, &block)
      assert_kind_of 'query', query, Query

      unless block_given?
        raise ArgumentError, 'a block must be supplied'
      end

      @query = query
      @key_properties = model.key(repository.name)
      # Default to 0 when the adapter does not pass a total.
      @_total_rows = options[:total_rows] || 0
      super()
      load_with(&block)
    end

  end
end


The second sample is where we take the total_rows attribute from the JSON that CouchDB returns and pass it as an option to Collection's initialize method as the collection is constructed. We grab the value, pass it up and it's now available in one spot for your collection as a read-only attribute. Bye-bye count query!

module DataMapper
  module Adapters
    class CouchDBAdapter < AbstractAdapter
      def read_many(query)
        doc = request do |http|
          http.request(build_request(query))
        end
        # Hand the view's total_rows to the Collection alongside the query.
        Collection.new(query, {:total_rows => doc['total_rows']}) do |collection|
          # Block variable named row so it does not shadow the outer doc.
          doc['rows'].each do |row|
            collection.load(
              query.fields.map do |property|
                property.typecast(row["value"][property.field.to_s])
              end
            )
          end
        end
      end
    end
  end
end
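For what it's worth, here is a rough sketch of what that total buys you when building pager links. It doesn't touch DataMapper or a live CouchDB at all. The JSON is a hand-written stand-in shaped like a view response (total_rows, offset, rows), and page_count is a hypothetical helper of mine, not part of any library.

```ruby
require 'json'

# Hand-written stand-in for a CouchDB view response. The ids and values are made up.
view_response = JSON.parse(<<~RESPONSE)
  {
    "total_rows": 45,
    "offset": 20,
    "rows": [
      { "id": "post-21", "key": "post-21", "value": { "title": "Example" } }
    ]
  }
RESPONSE

# Hypothetical helper: number of pages needed for total_rows documents,
# using integer arithmetic to round up.
def page_count(total_rows, per_page)
  (total_rows + per_page - 1) / per_page
end

per_page     = 10
total_pages  = page_count(view_response['total_rows'], per_page)
current_page = (view_response['offset'] / per_page) + 1

puts "page #{current_page} of #{total_pages}"
```

With the modified adapter, the number read from view_response['total_rows'] here is the same one that would end up in the collection's _total_rows reader, so the pager needs only the one request.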


So please do yourself a favor and investigate couchdb. It's nowhere near perfect at the moment, but you owe it to yourself to expand your horizons and really marvel at the speed at which it performs. You know that "lag" in ActiveRecord/DataMapper when you're operating on rows in MySQL/Postgres or even Oracle? Guess what? It's simply not there!

In fact, on a server of mine which only has 256MB of RAM, I can write a record, pull it, build up and tear down a merb thread and respond to the browser in less than a second. The age of the created_at I store in the document reads "0 seconds ago". Now, right now you're thinking, well, Bradford, that's not really impressive...I mean, what about concurrency? This web stuff is all about concurrency! You're right! But I forgot to mention that I was running siege on the same server with 20 concurrent connections reading the same stuff I was putting in, and it was still just as fast. On cheap, un-optimized hardware it performs incredibly fast. So please, check it out - the guys who work on this thing are wonderful (irc.freenode.net #couchdb) and it's really exciting to be watching this project grow!

Next up - how mowing the grass and application development could maybe one day have a lot in common. Aside from both being done electronically.
