Friday, June 20, 2008

So I finally bit the bullet and did it... a Merb, DataMapper and CouchDB lovechild.

I got up this morning, wonderful day, got in to the office early, ground up some rocket fuel coffee beans and began my development day. Started off pretty sane, but slowly it turned into a full-blown hackfest!

Surprise, surprise... look who's coding!


Yes, my Cuban bees: after finally hitting the limitations of the CouchDB 0.7.2 adapter, I took one for the team and got it working with 0.8.0.

This is by no means complete, as I'm also going to build paging into the adapter before I submit it for inclusion in DataMapper. So without any further delay... here are the goodies. I'll again just drop in what I've overloaded so you can have a pretty seamless transition if you need it. Also, this currently runs off of the CouchDB 0.8.0 trunk (see this page for the Subversion link).

Also, note the TODO; it should be done later this evening, all things considered. Last but not least, I hope this helps someone other than myself. Enjoy!

module DataMapper
  module Types
    class Object < DataMapper::Type
      primitive String
      size 65535
      lazy true
      track :hash

      def self.dump(value, property)
        value.to_json unless value.nil?
      end

      def self.load(value, property)
        value.nil? ? nil : value
      end
    end
  end
end

module DataMapper
  class Collection < LazyArray
    attr_reader :_total_rows

    private

    def initialize(query, options = {}, &block)
      assert_kind_of 'query', query, Query

      unless block_given?
        raise ArgumentError, 'a block must be supplied for lazy loading results', caller
      end

      @query          = query
      @key_properties = model.key(repository.name)
      @_total_rows    = options[:total_rows] || 0
      super()
      load_with(&block)
    end

  end
end

module DataMapper
  module Adapters
    class CouchDBAdapter < AbstractAdapter

      def read_one(query)
        doc = request do |http|
          http.request(build_request(query))
        end
        unless doc["total_rows"] == 0
          data = doc['rows'].first
          query.model.load(
            query.fields.map do |property|
              data["value"][property.field.to_s]
            end,
            query)
        end
      end

      def read_many(query)
        doc = request do |http|
          http.request(build_request(query))
        end
        Collection.new(query, {:total_rows => doc['total_rows']}) do |collection|
          # Use `row` here rather than shadowing the outer `doc` local.
          doc['rows'].each do |row|
            collection.load(
              query.fields.map do |property|
                property.typecast(row["value"][property.field.to_s])
              end
            )
          end
        end
      end

      def ad_hoc_request(query)
        if query.order.empty?
          key = "null"
        else
          key = (query.order.map do |order|
            "doc.#{order.property.field}"
          end).join(", ")
          key = "[#{key}]"
        end

        options = []
        options << "count=#{query.limit}" if query.limit
        options << "skip=#{query.offset}" if query.offset
        options = options.empty? ? nil : "?#{options.join('&')}"

        request = Net::HTTP::Post.new("/#{self.escaped_db_name}/_temp_view#{options}")
        request["Content-Type"] = "application/json"

        if query.conditions.empty?
          request.body = '{"language":"javascript","map":"function(doc) { if (doc.type == \'' + query.model.name.downcase + '\') { emit(' + key + ', doc);} }"}'
        else
          conditions = query.conditions.map do |operator, property, value|
            condition = "doc.#{property.field}"
            condition << case operator
              when :eql  then " == '#{value}'"
              when :not  then " != '#{value}'"
              when :gt   then " > #{value}"
              when :gte  then " >= #{value}"
              when :lt   then " < #{value}"
              when :lte  then " <= #{value}"
              when :like then like_operator(value)
            end
          end
          request.body = '{"language":"javascript","map":"function(doc) {if (doc.type == \'' + query.model.name.downcase + '\') { if (' + conditions.join(' && ') + ') { emit(' + key + ', doc);}}}"}'
        end

        request
      end

      def request(parse_result = true, &block)
        res = nil
        Net::HTTP.start(@uri.host, @uri.port) do |http|
          res = yield(http)
        end
        JSON.parse(res.body) if parse_result
      end

      module Migration
        def create_model_storage(repository, model)
          assert_kind_of 'repository', repository, Repository
          assert_kind_of 'model',      model,      Resource::ClassMethods

          uri  = "/#{self.escaped_db_name}/_design/#{model.storage_name(self.name)}"
          view = Net::HTTP::Put.new(uri)
          view['content_type'] = "javascript"
          views = model.views.reject { |key, value| value.nil? }
          # TODO: This absolutely should be handled up a level.
          # You should pass view a hash like
          # {:map => "function(doc){...}", :reduce => "function(doc){...}"}
          # We'll get there...
          view.body = { :views => views.each { |k, v| views[k] = { :map => v } } }.to_json

          request do |http|
            http.request(view)
          end
        end

        def destroy_model_storage(repository, model)
          assert_kind_of 'repository', repository, Repository
          assert_kind_of 'model',      model,      Resource::ClassMethods

          uri = "/#{self.escaped_db_name}/_design/#{model.storage_name(self.name)}"
          response = http_get(uri)
          unless response['error']
            uri += "?rev=#{response["_rev"]}"
            http_delete(uri)
          end
        end
      end

    end
  end
end
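To give you an idea of what the adapter actually does with an ad-hoc query, here's a rough sketch. The Post model and its properties are hypothetical (anything with a type-style discriminator and plain properties will do), and the exact request body may vary a bit with your version, but the shape is what ad_hoc_request above builds:

class Post
  include DataMapper::Resource
  property :id,    String, :key => true
  property :title, String
  property :views, Integer
end

# A query like this...
posts = Post.all(:views.gt => 100, :limit => 10, :offset => 20)

# ...gets translated into a POST to
#   /<db>/_temp_view?count=10&skip=20
# with a body along the lines of:
#   {"language":"javascript",
#    "map":"function(doc) {if (doc.type == 'post') {
#             if (doc.views > 100) { emit(null, doc);}}}"}
# CouchDB runs the map function server-side and hands back the matching rows.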

Thursday, June 19, 2008

CouchDB: one more thing

I almost forgot. DataMapper currently stores things like arrays and hashes in CouchDB perfectly: it converts them into JSON and stores them, so they're query-able! The bad thing is that when it attempts to read them back out, it (currently) assumes that because the property type in your model is set to "Object", it needs to first base64-decode the marshalled/encoded object and then deserialize it.

Well, I really don't want things that could be incredibly important being fudged in my data, especially now that I have another dimension I can query on. So I "fixed" the Object data type for when I'm working with CouchDB. This will allow you to both store and read values such as arrays and/or hashes without any voodoo. Just drop this override into your init.rb, or another place where it's sure to be loaded AFTER DataMapper has been initialized.


module DataMapper
  module Types
    class Object < DataMapper::Type
      primitive String
      size 65535
      lazy true
      track :hash

      def self.dump(value, property)
        value.to_json unless value.nil?
      end

      def self.load(value, property)
        value.nil? ? nil : value
      end
    end
  end
end
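As a quick sanity check, here's roughly what that override buys you (the Recipe model is made up for illustration):

class Recipe
  include DataMapper::Resource
  property :id,          String, :key => true
  property :name,        String
  property :ingredients, Object   # stored as plain JSON, not marshalled
end

recipe = Recipe.new(:name => 'salsa',
                    :ingredients => ['tomato', 'onion', 'cilantro'])
recipe.save

# The document in CouchDB now contains a real JSON array:
#   "ingredients": ["tomato", "onion", "cilantro"]
# so your map functions can reach into it, and reading the record back
# gives you the structure as-is instead of a base64 blob.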

DataMapper and CouchDB

If you're a regular reader of this hapless place I call a blog, then it's likely that you either:
a) have heard me spew off at the mouth about how incredibly awesome CouchDB is
or
b) have not, and are likely ready to completely disagree with the entire philosophy behind it and call me an idiot. Which is fine; that's fine!

Anyway, to quote the wonderful CouchDB page:


Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.


So rather than having big blobs of tables prone to locking, or even individual rows prone to locking, CouchDB sidesteps the issue entirely. No matter which RDBMS you use, you're likely to be trapped by those same problems.

A quick rant: it's not the language, it's the I/O. Be it the database, the disk (likely via the database) or the network, your language doesn't matter. An incredibly well-crafted application will still fall to pieces because of one of those three issues.

Back to it. With CouchDB, you store documents (think of them as rows, but three-dimensional): you can store a standard text field, arrays, hashes, a marshalled and base64-encoded object... you name it, you can put it in there. And it's all linear, and most importantly, it's based on map/reduce. For the uninitiated, a document might look something like the sketch below (a made-up example, shown as a Ruby hash; field names are illustrative).
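# A single CouchDB document: fields, arrays and nested hashes all live
# side by side in one record (all values here are invented).
{
  "_id"    => "290f2c...",          # document id, assigned by CouchDB
  "_rev"   => "1-d5b2...",          # revision, used for conflict detection
  "type"   => "post",               # the discriminator the adapter keys on
  "title"  => "CouchDB is neat",
  "tags"   => ["ruby", "couchdb"],  # an array, stored as-is
  "author" => { "name" => "Bradford" }  # a nested hash, also as-is
}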

So in the old country, where you'd key in on those ever-so-important relationships and eager loading to save yourself from the n+1 problem, you don't necessarily run into that with CouchDB. Why? Because your map function will return all of the documents that are of interest to one another for your view. You're simply storing crap in all of the documents; the more de-normalized, the better! You can think of it sort of like this: a row could be considered a document, but within your document you store all of the pertinent information instead of just storing an _id field with the appropriate row's id as a value. It gets rid of those horrible join queries by de-normalizing the data, as the sketch below tries to show.
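To make that concrete, here's a hedged sketch of the two shapes (names invented for illustration):

# Normalized, RDBMS-style: the comment only points at its post,
# so displaying a post means a join (or an n+1 pile of lookups).
comment_row = { "id" => 7, "post_id" => 42, "body" => "Nice writeup!" }

# De-normalized, CouchDB-style: the post document carries its comments
# along with it, so one document has everything the view needs.
post_doc = {
  "type"     => "post",
  "title"    => "DataMapper and CouchDB",
  "comments" => [
    { "author" => "jim", "body" => "Nice writeup!" }
  ]
}

# A view's map function then emits whole documents of interest --
# no joins, just "give me every post document":
#   function(doc) { if (doc.type == 'post') { emit(null, doc); } }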

It's really rather difficult for me to talk about, as I'm still wading my way through it. I'm really not confident in my explanation above, but again, I'm figuring this out as I go along. So feel free to post any "fixes" in the comments as you see fit.

Enter the DataMapper.

As of DataMapper 0.9.2, they have added an adapter for CouchDB. Some people would really like to just put some JavaScript together and talk directly to Couch to save some layers, but the fundamental problem with that (for now) is that you give up your database's security. A HUGE thing, I know. Even for people using it with ORMs in front of it, it's still problematic. You're prone to attacks because it's purely HTTP-based, and spoofing over HTTP is a joke. If you want to know why or how I know these things, just ask.

Anyways, the benefit of using an ORM in front of the database is twofold for me. First, I don't have to worry about some script-kiddie coming along, scraping my site and yanking my application out from under me. Yeah, maybe that's selfish of me as an open source developer, but hey, I use open source to make, guess what? MONEY! Uh, where the hell was I? Oh yes. The second thing, aside from shielding the database's security-less protocol, is simply the ease of use when tying it into the actual metal of my application. Validations, simple object creation, abstraction: it's all there, and it makes my application a wonderful thing to work in.

Now, for every good there is certainly a bad. Currently, DM's CouchDB adapter only works with version 0.7.2. There are a whole host of people trying to write Ruby-based ORMs that work exclusively with CouchDB. That's great! Really, it is. But there is a long, long road to tread when you undertake such a project, which is why I have chosen to stick with DataMapper, alter the adapter to suit my needs, and then, when the time comes, release it back into the open. It's not a huge endeavor to alter the adapter as it stands now. Having hacked through the guts of it for the last week, I have come to greatly respect all of the hard, intelligent work that goes into DataMapper and Merb. Compared to some of the internals of Rails, with all of the "magic", super calls, alias_method_chains and whatnot, one can truly appreciate readable, sensible code that you can literally walk through without having to do a stack trace! Excellent job, men!

Finally, this wouldn't be complete without me sharing a little nugget of information. This may not be the best way to handle this particular "problem", but it works well for me, and maybe it will work well for you too (if you decide to drink the CouchDB punch)!

Let's talk about paging for a moment. How do you typically do it in applications like Rails, PHP, *.NET, anything really? You typically need to perform two queries, right? One that gets the total count of records you'll be acting upon, and another to pull in that data a little at a time (via a limit and offset). Luckily for us, CouchDB hands back the total number of documents the view was queried over, eliminating the count(*) query. So what I have done is modify the CouchDB adapter just slightly to expose an attr_reader which holds the total number of documents in your view. See the following code samples.
The first is the override of how DataMapper's Collection works. My first thought was to just include the total rows with each object in the Collection, but that's sloppy, eats up memory, and you wouldn't have access to it without instantiating one of the objects in the collection (redundant, yea?). So let's go up a level and attach it to the Collection itself. Again, maybe there's a better way, but hey, this is what I came up with.


module DataMapper
  class Collection < LazyArray
    attr_reader :_total_rows

    private

    def initialize(query, options = {}, &block)
      assert_kind_of 'query', query, Query

      unless block_given?
        raise ArgumentError, 'a block must be supplied for lazy loading results', caller
      end

      @query          = query
      @key_properties = model.key(repository.name)
      @_total_rows    = options[:total_rows] || 0
      super()
      load_with(&block)
    end

  end
end


The second sample is where we grab the total_rows attribute from the returned JSON document and pass it as an option to Collection's initialize method as it is constructed. We grab the value, pass it up, and it's now available in one spot for your collection as a read-only attribute. Bye-bye, count query!

module DataMapper
  module Adapters
    class CouchDBAdapter < AbstractAdapter
      def read_many(query)
        doc = request do |http|
          http.request(build_request(query))
        end
        Collection.new(query, {:total_rows => doc['total_rows']}) do |collection|
          # Use `row` here rather than shadowing the outer `doc` local.
          doc['rows'].each do |row|
            collection.load(
              query.fields.map do |property|
                property.typecast(row["value"][property.field.to_s])
              end
            )
          end
        end
      end
    end
  end
end
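And here's roughly how you'd use it from application code. The page math and the Post model are mine, not part of the adapter; the only new thing the adapter gives you is the _total_rows reader on the collection:

PER_PAGE = 10
page     = 3  # e.g. parsed from params[:page]

# One round trip: the rows for this page AND the view's total count.
posts = Post.all(:limit => PER_PAGE, :offset => (page - 1) * PER_PAGE)

total_pages = (posts._total_rows.to_f / PER_PAGE).ceil
puts "page #{page} of #{total_pages} (#{posts._total_rows} documents)"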


So please, do yourself a favor and investigate CouchDB. It's nowhere near perfect at the moment, but you owe it to yourself to expand your horizons and really marvel at the speed at which it performs. You know that "lag" in ActiveRecord/DataMapper when you're operating on rows in MySQL/Postgres or even Oracle? Guess what? It's simply not there!

In fact, on a server of mine with only 256MB of RAM, I can write a record, pull it back, build up and tear down a Merb thread, and respond to the browser in less than a second. The created_at age I store in the document reads "0 seconds ago". Now, right now you're thinking, "Well Bradford, that's not really impressive... I mean, what about concurrency? This web stuff is all about concurrency!" You're right! But I forgot to mention that I was running siege on that same server with 20 concurrent connections reading the same stuff I was putting in, and it was still just as fast. On cheap, un-optimized hardware it performs incredibly fast. So please, check it out. The guys who work on this thing are wonderful (irc.freenode.net #couchdb) and it's really exciting to watch this project grow!

Next up - how mowing the grass and application development could maybe one day have a lot in common. Aside from both being done electronically.

boy oh boy

So much stuff going on. A new job, the potential for my shares in an old job to be worth a great deal, summer-time, the Red Wings winning the Stanley Cup, mumbl! It's hard to keep a guy like me blogging with all this stuff going on.

I'm not going to make this post about NOT blogging. It's more of an intro to what's to come. Not in a day or two, but within the next hour or so. My first topic is going to be the freshest thing on my mind: the DataMapper CouchDB adapter. After that I have a phone call regarding some pretty big stuff. It could be my ticket to some serious cash for helping guide a small start-up through the funding period. What a ride that has been over the last year and a half! Working for free (well, for stock options at least; we know what those can be worth).

So without any further ado, the real blogging item...