Antares Trader Blog

The universe at your fingertips

Regrets of a Sharded Database

Wednesday, Aug 19, 2009, 4:41 pm

Database sharding, dividing your data into pieces that live in different databases, is the last resort of companies with extreme scalability problems. It is not something that should be done lightly, and now I understand why.

No, I'm not having performance issues with my blogs, but I took the easy way out when I needed to take code originally designed to run a single site and make it work for several. Instead of writing all my database calls to be domain aware, I used a bit of black magic to change data sources depending on the domain the request comes in on.

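#!/ruby
# Don't do this if you wish to stay sane.

# Push the repository for this domain onto DataMapper's context stack,
# making it the default for every call that follows.
def self.set_repository(r)
  if DataMapper::Repository.adapters.has_key?(r)
    Merb.logger.debug { "  using database: #{r.inspect}" }
    DataMapper::Repository.context << DataMapper::repository(r)
  end
end

# Pop the repository pushed by set_repository once the request is done.
def self.reset_repository(r=nil)
  if DataMapper::Repository.adapters.has_key?(r) || r.nil?
    DataMapper::Repository.context.pop
  end
end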

These functions are called from my Application controller with the database that is associated with the URL domain. In effect, I have quietly altered the default repository between calls. It seemed very clever at the time, but now it is causing more and more problems.
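To make the mechanism concrete, here is a rough sketch of that per-request switch in plain Ruby, so it runs without Merb or DataMapper. `REPOSITORIES`, `CONTEXT`, `current_repository`, and `with_repository_for` are hypothetical stand-ins invented for illustration, not the real DataMapper API, and the domain names are made up.

```ruby
# Toy model of a repository context stack. Every name here is hypothetical.
REPOSITORIES = {
  "blog-one.example" => :blog_one,
  "blog-two.example" => :blog_two
}.freeze

CONTEXT = [] # stack of repository names; the last entry wins

# The repository that "default" calls would currently hit.
def current_repository
  CONTEXT.last || :default
end

# Push the repository for the request's domain, run the action, always pop.
def with_repository_for(domain)
  name = REPOSITORIES[domain]
  return yield if name.nil? # unknown domain: stay on the default

  CONTEXT.push(name)
  begin
    yield
  ensure
    CONTEXT.pop # runs even if the action raises
  end
end

seen = with_repository_for("blog-two.example") { current_repository }
```

Everything that runs inside the block silently talks to a different database than the same code would outside it, which is what makes the trick so hard to reason about later.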

The first hint that something was wrong happened when I went to auto migrate my databases. My old friend rake db:automigrate missed the repositories that the important data actually lived in. So, I said to myself, "I'll do it by hand the first time then write a new rake task."

The next trick was importing data from the old WordPress blog. I had to specify which repository I wanted the data in. After a few missteps, I had to remigrate my database by hand and then hand-edit my import script to do the right thing. Again, I thought, "Get something running, data migration is a one-time task."

Numerous other little things broke along the way: layouts that needed information from ambiguous data sources, slices that thought they needed a default data source, and other small annoyances. Then came the big one.

Categories broke. This was not a big deal at first. They were not high on my list of things to do, but they had to be there if I wanted to import them from the old blog. I checked to make sure the data was there, then went about getting some much-needed fresh content in place.

Then I finally looked into why the categories were broken. It turns out that DataMapper relationships assume that whatever repository a resource has when it is set up will remain the same for the life of the program. When the code saw that someone had changed the repository, it kindly went to put it back for me. This is not really a bug, but the documented behavior.
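The failure mode is easier to see in a toy model. The sketch below is plain Ruby and is not DataMapper's actual implementation; `CONTEXT_STACK`, `active_repository`, and `ToyRelationship` are hypothetical names, used only to show how something that captures the repository at definition time stops following later switches.

```ruby
# Toy model only: none of this is DataMapper code, and all names are made up.
CONTEXT_STACK = [:default]

# The repository that ordinary finds would use right now.
def active_repository
  CONTEXT_STACK.last
end

# A relationship remembers the repository that was active when it was defined.
class ToyRelationship
  def initialize(name)
    @name = name
    @repository_name = active_repository # captured once, at definition time
  end

  # Lookups go back to the captured repository, ignoring later switches.
  def lookup_repository
    @repository_name
  end
end

categories = ToyRelationship.new(:categories) # defined while :default is active

CONTEXT_STACK.push(:blog_two)                    # a request switches databases
plain_find_repo   = active_repository            # ordinary finds follow the switch
relationship_repo = categories.lookup_repository # the relationship does not
CONTEXT_STACK.pop
```

In this model the posts come from `:blog_two` while their categories are looked up in `:default`, which matches the symptom: the data is there, but the association cannot see it.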

After some soul searching, I decided that I had no business having more than one database for a few low-traffic blogs. Now the question becomes how to write domain-aware finds into my code as cleanly as possible. I'll let everyone know when I find the answer.
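One possible shape for a domain-aware find, sketched in plain Ruby rather than DataMapper and offered only as a starting point, not as the answer I eventually settle on. `Post`, `PostStore`, and `for_domain` are hypothetical names, and an in-memory array stands in for the single shared database.

```ruby
# Hypothetical sketch: one shared store, every record tagged with its domain.
Post = Struct.new(:domain, :title)

class PostStore
  def initialize(posts)
    @posts = posts
  end

  # Return a view limited to one domain, so callers cannot forget the filter.
  def for_domain(domain)
    PostStore.new(@posts.select { |post| post.domain == domain })
  end

  def titles
    @posts.map(&:title)
  end
end

store = PostStore.new([
  Post.new("blog-one.example", "Hello from one"),
  Post.new("blog-two.example", "Hello from two"),
  Post.new("blog-one.example", "Second post")
])

blog_one_titles = store.for_domain("blog-one.example").titles
```

The idea is that the controller would pick the domain once per request and pass the scoped view around, so the domain is explicit data rather than hidden global state.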
