Friday, November 30, 2007

New job and polyphasic sleep again.

About two months ago I got a new job doing Rails web development. I really love it, and I was also able to go back to polyphasic sleeping. I am glad to be back and hope my schedule will let me stay with it for a very long time.

Alternative databases

When you think about the root purpose of a database, it is simply to store and access data in an orderly fashion. Relational databases like MySQL have long been the only real option (discounting flat files). Ever since I read about BigTable and S3 I have wanted a distributed data storage mechanism to go with EC2 (or for that matter, any other large web cluster that I don't have). The other day I came across CouchDB, an HTTP-accessible document-oriented database server. I got quite excited, and then this article brought me back to my senses. CouchDB is cool, but it is not some holy grail. Every data storage mechanism, and there are lots, makes its own trade-off among the consistency, availability, and partition tolerance of the CAP theorem. Each is suited to a different situation. With that in mind, I found three more data storage systems/styles for scalable web systems, listed below alongside CouchDB.

1) I didn't learn about sharding for the first time here (High Scalability has lots of good content about it), but my reading reminded me that RDBMSs have been around a long time and have gained many very nice features. Relational databases can also be made very scalable via sharding. In reality, sharding is the basis for almost any distributed system, but with most relational systems you have to implement it yourself. Because you build the sharding mechanism, this option is extremely flexible in meeting the particular needs of your app, but it takes a lot of time.
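To make "implement it yourself" a bit more concrete, here is a minimal sketch of hash-based shard routing. Everything in it is an assumption for illustration: the `ShardedStore` class, the shard names, and the plain dicts standing in for separate database servers are all made up, and a real setup would hold one database connection per shard instead.

```python
import hashlib


class ShardedStore:
    """Toy key-to-shard router; each shard is a dict standing in for a database."""

    def __init__(self, shard_names):
        if not shard_names:
            raise ValueError("at least one shard is required")
        self.shard_names = list(shard_names)
        # Hypothetical stand-ins: one in-memory dict per shard.
        self.shards = {name: {} for name in self.shard_names}

    def shard_for(self, key):
        # Hash the key so the same key always lands on the same shard.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return self.shard_names[int(digest, 16) % len(self.shard_names)]

    def put(self, key, row):
        self.shards[self.shard_for(key)][key] = row

    def get(self, key):
        # Returns None when the key is not stored on its shard.
        return self.shards[self.shard_for(key)].get(key)
```

Usage would be something like `cluster = ShardedStore(["db0", "db1", "db2"])` followed by `cluster.put("user:42", {"name": "Ann"})`. Note that simple modulo routing like this remaps most keys when the number of shards changes, which is part of why building it yourself takes so much time.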

2) Document-oriented databases like CouchDB are very nice when you just want to be able to throw data at them and deal with the variability later. My personal style is "always a prototype" and that is why I got so excited about this one. I had never read about document-oriented databases other than XML databases, and the ones I found had high learning curves or lacked features that were important to me. Mind you, I am only commenting on the open source space. I will most likely use this or a Ruby equivalent called RDDB (Rails is my currently preferred framework) because they make it SO easy to get up and running. Perfect for prototyping. CouchDB doesn't directly address the distributed aspect yet, but will provide facilities to make sharding a piece of cake.
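Here is a rough sketch of the "throw data at it and deal with the variability later" idea. This is not CouchDB's or RDDB's API; `DocumentStore`, `put`, `get`, and `view` are hypothetical names for an in-memory toy, assuming documents are JSON-like dicts and that a "view" is just a function run over every document after the fact.

```python
import json
import uuid


class DocumentStore:
    """Toy schemaless store: any JSON-serializable dict is accepted as a document."""

    def __init__(self):
        self._docs = {}

    def put(self, doc, doc_id=None):
        doc_id = doc_id or uuid.uuid4().hex
        # Round-trip through JSON to store a copy and reject non-serializable values.
        self._docs[doc_id] = json.loads(json.dumps(doc))
        return doc_id

    def get(self, doc_id):
        return self._docs[doc_id]

    def view(self, map_fn):
        # map_fn yields (key, value) pairs for each document it cares about.
        rows = []
        for doc_id, doc in self._docs.items():
            for key, value in map_fn(doc):
                rows.append((key, doc_id, value))
        # Sort by key, then document id, so results come back in a stable order.
        return sorted(rows, key=lambda row: (row[0], row[1]))
```

The point is that two documents don't need the same fields. One post can have a `tags` list and another can leave it out entirely, and a view function such as one that yields `(tag, title)` for each tag simply skips the documents that lack the field.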

3) Thrudb is what I was always thinking about when I wanted a database for EC2 and S3. The creator compares it to BigTable and talks about document-oriented databases and their advantages. I highly recommend the article. This is by far my favorite, not because I can judge its technical merit, but because it just seems really cool and easy to use. On the flip side, like most good service clouds, it is a set of services that take some time and effort to deploy.

4) At the far end of the spectrum is a style that isn't much different from a file system, except that it uses S3 to distribute the data. This style, mixed with messaging and processing (like EC2 and SQS) to update views and indices, allows for an awesome amount of data processing without the manual sharding requirements of the other options. This just sounds cool because it seems like it would be linearly scalable: just throw more EC2 instances at it to update the indices faster.
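A minimal sketch of that store-then-index-asynchronously pattern, with loud caveats: nothing here talks to AWS. The dict is a stand-in for S3, the deque is a stand-in for SQS, and `work` plays the part of an EC2 worker. `BlobIndexer` and all of its method names are hypothetical.

```python
from collections import defaultdict, deque


class BlobIndexer:
    """Toy pipeline: writes go to a blob store, a queue drives index updates."""

    def __init__(self):
        self.blobs = {}                 # stand-in for S3: key -> text
        self.queue = deque()            # stand-in for SQS: keys awaiting indexing
        self.index = defaultdict(set)   # derived index: word -> set of keys

    def put(self, key, text):
        # The write itself is cheap; indexing is deferred to a worker.
        self.blobs[key] = text
        self.queue.append(key)

    def work(self, max_messages=None):
        # One worker pass: pull queued keys and fold their words into the index.
        handled = 0
        while self.queue and (max_messages is None or handled < max_messages):
            key = self.queue.popleft()
            for word in self.blobs[key].lower().split():
                self.index[word].add(key)
            handled += 1
        return handled

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))
```

Until a worker has processed a key, `search` won't find it, so the index lags behind the data. In this picture, adding workers only shortens that lag. The sketch also never removes stale words when a blob is overwritten, which a real version would need to handle.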

I don't have any web systems bigger than 2MB and my desktop, but just thinking about the cool alternatives for data storage that are coming out makes me dreamy.