Tuesday, February 23, 2010

Webscale Computing and Hadoop

I've been using Hadoop extensively for a month or so now at the office. I've long been a fan of Doug Cutting's Lucene, so when I heard that he was also the guy behind Hadoop, that pretty much pushed me over the edge, and I started using Hadoop to build new types of database indexes from multi-gigabyte datasets.

Hadoop is an incredibly rich software stack consisting of three main pieces:
  1. The Hadoop Distributed File System (HDFS)
  2. A Map/Reduce infrastructure consisting of Daemons that layer services on top of HDFS
  3. A set of APIs allowing a data processing task to be split into simple Map and Reduce phases
There's a fourth piece, which is the many related tools (HBase, Hive, Pig, etc) which are layered on top of Hadoop.
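
To make that third piece concrete, here is a minimal sketch of the two methods Hadoop asks you to write, using the classic word-count example and the 0.20-era org.apache.hadoop.mapreduce API. The class names are my own; treat this as an illustration rather than production code.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

    // Map phase: emit (word, 1) for every word in an input line.
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      private final IntWritable total = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
          sum += count.get();
        }
        total.set(sum);
        context.write(key, total);
      }
    }
  }

Everything else, including splitting the input, shipping the map output to the right reducers, and writing the results back to HDFS, is handled by the framework.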

If one were to read a description of map/reduce in isolation, one might conclude that it is pretty much nonsense. In fact, it sounds too trivial to even be called an algorithm: put things into groups, operate on the groups. Big deal. It sounds like common sense. But until you work with Hadoop, you cannot really appreciate the benefits the Hadoop stack brings. It's the collective benefits of the entire stack that make programming with Hadoop such a game-changing experience.

With problems of processing huge datasets, the devil is in the details. Hadoop provides a framework that removes most of the annoying complexity associated with big datasets. In essence, you write two simple methods, and from that point on you don't care much whether you are operating on 1 byte or 1 TB. This is possible in large part because HDFS allows you to think of terabytes' worth of data as a simple URL such as hdfs://namenode/user/geoff/BigDataSet. HDFS takes care of replicating your data and figuring out where the data blocks actually reside in the cluster. As an added bonus, Hadoop automatically deals with nuances such as the files being compressed. But wait, there's more: from the Hadoop command line, you can run commands to cat and grep these URLs, again acting as if they were simple local files.
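The same "it's just a URL" convenience carries over to code. Here is a rough sketch of reading that file programmatically with Hadoop's FileSystem API; the path is the hypothetical one from above, and the class name is made up for illustration.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CatHdfsFile {
    public static void main(String[] args) throws Exception {
      // Hypothetical path; in practice you'd pass it in as an argument.
      String url = "hdfs://namenode/user/geoff/BigDataSet";

      Configuration conf = new Configuration();
      // The FileSystem layer talks to the namenode and finds the blocks for you.
      FileSystem fs = FileSystem.get(URI.create(url), conf);

      // Read the replicated, block-scattered file as if it were a plain stream.
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(fs.open(new Path(url))));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
      reader.close();
    }
  }

The point isn't the code itself, it's that nothing in it cares how big the file is or which machines it lives on.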

For me, one of the most interesting side effects of running a Hadoop cluster has been how it changes one's ability to interact with peers on a webscale computing project. You can now shoot your co-worker a URL to a 200GB file, and they can poke and prod at the contents by doing things like "hadoop fs -text hdfs://namenode/user/geoff/BigData".

I'll try to write more about this later, but for now, suffice it to say that Hadoop is quite exciting to work with. I think the criticisms from Michael Stonebraker have totally missed the point of what a good implementation of a map/reduce framework can yield, probably because they focused on the map/reduce algorithm, which in and of itself is trivial. And that's really the point: it's the tools and the entire stack that make map/reduce simply "part of this complete breakfast" when it comes to Hadoop. So don't forget the toast, juice, and yummy Cheerios!