Tuesday, February 23, 2010

Webscale Computing and Hadoop

I've been using Hadoop extensively for a month or so now at the office. I've long been a big fan of Doug Cutting's Lucene technology, so when I heard that he was also the guy behind Hadoop, that pretty much pushed me over the edge, and I started using Hadoop to build new types of database indexes from multi-gigabyte datasets.

Hadoop is an incredibly rich software stack consisting of three main pieces:
  1. The Hadoop Distributed File System (HDFS)
  2. A Map/Reduce infrastructure consisting of daemons that layer services on top of HDFS
  3. A set of APIs allowing a data processing task to be split into simple Map and Reduce phases
There's also a fourth piece: the many related tools (HBase, Hive, Pig, etc.) layered on top of Hadoop.
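The division of labor between the phases can be sketched in plain Java. This is not Hadoop code, just an illustration of the map, shuffle, and reduce steps that the framework performs for you, using the classic word-count example:

```java
import java.util.*;

public class WordCountSketch {

    // Map phase: one input record in, zero or more (key, value) pairs out.
    static void map(String line, List<Map.Entry<String, Integer>> output) {
        for (String word : line.split("\\s+")) {
            output.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
    }

    // Shuffle: group all values by key. In Hadoop the framework does
    // this for you, across the whole cluster.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return groups;
    }

    // Reduce phase: one key and all of its values in, one result out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] { "hadoop is fun", "hadoop scales" }) {
            map(line, pairs);
        }
        Map<String, List<Integer>> groups = shuffle(pairs);
        System.out.println(reduce("hadoop", groups.get("hadoop"))); // 2
    }
}
```

In real Hadoop you only write the map and reduce methods; the shuffle, and the distribution of all three steps across machines, is the part the framework owns.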

If one were to read a description of map/reduce, one might conclude that it is pretty much nonsense. In fact, it sounds too trivial to even be called an algorithm: put things into groups, operate on the groups. Big deal. It sounds pretty much like common sense. Until you work with Hadoop, you really cannot appreciate all of the benefits that the Hadoop stack brings. It's really the collective benefits of the entire stack that make programming with Hadoop such a game-changing experience.

With problems of processing huge datasets, the devil is in the details. Hadoop provides a framework that removes much of the annoying complexity associated with big datasets. In essence, you write two simple methods, and from that point on you don't care much whether you are operating on 1 byte or 1 TB. This is possible in large part because HDFS allows you to think of terabytes worth of data as a simple URL such as hdfs://namenode/user/geoff/BigDataSet. HDFS takes care of replicating your data and figuring out where the data blocks actually reside in the cluster. As an added bonus, Hadoop automatically deals with nuances such as input files being compressed. But wait, there's more: from the Hadoop command line, you can run commands to cat and grep these URLs, again acting as if they were simple local files.

For me, one of the most interesting side effects of running a Hadoop cluster has been how it changes one's ability to interact with peers on a webscale computing project. You can now shoot your co-worker a URL to a 200GB file, and they can poke and prod at the contents by running commands like "hadoop fs -text hdfs://namenode/user/geoff/BigData".

I'll try to write more about this later, but for now suffice it to say that Hadoop is quite exciting to work with. I think the criticisms by Michael Stonebraker have totally missed the point of what a good implementation of a map/reduce framework can yield, probably because they focus on the map/reduce algorithm, which in and of itself is trivial. And that's really the point: it's the tools and the entire stack that make map/reduce simply "part of this complete breakfast" when it comes to Hadoop. So don't forget the toast, juice, and yummy Cheerios!

Sunday, February 21, 2010

HATEOAS - Hypermedia As The Engine Of Application State

Roy Fielding's PhD dissertation, Architectural Styles and the Design of Network-based Software Architectures, is the seminal paper on the RESTful style of web architecture. Before discussing the paper, let's first discuss what kind of paper it is, because its approach seems quite unusual for a PhD dissertation. Reading it in the 21st century, one might conclude that it is an "experience paper" (as defined here). As noted there, "An experience paper that merely confirms that which is well known as opposed to that which is widely believed is of little value." In the early 1990s, however, the topic of the dissertation was probably not widely known. The irony now is that as the term REST becomes more and more commonplace, and is applied to non-RESTful systems, the level of confusion over REST might be rising, not falling.

I was certainly confused for a long time about what constitutes a RESTful system. Isn't anything that has a URL RESTful? Definitely not: the presence of a resource identifier is just one of several properties that make a system RESTful. Fielding has recently written this article listing the properties he sees as essential to a RESTful system. After working quite a bit on the RESTful system for NextDB.net, it seems to me that the principle of Hypermedia As The Engine Of Application State (HATEOAS), identified in section 5.1.5 of the original dissertation, might be the most important principle of all. Incidentally, of all the properties outlined in the recent article, Fielding chose to title the article "REST APIs must be hypertext-driven". To my mind, that confirms my hunch that HATEOAS is perhaps the least understood principle of REST, and the principle in need of the most discussion and explanation.

The essence of HATEOAS is that the client needs to understand as little as possible. In essence, this is quite similar to what a human experiences when browsing a website. In order to move from one page to the next, we must be given links that we can navigate. This is our common experience on web pages. Consider an alternative, and less user-friendly, scenario in which the home page simply consisted of a set of instructions (an API, if you will). One instruction might read: "To view articles about sports, visit a URL of the form http://sitename/topics?name=topic-name and replace topic-name with 'SPORTS'." You would be forced to manually construct the URL and enter it into the browser. The notion of such a website seems absurd, and yet it describes the scenario common to RPC or "fake REST" APIs, in which the client is responsible for constructing the state-transition links rather than being handed a hypermedia document that provides the available state transitions.

Quoting Fielding: "...all application state transitions must be driven by client selection of server-provided choices that are present in the received representations or implied by the user’s manipulation of those representations".
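To make that concrete, here is a toy sketch of what "client selection of server-provided choices" means in code. The markup and URLs are made up for illustration, not taken from any real API; the point is that the client's only job is to extract and pick from the links it was handed:

```java
import java.util.*;
import java.util.regex.*;

public class HypermediaClient {
    // Pull the server-provided choices (link text -> href) out of a
    // received HTML representation.
    static Map<String, String> choices(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = Pattern.compile("<a href=\"([^\"]+)\">([^<]+)</a>").matcher(html);
        while (m.find()) {
            links.put(m.group(2), m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // The server hands us our possible state transitions...
        String page = "<a href=\"/topics/sports\">sports</a> <a href=\"/topics/news\">news</a>";
        // ...and the client merely selects one; it never assembles a URL
        // from out-of-band instructions.
        System.out.println(choices(page).get("sports")); // /topics/sports
    }
}
```

Contrast this with the RPC-style client, which would need the string-concatenation rules from the documentation baked into its code, and which breaks the moment the server changes its URL structure.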

This is an interesting article that shows how non-HTML hypermedia can be used in keeping with HATEOAS. However, I think there are excellent reasons why Plain Old HTML (POH) is the best hypermedia choice for use with RESTful systems:

  1. HTML was made for hypermedia (it has well-known structures for encoding links)
  2. Your mom knows what HTML is
  3. By using POH for HATEOAS, your RESTful system's content can be indexed by search engines
  4. You can easily view and traverse your system states using a web browser
POH for HATEOAS is the approach taken by NextDB's REST system. In the documentation, we do describe the explicit structure of several of the URLs, but we stick strongly to the principle outlined by Fielding that, beyond the entry-point URL, or "bookmark", you shouldn't need to know the structure of any of the URLs. Using POH for HATEOAS, the proof of NextDB's adherence to this principle is quite straightforward: I simply give you an entry-point URL (a bookmark) to a table and allow you to navigate the links served in the POH. For example, here is the bookmark to an account named 'geoff' with a database named 'testchars' and a table named 'lines':

From the URL above you can click links to page through and sort the content. No knowledge of an API is required. Taken to its logical conclusion, I see no reason why a RESTful service should not be allowed to directly present styled content when the consumer of the content is human. So that is exactly what we do in NextDB, by allowing the POH to include CSS. Here is an example:


We're technically violating HATEOAS because we don't provide links to the styled content; rather, the developer has to know about the available styles by reading our documentation. However, we'll soon correct this, as well as allow the developer to pass in links to his own CSS for inclusion in the returned POH.

Finally, not only do we allow you to style the POH, but we also allow you to apply XSLT transformations to the POH in order to alter its structure (as opposed to its style). Fielding discusses "processing elements", or "transformers of data", in his thesis. I believe XSLT to be the POH analog of processing elements, in that it is well understood, easily encapsulated in markup, and supported by a wide array of processors (including browser-side processing, although this tends to be less portable, which is why NextDB opted to perform the transformation on the server).
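As a sketch of the idea, using the JDK's built-in XSLT processor and made-up markup (not NextDB's actual output), a stylesheet can restructure a served document, here turning table cells into list items:

```java
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import java.io.*;

public class XsltDemo {
    // A tiny stylesheet that changes structure, not style: table cells
    // become list items.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/table'><ul><xsl:apply-templates/></ul></xsl:template>"
      + "<xsl:template match='td'><li><xsl:value-of select='.'/></li></xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<table><td>a</td><td>b</td></table>"));
        // <ul><li>a</li><li>b</li></ul>
    }
}
```

Because the stylesheet is just markup, it travels well: the same transformation can in principle run on the server or in the browser, which is exactly the "processing element" quality being claimed for it.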

In summary, NextDB is a truly RESTful database. This is important not because it conforms to a buzzword, but because it has made NextDB so easy to use that even non-programmers are able to embed the POH hypermedia in their sites. Project Have Hope is a great example of one such site. The "catalogs" of sponsorable children and women in Africa are POH served out of NextDB.net. The HATEOAS architecture is the key to opening the database content to CMS-engine-driven sites, as well as to web search engines.

Monday, February 15, 2010

Spam Bucket

I'm curious to see how much spam accumulates in this database if I put a publicly writable database table on my blog.

UPDATE: I have the answer: "A LOT"
"Salad Man", I commend you! "get three inches on your dick!" is a noble spam to leave on my blog, but I am disappointed you did not include a video!!!

Saturday, February 13, 2010

NextDB REST tables

NextDB REST tables can be easily embedded anywhere on the web. For example:
<iframe src='http://www.nextdb.net/nextdb/rest/geoff/vids/sk8;~cols=PK,video_CONTENT_TYPE,video_FID/style/newspaper-a/pgsz/3'></iframe>

Friday, February 12, 2010

An update on JTIdy and XSLT

A couple of months back I blogged about JTidy; I have an update to that story. If you plan to run your XHTML through an XSLT transformation, don't use tidy.setXHTML(true). The reason for this I found after a lot of debugging. There are named entities, such as &acirc; (and many others), that are declared in the XHTML DTD. And guess what? They are NOT valid in plain XML.

Quick refresher: XML has only five named entities that it supports: &lt;, &gt;, &amp;, &apos;, and &quot;

So you're thinking, no biggie, I don't think my documents will ever have a &acirc; named entity in them, so what do I care? Well, they can creep in by accident if you are allowing users to save input into a database. For example, many ISO-8859-1 (Latin-1) and Windows-1252 characters are encoded as single bytes that are not valid UTF-8 sequences. One that I have repeatedly seen over the years from European users is the "left quote". It doesn't even exist on an American keyboard, and when it gets posted to a system expecting UTF-8 it wreaks havoc, causing a run of several incorrectly decoded characters.

And so, you can get "inadvertent" named entities in your XHTML output due to this sort of character mangling, where the UTF-8 byte stream interpretation gets borked. So now, instead of just getting some gibberish in your XSLT, the entire transformation crashes, and you get no output at all except a complaint that the input document had an unsupported character reference. Hence the solution: tell JTidy NOT to produce XHTML, but instead to produce plain old XML, like this: tidy.setXmlOut(true); When you do this, JTidy doesn't put these named entities in anymore.
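You can see the failure with nothing but the JDK's own XML parser; this isn't JTidy-specific, it just demonstrates why the named entity crashes downstream XML tooling while a numeric character reference does not:

```java
import javax.xml.parsers.*;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class EntityCheck {
    // Attempt to parse a fragment as plain XML (no DTD); report success.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The XHTML named entity is undeclared in plain XML...
        System.out.println(parses("<p>&acirc;</p>")); // false
        // ...but the equivalent numeric character reference is fine.
        System.out.println(parses("<p>&#226;</p>")); // true
    }
}
```

The same distinction holds for any XSLT processor, since its first step is an XML parse of the input document.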

One other surprise I ran into is that XML actually *DOES* support numeric character references. So you will still get numeric character references in the XML output, such as &#128;, which, although specifically forbidden by HTML, still render properly in most browsers. So JTidy happily outputs these numeric character references in its XML output. You are now safe to apply XSLT transformations to a valid XML input; just be aware that the XHTML output of your XSLT transformation may still contain these forbidden, but practically tolerated, numeric references.

Confused yet? Me too. Let's all speak Esperanto and adopt 7-bit ASCII as the only allowed character set.