Tuesday, May 06, 2014

A new post after a long hiatus

It's been years since I posted on this blog. The reason I stopped is that Google was erroneously flagging this blog as a security risk. This happened because I was posting cross-domain AJAX code and widgets directly into my blog and sidebar, using nextdb.net. Even after methodically removing them all (which ruined some cool elements of my blog), Google still refused to correct the classification of this blog. This really irked me, and I said "to heck with blogger.com". After many years I took another look at my dormant blog, and I see Google is no longer flagging it. Well, maybe I will start posting again here.

Wednesday, August 10, 2011


The following is excerpted from the hbase-user@apache.org mailing list...an instant classic!
MongoDB does an excellent job at single-node scalability - they use mmap and many smart things and really kick ass ... ON A SINGLE NODE.

That single node must have RAID (RAID is going out of fashion, btw), and you won't be able to scale without resorting to:
- replication (complex setup!)
- sharding

mongo claims to help on the last item, but it is still a risk point.

For really large data that must span multiple machines, there is no "clustered sql" type solution that isn't (a) borked in various ways (Oracle RAC, I'm looking at you) or (b) stupid expensive (Oracle RAC, STILL looking at you)

Tools like HBase give you scalability at the cost of features (no automated secondary indexing, no query language).

Welcome... to... big... data.


On Thu, Aug 11, 2011 at 12:44 AM, Edward Capriolo <> wrote:
> On Wed, Aug 10, 2011 at 4:26 PM, Li Pi <> wrote:
>> You'll have to build your own secondary indexes for now.
>>
>> On Wed, Aug 10, 2011 at 1:15 PM, Laurent Hatier wrote:
>>> Yes, I have heard of this index, but is it available on HBase 0.90.3?
>>>
>>> 2011/8/10 Chris Tarnas <>
>>>
>>>> Hi Laurent,
>>>>
>>>> Without more details on your schema and how you are finding that number in your table, it is impossible to fully answer the question. I suspect what you are seeing is mongo's native support for secondary indexes. If you were to add secondary indexes in HBase then retrieving that row should be on the order of 3-30ms. If that is your main query method then you could reorganize your table to make that long number your row key; then you would get even faster reads.
>>>>
>>>> -chris
>>>>
>>>> On Aug 10, 2011, at 10:02 AM, Laurent Hatier wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to know why MongoDB is faster than HBase at selecting items. I explain my case: I've inserted 4'000'000 lines into HBase and MongoDB, and I must calculate the geolocation with the IP. I calculate a Long number from the IP and I go to find it in the 4'000'000 lines. It takes 5 ms to select the right row with Mongo, whereas HBase takes 5 seconds. I think the reason is the method cur.limit(1) with MongoDB, but is there no function like this with HBase?
>>>>>
>>>>> --
>>>>> Laurent HATIER
>>>>> Second-year engineering student (Cycle Ingénieur) at EISTI

Sunday, January 23, 2011

HBase 0.89 Bulk Import capability

I recently posted about my experience tuning HBase for mass inserts. When I learned that HBase actually supports a way to *avoid* the inserts altogether, I got really excited. As the Joshua program says, "the only way to win the game is not to play"! What better way to speed up inserts than not to do them at all? You can't get much faster than zero. So how do you get data into HBase if you don't do inserts? Aha! You can actually generate HFileOutputFormat files that essentially *are* database tables. And believe me, compared to running Puts from your reducer it is fast, and you don't have the "attack my database and maybe crash it" issue. To give you an idea of how fast it is, I had a job whose reducers ran for 45 minutes doing Puts. The bulk loader took 19 seconds to load the files, and it only took a couple minutes to generate the files from the reducer. So the whole bulk load process, end to end, was maybe 2 minutes.

The main "gotchas" along the way: The documentation isn't entirely clear about the fact that if you use HFileOutputFormat.configureIncrementalLoad, that your reducer is silently replaced by an internal HBase class that insures total ordering. Secondly, your mapper must output either Put or KeyValue. This is not a usable way to actually write mapreduce jobs (since you lose the ability to pass domain objects between map and reduce). Therefore, I found it best to alter my existing mapreduce jobs to produce a SequenceFileOutputFormat, in which I wrote Put from my reducer as the output class (since it doesn't interfere with your job logic/classes at all to change only the *final* output class to a Put). In fact, the class I wrote to generate the HFileOutputFormat will actually accept sequence files as input if they contain either a Put, or a single column value. I included support for a single column value just 'cause I had a couple jobs lying around that produced only a single column, and would have been a hassle to convert them to producing Puts. Then my generic bulk loader mapreduce job can be run on any of my output sequence files.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BulkLoader extends Configured implements Tool {

    public static class BulkLoaderMapper extends Mapper<WritableComparable, Writable, ImmutableBytesWritable, Put> {

        protected long version;
        protected String column;
        protected String family;

        protected void setup(Context context) throws IOException, InterruptedException {
            this.version = context.getConfiguration().getLong("version", System.currentTimeMillis());
            this.column = context.getConfiguration().get("column");
            this.family = context.getConfiguration().get("family");
        }

        private byte[] marshal(Writable o) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            o.write(dos); //serialize the Writable into the byte stream
            dos.flush();
            return bos.toByteArray();
        }

        protected byte[] getHBaseRowKeyBytes(Writable k) throws IOException {
            return marshal(k);
        }

        protected byte[] getHBaseFamilyNameBytes() {
            return Bytes.toBytes(family);
        }

        protected byte[] getHBaseColumnNameBytes() {
            return Bytes.toBytes(column);
        }

        protected byte[] getHBaseValueBytes(Writable v) throws IOException {
            return marshal(v);
        }

        protected void map(WritableComparable key, Writable val, Context context) throws IOException, InterruptedException {
            if (context.getConfiguration().get("input-format", "column").equals("put")) { //the serialized file actually is full of Puts
                Put p = (Put) val;
                //I am baffled as to why I cannot use getHBaseRowKeyBytes(key) here. It seems to not produce the correct key.
                //Therefore I'm using p.getRow() to get the rowkey bytes.
                context.write(new ImmutableBytesWritable(p.getRow()), p);
            } else { //the serialized file holds a single column value per record
                Put put = new Put(getHBaseRowKeyBytes(key));
                put.add(getHBaseFamilyNameBytes(), getHBaseColumnNameBytes(), version, getHBaseValueBytes(val));
                context.write(new ImmutableBytesWritable(getHBaseRowKeyBytes(key)), put);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new BulkLoader(), args);
        System.exit(exitCode);
    }

    public int run(String[] otherArgs) throws Exception {
        Configuration config = (Configuration) getConf();
        Job job = new Job(config);

        System.out.println("sequencefile-input: " + config.get("sequencefile-input"));
        System.out.println("output-path: " + config.get("output-path"));
        System.out.println("bulk-load-table: " + config.get("bulk-load-table"));
        System.out.println("family: " + config.get("family"));
        System.out.println("column: " + config.get("column"));
        System.out.println("input-format: " + config.get("input-format", "column"));

        job.setJarByClass(BulkLoader.class);
        job.setMapperClass(BulkLoaderMapper.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        SequenceFileInputFormat.addInputPath(job, new Path(config.get("sequencefile-input")));

        Configuration hConfig = HBaseConfiguration.create(config);
        hConfig.setLong("version", System.currentTimeMillis());
        hConfig.set("hbase.zookeeper.quorum", config.get("zk", "zk1.x.com, zk2.x.com"));
        job.setJobName("Bulk Loading table: " + hConfig.get("bulk-load-table", "YOU HAVE NOT SET bulk-load-table PARAMETER"));
        HFileOutputFormat.setOutputPath(job, new Path(config.get("output-path")));
        HFileOutputFormat.configureIncrementalLoad(job, new HTable(hConfig, config.get("bulk-load-table")));

        job.waitForCompletion(true);
        return 0; //normal exit
    }
}

Finally, the 0.89 jar seemed to have some packaging problems. I had to create a "lib" directory (used by hadoop in job jars) and re-jar the 0.89 jar with guava.jar and perhaps a couple other missing dependencies. With that done, I was able to run the bulk loader with the completebulkload option via an ant command.

Saturday, January 08, 2011

Facebook announces websearch to rival Google's

Admittedly I've made up the headline "Facebook announces websearch to rival Google's", but is it really such an outlandish proposition? It's certainly attention grabbing, and why not? Google's websearch has basically remained the same product for 10 years. Any engineer at Google would debate this, but I doubt anyone would debate that the engineers at Facebook have achieved a stunning mastery over big data sets and management of the social network.
Based on the considerable "Big Data" and real-time data expertise at Facebook, would anyone really be surprised if such a headline surfaced? I'd find it quite a believable story: "Facebook announces intention to best Google at web search." It's hard to overstate the earthquake that such an announcement would send rumbling through Silicon Valley, and the entire world. In fact, what better way for Facebook to spend their $500 million in Goldman Sachs cash than by acquiring promising engineers and technologies to directly attack Google's biggest Web cash cows: search and maps. Bloodying the nose of the competition by competing with their core value has always been Google's way: if an operating system is of value, Google builds an OS. If mobile computing is of value, Google builds a mobile OS, and its own phone hardware. The day is going to come when a Google rival will compete with Google on Google's home turf: web search.

Tuesday, January 04, 2011

Minhashing is reaaally cool

What is the Jaccard Coefficient? Answer: one of the most important concepts in information retrieval and the theory of sets. The Jaccard Coefficient is very simple. It is the measure of the fractional similarity of two sets. Let's take a simple example: {A,B,C,D,E,F} vs. {B,E,F}. The fractional similarity of these sets is 3/6. Easily you can see that 3 elements are shared (the intersection is {B,E,F}), whereas the total number of items in both sets (the union) is 6 elements ({A,B,C,D,E,F}).

So without any mumbo-jumbo, we can see the Jaccard Coefficient (JC) is just the ratio of sameness (intersection) over the total amount of stuff present. Technically it's the cardinality (count) of elements in the intersection over the cardinality (count) of elements in the union. So while the JC between {A,B,C,D,E,F} and {B,E,F} is 50%, the JC between {A,B,C,D,E,F,G,H,I} and {B,E,F} is only 33.33%. This illustrates the importance of the total amount of stuff present: the more irrelevant (non-intersection) stuff is present, the lower the Jaccard Coefficient.
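To make that concrete, here's a tiny sketch of my own (not part of the original discussion) that computes the JC with plain java.util sets:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Jaccard {
    //JC = |A intersect B| / |A union B|
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> s1 = new HashSet<String>(Arrays.asList("A", "B", "C", "D", "E", "F"));
        Set<String> s2 = new HashSet<String>(Arrays.asList("B", "E", "F"));
        System.out.println(jaccard(s1, s2)); // 3/6 = 0.5
        Set<String> s3 = new HashSet<String>(Arrays.asList("A", "B", "C", "D", "E", "F", "G", "H", "I"));
        System.out.println(jaccard(s3, s2)); // 3/9 = 0.3333...
    }
}
```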

Now pretend we have two bags. Bag 1 contains Set 1, and Bag 2 contains Set 2. I tell you the JC between these two sets is 95%. I reach into Bag 1 and I pull out the element "M". What is the probability that the element "M" is also in Bag 2? It's 95%. That's what the Jaccard Coefficient tells me. But now here is something quite amazing. What is the probability that the lowest symbol (lowest according to some ordering, like a hashcode) in Bag 1 is the *same* letter as the lowest letter in Bag 2? Again, it's 95%. Why? Some letter has to be the lowest one in a bag. Since only 5% of the letters in the other bag ought to be different, there is a 95% chance that the lowest symbol in the second bag falls in the intersection of the sets, and is consequently the same.

Now it really gets interesting. If I group sets by a key, where the key is the lowest element of the set, then the probability that two sets get grouped by the same key is again the Jaccard Coefficient. That's cool. That's BADASS. Why? It's an almost shockingly simple way to create clusters of similar sets.
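That grouping trick can be sketched in a few lines. This is my own toy example, not from the original post: I key each set by its lowest element under plain String ordering (in real minhashing the ordering would come from a hash function), and the two near-identical sets land in the same bucket:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MinKeyClustering {
    //cluster key = the lowest element of the set
    static Map<String, List<List<String>>> cluster(List<List<String>> sets) {
        Map<String, List<List<String>>> clusters = new HashMap<String, List<List<String>>>();
        for (List<String> set : sets) {
            String key = Collections.min(set);
            if (!clusters.containsKey(key)) {
                clusters.put(key, new ArrayList<List<String>>());
            }
            clusters.get(key).add(set);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<List<String>> sets = Arrays.asList(
                Arrays.asList("B", "C", "D", "E", "F"),
                Arrays.asList("B", "C", "D", "E", "G"), //highly similar to the first set
                Arrays.asList("X", "Y", "Z"));          //unrelated set
        //the two similar sets share the bucket keyed "B"; the unrelated one keys on "X"
        System.out.println(cluster(sets));
    }
}
```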

I find a lot of convoluted explanations of minhashing, but you really don't need to know more than this guy's explanation. Here's a short piece of code that, given a sentence, splits the sentence into words, then builds N-grams out of the words (the N-grams are the set elements). Surrounding a word with ^ and $ is a well-known way of demarcating N-grams that are beginnings and endings of words. A SHA-1 cryptographic hash function is a nice way to assign a random but consistent ordering to any possible N-gram we might come up with. As suggested by this guy, rather than clustering on the *single* lowest element, we can tighten the clusters up by making the cluster key the numMins smallest elements, concatenated. I use this in a mapreduce job, where I write the minhash out as the key from the mapper, and the reducer collects all the sentences with the same minhash key, thereby clustering similar sentences. And lo and behold...it works.

    protected String minHash(String sentence, int nGramLength, int numMins) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        PriorityQueue<String> queue = new PriorityQueue<String>();
        for (String part : sentence.split("\\s")) {
            for (String gram : getNGrams("^" + part + "$", nGramLength)) {
                //the SHA-1 hash imposes a random-but-consistent ordering on the N-grams;
                //fixed-width hex keeps lexicographic order consistent with numeric order
                queue.add(String.format("%040x", new BigInteger(1, md.digest(gram.getBytes("UTF-8")))));
            }
        }
        StringBuilder minHash = new StringBuilder();
        while (null != queue.peek() && numMins > 0) {
            minHash.append(queue.poll()); //concatenate the numMins smallest hashes to form the cluster key
            numMins--;
        }
        return minHash.toString();
    }

Sunday, December 26, 2010

Tuning HBase and Mapreduce for Mass Inserts

For several months my colleagues and I have been earning our battle scars as we run progressively larger bulk inserts into HBase. I'm going to try to recall as many of the lessons learned as possible. At the time of this writing HBase 0.20.6 is the latest production release:

  1. Use HBase 0.20.6. We had been running 0.20.3, which suffered from a hideous deadlock bug that would cause clients to simply stall and do nothing until they were killed by the Hadoop task tracker. Even worse, because they were killed and restarted, this caused a vicious cycle of reducers stalling, restarting, stalling, etc. All the while, threads and client connections were eaten up in HBase.
  2. Use huge numbers of mappers, and huge numbers of reducers, so that your data sets are broken into bite-sized pieces. To do this, set mapred.tasktracker.[map|reduce].tasks.maximum (the max value that will be run by a single task tracker at once) to a value correlated to the number of cores on each box (we'll call this "C") and the number of boxes "B". Then set mapred.[map|reduce].tasks (the total number of mappers or reducers that will get run) equal to a*C*B. "a" is then a free parameter which controls how big your bite-sized batches of data will be. As a rule of thumb, if you are reading data out of HBase via TableMapper, I try to make a*C*B between 1000 and 10000, but it all depends on how much processing you need to do on each chunk.
  3. This is an addendum to rule (2). All sorts of bad things happen when your mappers or reducers bite off too much data, and run for too long. Just a few items on the buffet of errors and warnings that you'll chase if your mappers or reducers try to handle too much data:
    • HBase client leases expiring
    • Chance of OoME
    • Failure to report status for more than 600 seconds causes task to get shot in the head
    • HBase client slows down on writes (anecdotal. I can't prove this, but I know that if you want to avoid potential slowdowns, then following rule 2, in my experience, helps a lot)
  4. Use fine-grained status. Make calls to progress(), setStatus(msg), and use counters to track what your job is doing. Nothing is more annoying than not being able to tell what the hell your job is doing when it has been running for 5 hours. Rule 2 helps here too: the more mappers and reducers you have, the more continuous and fine-grained your status reporting will be as each task completes its data load into HBase.
  5. Don't screw around too much with the default settings. Ryan Rawson has put out a good set of defaults in this PPT. You can turn these dials ad infinitum, but I can pretty much promise you that there is no magic set of settings. This can be a massive time sink. Stick with Ryan's recommendations.
  6. Watch out for problems with your hostnames. Let's say your hostname is "s1.x.com". If you execute "hostname" and all you get back is "s1" or "localhost" you are in for problems. Check your /etc/hosts and make sure that aliases like "s1" don't come before the fully qualified hostname.
  7. Watch out for misconfigured NICs. Problems with vanishing packets sent me chasing my tail for weeks. This was the classic buffet of errors, with everything seemingly leading to no resolution.
  8. Disable the Write Ahead Log from your client code, and use an appropriately sized write buffer. There's a point of diminishing returns here: exceeding the size of the server-side buffers won't do you any good.
  9. Don't try to get too smart. Most bulk HBase data-producing M/R jobs are going to do some processing, then write the data from the reducer. Since the reducers all receive keys in the same order, this causes all the reducers to attack the same HBase region simultaneously. We had this "great idea" that if we reversed the keys that we wrote out of our mapper, then un-reversed them in the reducer, our reducers would be randomly writing to different region servers, not hitting a single region in lock step. Now, I have some theories on why this seemingly innocuous approach repeatedly destroyed our entire HBase database. I won't wax philosophical here, but one thing is certain: any table created via batch inserts of randomized keys got totally hosed. Scans became dirt slow and compactions ran constantly, even days after the table was created. None of these problems made a whole lot of sense, which is why it took 3-4 weeks of debugging for us to back this "key randomizing" out of our code. The hosed tables actually had to be dropped for the problem, and the ensuing chaos, to totally abate.
  10. tail all your logs and scan them assiduously for errors and warnings. I use VNCserver with different desktops for tails of all the logs so that I never have to "re-setup" all my tailf commands when I log in and out of the server.
  11. Disable speculative execution. You don't want two reducers trying to insert the exact same content into the database.
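To put some illustrative numbers on rule 2: suppose C=8 cores per box, B=10 boxes, and you pick a=25. Then a*C*B = 25*8*10 = 2000 total reduce tasks, each chewing on roughly 1/2000th of the data. In mapred-site.xml (or per-job configuration) that works out to something like the following sketch; the values here are made-up examples, not recommendations:

```xml
<!-- max reduce tasks run concurrently by one tasktracker: tied to cores per box (C=8) -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
<!-- total reduce tasks for the job: a*C*B = 25*8*10 = 2000 -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>2000</value>
</property>
```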

Monday, November 29, 2010

Black Friday Buys

Picked up a wireless access point for 10 bucks on buy.com. Placed it at the highest point in the house, and set it up to pull its own IP address via DHCP off my existing wireless access point, so it's basically a range extender. For a $10 "TRENDnet", I was really impressed at how easy it was to set up. I tried the same thing with a Linksys several months back. The Linksys would work on and off and just periodically die, and it cost something like $50. Anyway, I'm posting from my living room and receiving a good signal, so definitely worth the $10.

Saturday, November 27, 2010

NoSQL: Quite Possibly The Stupidest Catch Phrase Ever?

A recent article on Gigaom is titled "Why cloud computing sells and NoSQL is fading".  I'd like to parlay that discussion into one of "why NoSQL is quite possibly the stupidest catch phrase ever". Here's where I'm coming from. Clearly the "NoSQL" moniker arose from a desire to describe systems that avoided the scalability limits of relational databases. But from the get-go, the term is just...STUPID. First of all, SQL is a query language, and there are relational systems that don't use SQL. So, "NoSQL" isn't even a good way to say "non-relational". Secondly, defining systems in terms of what they *don't* do means that dozens of systems that shouldn't get grouped together, do get grouped together. Corn: NoSQL! Windbreakers: NoSQL!

Adding to the confusion was this classic: MapReduce: A Major Step Backwards. It's amazing that this was written by Michael Stonebraker, but this just goes to show you that even a genius can write a ridiculous article when he's writing for a PR blog. Then there was the 180-degree follow-up (ooops, how stupid do we look?): Reaffirming Our Commitment to Hadoop and MapReduce.

Honestly, the entire NoSQL thing is, to me, a mark of short-sighted PR, as well as a failure to recognize that SQL databases can and must peacefully co-exist with systems designed to tackle other aspects of scalability and data processing. When Hadoop and Mapreduce get thrown in with "NoSQL" it especially drives me nuts, because Hadoop and Mapreduce are not even data retrieval or management systems! C'mon folks, Hadoop is for data processing. And there are lots of reasons to take the output and put it in a relational database.

So, I was happy to see the Gigaom article. The NoSQL emperor has no clothes. It's a stupid catch phrase, and a clear indicator that the person using it is clueless.

Sunday, October 24, 2010

SPIRE 2010 Los Cabos

I recently attended the String Processing and Information Retrieval conference (SPIRE 2010). In terms of content, this conference far exceeded my expectations: excellent presentations on the latest research, including stuff I'd *never* even heard of (Wavelet Trees, for example), and very friendly presenters, who enthusiastically spent time with me between presentations to thoroughly teach some of the concepts.

 Yahoo! Research presents Time Explorer

Microsoft Research Keynote

No complaints about the venue either. Met so many cool and interesting researchers in the field of IR (which is basically "search").

Java SimpleDateFormat: simply a waste of Time

I've now sunk probably 8 hours into trying to make a minimally robust date parser that can handle dates in a variety of formats. Sadly, I'm afraid this is the best I've been able to do, given the fragility of the Java SimpleDateFormat class. I've composed a little framework in two pieces. The first piece is a properties file that defines the formats that the user will be allowed to enter dates in. The second piece is the code that jumps through an absurd series of hoops in order to work around bugs and shortcomings in SimpleDateFormat. If anyone knows a better way, that's been *tested*, please, for the love of GOD, post a comment here.
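Before the framework itself, here's a tiny standalone demonstration (my own, separate from the framework below) of the kind of silent truncation that makes SimpleDateFormat so maddening: parse() happily consumes a matching prefix of the input and ignores whatever trails it, without throwing:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class TruncationDemo {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        //no ParseException here: the time and timezone are silently thrown away
        Date d = fmt.parse("2010-09-23 12:22:23.000 PST");
        System.out.println(fmt.format(d)); // prints 2010-09-23
    }
}
```

This is exactly why the code below "regurgitates" each parsed date and compares it against the input before trusting the result.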


format1=yyyy-MM-dd HH:mm:ss.SSS z
format2=yyyy-MM-dd HH:mm:ss.SSS
format3=yyyy-MM-dd HH:mm:ss
format4=yyyy-MM-dd HH:mm
format5=E, MMM dd, yyyy
format6=MMM dd yyyy
format7=MMM dd yyyy HH:mm
format8=MMM dd, yyyy
format9=MMM. dd, yyyy
format14=yyyy-MM-dd, E
format15=yyyy MMM. dd, HHh 1mmm ss's'

Notice in the formats above that all months and days are defined with two format chars, e.g. "MM" or "dd". Why not "M" or "d"? SimpleDateFormat format strings are very rigid. We want the user to be able to enter 10/1/07 or 10/01/07, but I don't want to clutter the properties file with single and double format chars. For this reason, the code automatically pads any single digit with a zero, so "1" in isolation is converted to "01". If the user enters "Jul 7, 2001", the string is automatically zero-padded to "Jul 07, 2001". This allows the properties file to include only the double-digit variants of date numbers, without having to account for every variation in which a number could be either one or two digits. With regard to the code below, I haven't taken time to extract it as a library, so please forgive the unexplained parameters such as "id" and "paramName". This code converts the time provided by the user into the current Greenwich Mean Time.

Java Source Fragments

private static final Pattern ISOLATED_DIGIT = Pattern.compile("(^|\\D|\\s)(\\d)(\\D|$|\\s)"); //a digit can be surrounded by non-digits, but also the beginning or end of line, or whitespace

public static Object parseDate(String dateString, String id, String paramName) {
    if (null == dateString) {
        return dateString;
    }
    if (0 == dateString.trim().length()) { //(reconstructed guard: blank input falls back to "now")
        //default time stamp is Zulu time (time at prime meridian, Greenwich)
        long currentTime = System.currentTimeMillis();
        Timestamp ts = new java.sql.Timestamp(currentTime - TimeZone.getDefault().getOffset(currentTime));
        return ts;
    }
    dateString = zeroPadIsolatedDigits(dateString);
    if (log.isDebugEnabled()) log.debug("input zero-padded to " + dateString);
    for (Map.Entry entry : DATE_FORMATS.entrySet()) {
        DateFormat dateFormat = new SimpleDateFormat(entry.getValue().toString());
        try {
            java.util.Date date = dateFormat.parse(dateString);
            //I have found bugs in the parse method, in which strings that *don't* actually match the format template
            //get parsed without Exception, thus truncating trailing parts of the date (like the timezone). For this reason
            //I am forced to regurgitate the date string, compare it to the input, and if they don't match, skip forward in the loop.
            String regurgitated = dateFormat.format(date);
            if (regurgitated.length() != dateString.length()) { //compare lengths to make sure nothing got truncated
                if (log.isDebugEnabled()) log.debug(dateString + " was parsed with format " + entry.getValue() + " but was apparently parsed incorrectly as " + regurgitated + "; length difference was " + (regurgitated.length() - dateString.length()));
                continue; //try the next format
            }
            //The length of the regurgitated string matches the length of the input.
            //We still need to eat the regurgitated string, and compare the resulting ms since epoch to the Date we originally
            //parsed, to make sure we didn't encounter a SimpleDateFormat parsing bug.
            //Example: 2010-9-23 12:22:23.000 PST gets regurgitated as 2010-09-23 13:22:23.000 PDT, which is different text, but the same ms since epoch.
            //This illustrates why we cannot just compare the regurgitated text to the input dateString: Java may decide on an equivalent but
            //textually different representation from the input (PST is the same as PDT in the half of the year when daylight savings time isn't in effect).
            java.util.Date reparsed = dateFormat.parse(regurgitated);
            if (reparsed.getTime() != date.getTime()) {
                if (log.isDebugEnabled()) log.debug(dateString + " produces different ms since epoch than " + regurgitated);
                continue; //try the next format
            }
            if (log.isDebugEnabled()) log.debug("handled date " + dateString + " using format template: " + entry.getValue().toString() + " which regurgitated " + regurgitated);
            TimeZone timeZone = dateFormat.getTimeZone();
            //convert to GMT time and account for timezone
            return new java.sql.Timestamp(date.getTime() - timeZone.getOffset(date.getTime()));
        } catch (ParseException pe) {
            if (log.isDebugEnabled()) log.debug(dateString + " couldn't be parsed with format " + entry.getValue());
        }
    }
    throw Util.newApplicationException(id, "date " + dateString + " could not be parsed.", "DataFormatException", 130, paramName);
}

private static String zeroPadIsolatedDigits(String s) {
    //Java date parsing will prepend leading zeros to isolated digits. For this reason, I prepend a zero to the isolated
    //digits before parsing the string, so that the regurgitation step produces a string identical to the input.
    //There is an obscure case in which the matching regions overlap, and m.find() only finds one of the overlapping instances.
    //For instance, in 100-1-2, find() won't find "-2" because it overlaps "-1-". Consequently we have to repeatedly call find()
    //on a new matcher after each replacement has been made.
    boolean done = false;
    while (!done) {
        Matcher m = ISOLATED_DIGIT.matcher(s);
        StringBuffer buf = new StringBuffer();
        if (m.find()) {
            if (log.isDebugEnabled()) log.debug("found isolated digit:" + m.group(2));
            m.appendReplacement(buf, m.group(1) + "0" + m.group(2) + m.group(3)); //stuff back in the text before the isolated digit, then 0 (zero-pad), then the digit, then the text after
        } else {
            done = true;
        }
        m.appendTail(buf); //all isolated digits found so far have been replaced by zero-padded digits
        s = buf.toString(); //replace input with zero-padded input
    }
    return s;
}

Saturday, October 23, 2010

Data Storm

This morning's project was installing a Davis Vantage Pro weather station. It's really an amazing piece of equipment. It logs wind speed and direction, solar radiation, and dozens of other variables. It even has a rain collector. The data comes off the unit in real-time, and is transmitted to a nice LCD console that shows graphs, weather forecasts etc. I hope to log enough data on sun and wind in order to make a determination as to whether solar panels or wind power will be my best alternative energy source. I'm also interested in quantifying just how bizarre the microclimate of western San Francisco actually is. How many days is it foggy? What time does the fog typically roll in or out in a given month? With a bit of free time, I hope to be able to load the data into nextdb to create a detailed data history.

 I'm already getting some valuable siting data on windspeed. There are several interesting vertical-axis helical wind-power generators that might be appropriate. Hmmm.

Saturday, September 25, 2010

Government "Black Ice" Becomes a Reality

Keeping with the William Gibson theme, the future - whoops, I meant "the present" - looks increasingly like something dreamed up by said sci-fi author. This week we learned about STUXNET: government Black Ice designed to attack industrial control facilities, and in particular the Bushehr nuclear reactor. The sophisticated source code includes the ability to inject itself into programmable logic devices, hacking the actual hardware itself, relying on proprietary hardware secrets stolen from two Taiwanese manufacturers. Apparently too complex to have been written by a hacker in a dorm room, the media concludes that it must be "state sponsored". Maybe. Probably. But in a present that increasingly looks like the cyberscape described in Neuromancer, we shouldn't rule out multinational corporations as the source. As pointed out in the media, STUXNET *is* a *weapon*. Cyberspace is a battlefield where corporations, states, and individuals can wage war on a practically equal footing, with the ability to cloak their identity. Imagine that: anonymous war with motives known only to the participants. Target: a nuclear reactor.

Wednesday, September 01, 2010

Awesome Op-ed in New York Times by William Gibson

This is a fantastic Op-ed by William Gibson on Google.

Thursday, July 15, 2010

Neuromancer Vs. Matrix

Neuromancer was written by William Gibson and published in 1984. I first read it in high school back in 1993. Every few years I read it again, and each time I find something new to like.

In my most recent reading, I find myself noticing how much of the concept matter in The Matrix is based on Neuromancer. Gibson coined the terms "the matrix" and "cyberspace", describes a Rasta colony called "Zion" (albeit in orbit), and simstim memory sticks that jack in through a socket behind the ear. It's hard to imagine that Gibson wrote Neuromancer before there was a World Wide Web.

"The Matrix is ... a consensual hallucination. ... A graphic representation of data abstracted from the banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data. Like city lights, receding..."

This part reminds me of the "symbols" that fall down the screens in the movie The Matrix:

"And in the bloodlit dark behind his eyes, silver phosphenes boiling in from the edge of space, hypnagogic images jerking past like film compiled from random frames. Symbols, figures, faces, a blurred fragmented mandala of visual information". 

Wednesday, July 07, 2010

Android Earcon

Did you notice that Google's Android phones have a startup sound that's quite familiar... that is, if you are a fan of Blade Runner. The little "bootup jingle" is from the opening-scene soundtrack by Vangelis in Blade Runner. Listen here; it's at 1 minute 40 seconds.

Hadoop Summit

Last week I attended the Hadoop Summit. It was sold out ten days prior! That doesn't happen very often these days, in an environment where corporate travel is severely cut back. I even saw sizable contingents from Japan and Brazil. This is a very exciting time to be involved in the field of search and data processing. The science track was quite interesting, as was the talk by Facebook about how they are crunching something like 90 TB per week and are sitting on 85 petabytes. They are using Hadoop MapReduce in near real time, with processing latencies getting close to 1 minute. That's nearly real-time availability of all the information you post on Facebook, to everyone else (and every moving part) in the Facebook system. Oh, and the lunch was great.

Tuesday, July 06, 2010

badassery with bits: counting 1 bits

I was recently looking at some code to count the number of bits in an integer that are set to 1. Here is the method:

int n = getARandomInt();
int count = 0;
while (n != 0) {
    n &= n - 1;   // each pass clears the lowest set bit
    count++;
}
print("There are " + count + " bits turned on.");

Can you figure out why it works?

hint: negative one has every bit turned on, because negating a number means taking its two's complement: flip every bit, then add 1. Watch what happens with the carry bits when you add n to -1.
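To make the trick concrete, here is a small self-contained sketch (my addition; the class and sample values are mine, standing in for the hypothetical getARandomInt above) that traces the n &= n-1 loop on fixed inputs:

```java
public class BitCount {
    static int countOnes(int n) {
        int count = 0;
        while (n != 0) {
            n &= n - 1; // each pass clears the lowest set bit
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 12 is 1100 in binary: 12 & 11 = 1000, then 8 & 7 = 0 -> two passes.
        // -1 has all 32 bits set, so the loop runs 32 times.
        System.out.println(countOnes(12) + " " + countOnes(-1));
    }
}
```

Because the loop runs once per set bit rather than once per bit position, it is fast for sparse values.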

Sunday, June 13, 2010

Matrix Challenge

Here is an interesting challenge posted on the website of a company called Rapleaf:

We wish to find, for an arbitrary N, an N x N matrix with the following property: each entry in the matrix is the smallest non-negative number which does not appear either above the entry or to its left.
  1. Write a function that computes the entry in a particular row and column. That is, complete the following:

    int entry_at(int row, int column) {
        // ...
    }
  2. Can you write the entry_at function to run in constant space and constant time?
  3. Can you prove that your algorithm is correct?

 For instance, here is a 6x6

I posted a solution before, which was wrong, because I only looked at a 6x6, and my solution failed for larger matrices. Thanks to whoever posted the comment pointing out that my solution was broken. So I rewrote it in Java, after looking at some larger matrices. Here is a printout of the 32x32 matrix produced by the program.

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
 1  0  3  2  5  4  7  6  9  8 11 10 13 12 15 14 17 16 19 18 21 20 23 22 25 24 27 26 29 28 31 30 
 2  3  0  1  6  7  4  5 10 11  8  9 14 15 12 13 18 19 16 17 22 23 20 21 26 27 24 25 30 31 28 29 
 3  2  1  0  7  6  5  4 11 10  9  8 15 14 13 12 19 18 17 16 23 22 21 20 27 26 25 24 31 30 29 28 
 4  5  6  7  0  1  2  3 12 13 14 15  8  9 10 11 20 21 22 23 16 17 18 19 28 29 30 31 24 25 26 27 
 5  4  7  6  1  0  3  2 13 12 15 14  9  8 11 10 21 20 23 22 17 16 19 18 29 28 31 30 25 24 27 26 
 6  7  4  5  2  3  0  1 14 15 12 13 10 11  8  9 22 23 20 21 18 19 16 17 30 31 28 29 26 27 24 25 
 7  6  5  4  3  2  1  0 15 14 13 12 11 10  9  8 23 22 21 20 19 18 17 16 31 30 29 28 27 26 25 24 
 8  9 10 11 12 13 14 15  0  1  2  3  4  5  6  7 24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 
 9  8 11 10 13 12 15 14  1  0  3  2  5  4  7  6 25 24 27 26 29 28 31 30 17 16 19 18 21 20 23 22 
10 11  8  9 14 15 12 13  2  3  0  1  6  7  4  5 26 27 24 25 30 31 28 29 18 19 16 17 22 23 20 21 
11 10  9  8 15 14 13 12  3  2  1  0  7  6  5  4 27 26 25 24 31 30 29 28 19 18 17 16 23 22 21 20 
12 13 14 15  8  9 10 11  4  5  6  7  0  1  2  3 28 29 30 31 24 25 26 27 20 21 22 23 16 17 18 19 
13 12 15 14  9  8 11 10  5  4  7  6  1  0  3  2 29 28 31 30 25 24 27 26 21 20 23 22 17 16 19 18 
14 15 12 13 10 11  8  9  6  7  4  5  2  3  0  1 30 31 28 29 26 27 24 25 22 23 20 21 18 19 16 17 
15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 
17 16 19 18 21 20 23 22 25 24 27 26 29 28 31 30  1  0  3  2  5  4  7  6  9  8 11 10 13 12 15 14 
18 19 16 17 22 23 20 21 26 27 24 25 30 31 28 29  2  3  0  1  6  7  4  5 10 11  8  9 14 15 12 13 
19 18 17 16 23 22 21 20 27 26 25 24 31 30 29 28  3  2  1  0  7  6  5  4 11 10  9  8 15 14 13 12 
20 21 22 23 16 17 18 19 28 29 30 31 24 25 26 27  4  5  6  7  0  1  2  3 12 13 14 15  8  9 10 11 
21 20 23 22 17 16 19 18 29 28 31 30 25 24 27 26  5  4  7  6  1  0  3  2 13 12 15 14  9  8 11 10 
22 23 20 21 18 19 16 17 30 31 28 29 26 27 24 25  6  7  4  5  2  3  0  1 14 15 12 13 10 11  8  9 
23 22 21 20 19 18 17 16 31 30 29 28 27 26 25 24  7  6  5  4  3  2  1  0 15 14 13 12 11 10  9  8 
24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23  8  9 10 11 12 13 14 15  0  1  2  3  4  5  6  7 
25 24 27 26 29 28 31 30 17 16 19 18 21 20 23 22  9  8 11 10 13 12 15 14  1  0  3  2  5  4  7  6 
26 27 24 25 30 31 28 29 18 19 16 17 22 23 20 21 10 11  8  9 14 15 12 13  2  3  0  1  6  7  4  5 
27 26 25 24 31 30 29 28 19 18 17 16 23 22 21 20 11 10  9  8 15 14 13 12  3  2  1  0  7  6  5  4 
28 29 30 31 24 25 26 27 20 21 22 23 16 17 18 19 12 13 14 15  8  9 10 11  4  5  6  7  0  1  2  3 
29 28 31 30 25 24 27 26 21 20 23 22 17 16 19 18 13 12 15 14  9  8 11 10  5  4  7  6  1  0  3  2 
30 31 28 29 26 27 24 25 22 23 20 21 18 19 16 17 14 15 12 13 10 11  8  9  6  7  4  5  2  3  0  1 
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0 

The valAboveDiag method takes advantage of the fact that the matrix is symmetric about the main diagonal: for values below the diagonal, the row and column numbers are simply transposed. To compute an arbitrary value in an NxN matrix, the method executes in O(log N) time. However, if I always intended to print the entire matrix, I could have rewritten this to take advantage of values already stored in the matrix; in that case, adding a new value to the matrix would occur in constant time. The central idea behind the code below is that the matrix is built up of submatrices whose sizes are powers of 2 (e.g. 2x2, 4x4, 8x8, etc.), organized in the pattern [[A,B],[B,A]]. A given cell above the main diagonal falls in quadrant [0,0], [0,1], or [1,1]. Because of the pattern [[A,B],[B,A]], quadrant [0,0] is A, quadrant [0,1] is B, and [1,1] is A. Observing that B = A + mid, where "mid" is the value midway between 0 and the maximum value of the power-of-two block, we only need to be able to produce A to find any other value in the block. The process simply recurses until A is found to be the identity block [[0,1],[1,0]].

package random;

/**
 * @author geoffreyhendrey
 */
public class MadMatrix {
    int[][] a;

    public MadMatrix(int size){
        a = new int[size][size];
        for(int r=0;r<size;r++){
            for(int c=0;c<size;c++){
                a[r][c] = valAboveDiag(r,c);
            }
        }
    }

    public String toString(){
        StringBuilder buf = new StringBuilder();
        for(int[] row:a){
            for(int column:row){
                buf.append(String.format("%2d ", column));
            }
            buf.append("\n");
        }
        return buf.toString();
    }

    private int valAboveDiag(int r, int c) {
        if(r > c){          // below the diagonal: transpose row and column
            int tmp = c;
            c = r;
            r = tmp;
        }
        int order = getOrder(c);
        if(0 == order){
            if(r==c) return 0;
            return 1;
        }
        int mid = 1<<order;
        if(c >= mid && r < mid){
            return valAboveDiag(r, c-mid) + mid;
        }
        return valAboveDiag(r-mid, c-mid);
    }

    private static int getOrder(int c){
        int order = 0;
        while(c > 1){
            c >>>= 1;
            order++;
        }
        return order;
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        MadMatrix mm = new MadMatrix(32);
        System.out.print(mm);
    }
}

Sunday, April 04, 2010

AES Encryption in Java

I wrote this block of code a couple years ago, but it has served me well. I remember it being quite a challenge to get it to work properly. There was, and maybe still is, a dearth of decent examples on using AES encryption from Java. The examples I found didn't deal with Cipher Block Chaining (CBC). CBC is essential to decent encryption because it feeds the output of one ciphered block forward into the next, creating a cascade of randomness. All you need is an initial random phrase in the data you wish to encrypt. If you don't use CBC, you'll notice that blocks of your encrypted data can easily be recognized as separators in what you are encrypting. For example, let's say I encrypt the phrase "user:bob:password:abc". The substrings "user:" and "password:" will cause an identifiable pattern in the encrypted data. If someone has access to snoop the wire, they can literally lift the encrypted username and password out of someone else's transmission and paste it into their own. While CBC with a leading random phrase will make it impossible to identify such non-randomness, you still have to guard against someone injecting cruft into the encrypted data. This should be done by appending an MD5 hash of the unencrypted data, then encrypting the whole ball of wax.

    private static String encrypt(String encryptMe, byte[] cryptKey, SecretKey secretKey, byte[] iv){
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding", "SunJCE");
            cipher.init(Cipher.ENCRYPT_MODE, secretKey, new IvParameterSpec(iv));
            byte[] raw = encryptMe.getBytes("US-ASCII");
            if(log.isDebugEnabled())log.debug("unencrypted bytes: " + raw.length);
            byte[] cipherText = new byte[cipher.getOutputSize(raw.length)];
            int ctLength = cipher.update(raw, 0, raw.length, cipherText, 0);
            ctLength += cipher.doFinal(cipherText, ctLength);
            if(log.isDebugEnabled())log.debug("ctLength: " + ctLength);
            if(log.isDebugEnabled())log.debug("cipherText Length: " + cipherText.length);
            byte[] copy = new byte[ctLength];
            System.arraycopy(cipherText, 0, copy, 0, ctLength);
            return new String(new Base64().encode(copy), "US-ASCII");
        } catch (Exception ex) {
            throw new RuntimeException(ex.getMessage(), ex);
        }
    }
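For a self-contained illustration of the same AES/CBC mode, here is a full encrypt/decrypt round trip (my sketch, using only standard JDK classes; it omits the cryptKey parameter and the commons-codec Base64 that the method above relies on):

```java
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class AesCbcDemo {
    public static void main(String[] args) throws Exception {
        // Fresh 128-bit AES key and a random 16-byte IV per message.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        // Encrypt with CBC and PKCS5 padding, then Base64-encode.
        Cipher enc = Cipher.getInstance("AES/CBC/PKCS5Padding");
        enc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        String cipherText = Base64.getEncoder().encodeToString(
                enc.doFinal("user:bob:password:abc".getBytes("US-ASCII")));

        // Decrypt with the same key and IV to recover the plaintext.
        Cipher dec = Cipher.getInstance("AES/CBC/PKCS5Padding");
        dec.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        String plain = new String(
                dec.doFinal(Base64.getDecoder().decode(cipherText)), "US-ASCII");
        System.out.println(plain);
    }
}
```

Note the IV plays the role of the "leading random phrase": reusing it across messages reintroduces the recognizable-prefix problem described above.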

Friday, April 02, 2010

what's in your history?

[ghendrey@geoffunix hadoop]$ cut -f1 -d" " ~ghendrey/.bash_history | sort | uniq -c | sort -nr | head -n 20
    206 hadoop
     27 ls
     23 cd
     18 su
     17 grep
     15 tac
     11 pwd
     10 exit
      8 env
      5 less
      4 /mnt/hadoop/hbase-0.20.3/bin/hbase
      4 man
      4 ifconfig
      3 telnet
      3 start-dfs.sh
      3 source
      3 mount
      3 ant
      2 vi

Tuesday, February 23, 2010

Webscale Computing and Hadoop

I've been using Hadoop extensively for a month or so now at the office. I've long been a big fan of Doug Cutting's Lucene technology, so when I heard that Doug Cutting was the guy behind Hadoop, that pretty much pushed me over the edge, and I started using Hadoop for building new types of database indexes from multi-gigabyte datasets.

Hadoop is an incredibly rich software stack consisting of three main pieces:
  1. The Hadoop Distributed File System (HDFS)
  2. A Map/Reduce infrastructure consisting of Daemons that layer services on top of HDFS
  3. A set of APIs allowing a data processing task to be split into simple Map and Reduce phases
There's a fourth piece, which is the many related tools (HBase, Hive, Pig, etc) which are layered on top of Hadoop.

If one were to read the description of map/reduce, one might conclude that it is pretty much nonsense. In fact, it sounds too trivial to even be called an algorithm. Put things into groups, operate on the groups. Big deal. It sounds pretty much like common sense. Until you work with Hadoop, you really cannot appreciate all of the benefits that the Hadoop stack brings. It's really the collective benefits of the entire stack that make for the game-changing experience that programming with Hadoop really is.
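To see just how plainly "put things into groups, operate on the groups" can be stated, here is a toy word count in plain Java streams (my sketch; no Hadoop involved, and the class and sample lines are invented): flatMap plays the map phase, groupingBy plays the shuffle, and counting plays the reduce.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ToyMapReduce {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat", "the hat", "cat");

        // map: split lines into words; shuffle: group equal words; reduce: count
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts.get("the") + " " + counts.get("cat") + " " + counts.get("hat"));
    }
}
```

The algorithm really is that trivial; what Hadoop adds is running the same two phases across thousands of machines with replication, scheduling, and failure handling.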

With problems of processing huge datasets, the devil is in the details. Hadoop provides a framework that removes all of the annoying complexity associated with big datasets. In essence, you write two simple methods, and from that point on you don't care much whether you are operating on 1 byte or 1 TB. This is possible in large part because HDFS allows you to think of terabytes of data as a simple URL such as hdfs://namenode/user/geoff/BigDataSet. HDFS takes care of replicating your data and figuring out where the data blocks actually reside in the cluster. As an added bonus, Hadoop automatically deals with nuances such as the files being zipped. But wait, there's more: from the Hadoop command line, you can run commands to cat and grep these URLs, again acting as if they were simple local files.

For me, one of the most interesting side effects of running a hadoop cluster has been how it changes ones ability to interact with peers on a webscale computing project. You can now shoot your co-worker a URL to a 200GB file, and they can poke and prod at the contents by doing things like "hadoop fs -text hdfs://namenode/user/geoff/BigData".

I'll try to write more about this later, but for now suffice it to say that Hadoop is quite exciting to work with. I think the criticisms by Michael Stonebraker have totally missed the point of what a good implementation of a map/reduce framework can yield, probably because the criticism focused on the map/reduce algorithm, which in and of itself is trivial. And that's really the point: it's all about the tools and the entire stack that make map/reduce simply a "part of this complete breakfast" when it comes to Hadoop. So don't forget the toast, juice, and yummy Cheerios!

Sunday, February 21, 2010

HATEOAS - Hypermedia As The Engine Of Application State

Roy Fielding's PhD dissertation, Architectural Styles and the Design of Network-based Software Architectures, is the seminal paper on the RESTful style of web architecture. Before discussing the paper, let's first discuss what it is, because its approach seems quite unusual for a PhD dissertation. Reading this paper in the 21st century, one might conclude that it is an "experience paper" (as defined here). However, as noted, "An experience paper that merely confirms that which is well known as opposed to that which is widely believed is of little value." Yet at the time the dissertation was written, its topic was probably not widely known. The irony now is that as the term REST becomes more and more commonplace, and is applied to non-RESTful systems, the level of confusion over REST might be rising, not falling.

I was certainly confused for a long time about what constitutes a RESTful system. Isn't anything that has a URL RESTful? Definitely not: the presence of a resource identifier is just one of several properties that make a system RESTful. Fielding has recently written this article listing the properties he sees as essential to a RESTful system. After working quite a bit on the RESTful system for NextDB.net, it seems to me that the principle of Hypermedia As The Engine Of Application State (HATEOAS), identified in section 5.1.5 of the original dissertation, might be the most important principle of all. Incidentally, of all the properties outlined in the recent article, Fielding chose to title the article "REST APIs must be hypertext-driven". To my mind, that confirms my hunch that HATEOAS is perhaps the least understood principle of REST, and the principle in need of the most discussion and explanation.

The essence of HATEOAS is that the client needs to understand as little as possible. In essence, this is quite similar to the experience a human has when browsing a website. In order to move from one page to the next, we must be given links that we can navigate. This is our common experience on web pages. Consider an alternative, and less user-friendly, scenario in which the home page simply consisted of a set of instructions (an API, if you will). One instruction might read: "To view articles about sports, visit a URL consisting of http://sitename/topics?name=topic-name and replace topic-name with 'SPORTS'". You would be forced to manually construct the URL and enter it into the browser. The notion of such a website seems absurd, and yet it describes the scenario common to RPC or "fake REST" APIs, in which the client is responsible for constructing the state-transition links rather than being handed a hypermedia document that provides the available state transitions.

Quoting Fielding: "...all application state transitions must be driven by client selection of server-provided choices that are present in the received representations or implied by the user’s manipulation of those representations".
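As a toy illustration of that quote (my sketch; the class and page markup are invented for illustration): the only knowledge a HATEOAS-style client needs is how to find the links inside the representation the server already handed it. It never assembles a URL from a template.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FollowLinks {
    // Extract the server-provided state transitions (href values) from a
    // hypermedia representation. The client constructs no URLs of its own.
    static List<String> linksIn(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"/lines?page=2\">next page</a> "
                    + "<a href=\"/lines?sort=asc\">sort ascending</a>";
        System.out.println(linksIn(page));
    }
}
```

A real client would of course use an HTML parser rather than a regex, but the point stands: every transition it can take was present in the received representation.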

This is an interesting article that shows how non-HTML hypermedia can be used in keeping with HATEOAS. However, I think there are excellent reasons why Plain Old HTML (POH) is the best hypermedia choice for use with RESTful systems:

  1. HTML was made for hypermedia (it has well known structures for encoding links)
  2.  Your mom knows what HTML is
  3. By using POH for HATEOAS, your RESTful system's content can be indexed by search engines
  4. You can easily view and traverse your system states using a web browser
POH for HATEOAS is the approach taken by NextDB's REST system. In the documentation, we do describe the explicit structure of several of the URLs, but we stick strongly to the principle outlined by Fielding that, beyond the entry-point URL, or "bookmark", you shouldn't need to know the structure of any of the URLs. Using POH for HATEOAS, the proof of NextDB's adherence to this principle is quite straightforward: I simply give you an entry-point URL (a bookmark) to a table and allow you to navigate the links served in the POH. For example, here is the bookmark to an account named 'geoff' with a database named 'testchars' and a table named 'lines':

From the URL above you can click on links to visit various pages of the content and sort the content. No knowledge of an API is required. Taken to its logical conclusion, I see no reason why the RESTful service should not be allowed to directly present styled content, when the consumer of the content is human. So that is exactly what we do in NextDB, by allowing the POH to include CSS. Here is an example:


We're technically violating HATEOAS because we don't provide links to the styled content; rather, the developer has to know about the available styles by reading our documentation. However, we'll soon correct this, as well as allow the developer to pass in links to his own CSS for inclusion in the returned POH.

Finally, not only do we allow you to style the POH, but we also allow you to apply XSLT transformations to the POH in order to alter its structure (as opposed to its style). Fielding discusses "processing elements", or "transformers of data", in his thesis. I believe XSLT to be the POH analog of processing elements, in that the transformations are well understood, easily encapsulated in markup, and supported by a wide array of processors (including browser-side processing, although this tends to be less portable, which is why NextDB opted to perform the transformation on the server).

In summary, NextDB is a truly RESTful database. This is important not because it conforms to a buzzword, but because it has made NextDB so easy to use that even non-programmers are able to embed the POH hypermedia in their sites. Project Have Hope is a great example of one such site: the "catalogs" of sponsorable children and women in Africa are POH served out of NextDB.net. The HATEOAS architecture is the key to opening the database content to CMS-engine-driven sites, as well as to web search engines.