Tuesday, October 13, 2015

Netbeans for C++

I've been using Netbeans for C++ development. One of the big issues I had was getting the Netbeans Code Assistance feature to work properly. Meaning, I want:

  • syntax highlighting
  • code completion
  • refactoring
  • call graphs
In short, everything I expect from Netbeans when using Java. I am dealing with a large CMake-based project with existing sources. So I build the project manually, and CMake emits the Makefiles. Now I have a project that Netbeans can open (an "existing Makefile based project"). The next thing you need to do is set your include directories (shown below).

Now this is where things get weird. Despite having set the includes several times, none of the Code Assistance features worked. I tried deleting the NB cache, restarting, etc, etc. Nothing worked. If I right-click on the project and select Code Assistance, I can dump the diagnostics. Notice how the user include paths that I entered ARE NOT PRESENT? WTF?


 from project main [/Users/ghendrey/git/main]
Lang=CPP Flavor=CPP excluded=false
User Include Paths:
User Macros:
System Include Paths:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1
/usr/local/include
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/6.0/include
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include
/usr/include
/System/Library/Frameworks
/Library/Frameworks



But, my user includes are not showing up! Finally, I tried Project > Properties > General and added my "src" folder to the "Project Source Folders". Now this seems like a bit of a "duh", but it wasn't obvious, since "." was already listed under project source folders, which presumably grabs everything. Furthermore, once I save the properties and close the dialog, the "src" folder disappears and only "." is shown. HOWEVER, it seems that adding "src" somehow kick-started the Code Assistance. Now I see the correct user includes when I dump the diagnostics:

 from project main [/Users/ghendrey/git/main]

Lang=CPP Flavor=CPP excluded=false
User Include Paths:
/Users/ghendrey/splunk/current/include
/Users/ghendrey/git/main/src
/Users/ghendrey/git/main/src/libzero
/Users/ghendrey/git/main/src/util
User Macros:
System Include Paths:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1
/usr/local/include
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/6.0/include
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include
/usr/include
/System/Library/Frameworks
/Library/Frameworks

And lo and behold, all of code completion is working. Incidentally, the Code Assistance right-click menu is often broken (several of the items become unselectable until I close the project and restart Netbeans). So, if you can deal with these annoyances, once you (finally) get up and running with Code Assistance in Netbeans, it is a way, way better C++ development environment than Eclipse.



Friday, October 09, 2015

Why oh why did I just waste the day on this shit?

I just missed tucking my daughter in because I've been trying to wrangle Jersey's Moxy into working properly. Jersey guys: you have done great work, but this Moxy thing is a PIECE OF SHIT. I'm sorry, but I am tired and grumpy. I have used Jackson so successfully with Jersey in the past, but I wanted to believe your docs on Moxy. I wanted to believe it would "just work". Boy...moxy sucks. I should have just stuck with Jackson. I'm so pissed off I need to write a silly blog post to calm down. What really pisses me off, is every single thing I was trying to make work with Moxy, just immediately worked when I switched to Jackson.

Thank god for Stack Overflow: http://stackoverflow.com/questions/29322605/how-to-return-a-json-object-from-a-hashmap-with-moxy-and-jersey

I should have listened to the answer better. Apparently the answer to how to get Moxy to work is not to use Moxy: "For this reason, and actually many others I have encountered in the past, I do no recommend using MOXy even though it is recommended by Jersey. Instead, use Jackson. Jackson has been the defacto Java goto for JSON processing for a while. For Jackson, just use this dependence".

Another annoyance in this whole fiasco has been that the interwebs are littered with descriptions of how to use ResourceConfig...but there are multiple versions of ResourceConfig. What the fuck?

https://jersey.java.net/apidocs/2.4.1/jersey/org/glassfish/jersey/server/ResourceConfig.html
https://jersey.java.net/apidocs/1.1.5/jersey/com/sun/jersey/api/core/ResourceConfig.html

They have different interfaces. So you'll find a snippet like this, which pertains to the old 1.1.5 API, thus totally confusing the fuck out of you: http://stackoverflow.com/questions/9490123/how-do-i-enable-pojo-mapping-programatically-in-jersey-using-grizzly2
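For reference, the Jersey 2.x way to wire in Jackson looks roughly like this. This is a sketch, not necessarily my exact config: the resource package name is a placeholder, and JacksonFeature comes from the jersey-media-json-jackson module.

import org.glassfish.jersey.jackson.JacksonFeature;
import org.glassfish.jersey.server.ResourceConfig;

public class MyApplication extends ResourceConfig {
    public MyApplication() {
        packages("com.example.resources"); // placeholder: wherever your JAX-RS resources live
        register(JacksonFeature.class);    // JSON via Jackson, no MOXy anywhere in sight
    }
}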

Moxy blows. Jackson rules. Punk is out. Rap is in.

ActiveMQ: Your "Hello World" really should work...

I just downloaded ActiveMQ and started running the basic Publisher/Listener example. Now, about the first thing I would expect from this example is that the count of received messages would be accurate. But alas, it seems all too common for Hello World not to work. Sure enough, the Publisher publishes 10000 messages, and the Listener reports "Received 10001 in 1.66 seconds". One extra message! WTF. Ahh, I see, your loop count is broken. Really?
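For contrast, here's the kind of counting I'd expect, sketched against the plain JMS API with hypothetical names (this is not ActiveMQ's actual example code): the terminating control message gets reported but not counted.

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

public class CountingListener implements MessageListener {
    private int count = 0;
    private final long start = System.currentTimeMillis();

    @Override
    public void onMessage(Message message) {
        try {
            if (message instanceof TextMessage && "END".equals(((TextMessage) message).getText())) {
                // the terminating control message marks the end of the run: report it, but don't count it
                double secs = (System.currentTimeMillis() - start) / 1000.0;
                System.out.println("Received " + count + " in " + secs + " seconds");
            } else {
                count++;
            }
        } catch (JMSException e) {
            e.printStackTrace();
        }
    }
}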


Monday, August 10, 2015

Debugging deadlocks with Java's ReentrantLock


I recently experienced a situation where my program just seemed to "stop". My first instinct was to use the Netbeans "check for deadlock" feature. Unfortunately, Netbeans told me "no deadlock found". I still thought it was a deadlock. The app created dozens of threads, and the Netbeans debugger has a really crummy bug where the vertical slider repositions the threads viewport every time you pause a thread. This made it very difficult to see what was going on, but after a lot of clicking around I was able to determine that there was a thread waiting on a ReentrantLock. Every time I un-suspended the thread, it stayed at exactly the same place. Obviously someone was holding the lock. With the debugger, I could see a field inside the ReentrantLock called "exclusiveHolder" (or something like that), and by looking at the char array for the field, I could see the name of the other thread. Indeed, this other thread was sitting trying to enter a synchronized method. From the call stack of the first thread I could see that the first thread had already entered the synchronized method. So it was a deadlock. To pin it down with certainty, I added this debug statement before attempting lock() on the lock instance:

log.warn("attempting increment LOCK: " + lock +" w/ queue len  "+lock.getQueueLength());

 I also added this log4j configuration so that I could see the calling thread.


Fortunately, the toString method of ReentrantLock also prints the current owner of the lock.

# Set root logger level to DEBUG and its only appender to A1.
log4j.rootLogger=WARN, A1

# A1 is set to be a ConsoleAppender.
log4j.appender.A1=org.apache.log4j.ConsoleAppender

# A1 uses PatternLayout.
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-4r [%t] [line=%L] [%M] %-5p %c %x - %m%n

The "%t" will show the calling thread. 




Thursday, July 16, 2015

Akka Respond When Ready Pattern (pause processing of messages)?

Perhaps someone has already solved this, or perhaps what I've done is so simple that it's just common sense. I am too new to Akka to know for sure. So here is the scenario:

  • An Actor we'll call Server needs to bootstrap itself. This takes some time.
  • Meanwhile, an Actor we can call Client begins to send Request messages to the server
  • ...but the server isn't ready yet.
The punchline: put long-running initialization in preStart(), since messages will happily collect in the mailbox while the actor starts up. http://stackoverflow.com/questions/17061740/how-to-deal-with-long-initialization-of-an-akka-child-actor
So, let's look at the more general case of what to do if an actor becomes unavailable for some reasonable period of time, not just at initialization. Here are the options I considered:
  • Server can discard messages any time it isn't available to respond (such as the bootstrap scenario)
    • Requires clients to wait a certain amount of time for a timeout, then resend
    • Requires the client to keep track of / queue messages
  • Server can buffer messages when it is unable to respond, then handle them when ready
I like the last solution the best. I browsed the Akka docs on the thought that maybe they had something that does this. Lo and behold: ReliableProxy is already available from Akka. At first I thought this was what I needed.
But I immediately realized that the message from E to B in the picture above suffers exactly the same problem as the message from A to B directly. So this is of no use. Therefore I implemented server-side buffering of messages during the times the server is unable to respond. When it is able to respond, it dequeues the messages and handles them. This is better than any option in which the client needs to take server "sleepiness" into consideration. I came across this Stack Overflow topic, which affirms my design choice: http://stackoverflow.com/questions/9602322/akka-2-how-to-pause-processing-of-messages
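Here's a minimal sketch of the server-side buffering, using Akka's Java API and hypothetical message types. (Akka's built-in Stash / UntypedActorWithStash gives you essentially the same behavior with stash() and unstashAll(), which is probably the cleaner way to do it.)

import java.util.ArrayDeque;
import java.util.Queue;
import akka.actor.ActorRef;
import akka.actor.UntypedActor;

public class Server extends UntypedActor {

    // hypothetical marker the server sends itself when bootstrapping finishes
    public static final class Ready {}

    private static final class Buffered {
        final Object msg; final ActorRef sender;
        Buffered(Object msg, ActorRef sender) { this.msg = msg; this.sender = sender; }
    }

    private boolean ready = false;
    private final Queue<Buffered> pending = new ArrayDeque<Buffered>();

    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof Ready) {
            ready = true;
            while (!pending.isEmpty()) {                     // drain whatever piled up while we were busy
                Buffered b = pending.poll();
                handleRequest(b.msg, b.sender);
            }
        } else if (!ready) {
            pending.add(new Buffered(message, getSender())); // remember who asked, so we can reply later
        } else {
            handleRequest(message, getSender());
        }
    }

    private void handleRequest(Object request, ActorRef replyTo) {
        // ... do the real work, then replyTo.tell(result, getSelf()) ...
    }
}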

Friday, June 26, 2015

First reaction to Akka

Akka looks awesome. But never have I seen a more overdocumented framework. Every piece of docs is an exposition. Way too hard to find hello world. Most people today don't read docs top to bottom. We google "build akka" ...and we find this: http://doc.akka.io/docs/akka/snapshot/dev/building-akka.html

Any mention here of Maven being supported? Not that I saw. So I spent a while thinking I had to learn SBT. Then I found this page, which right at the top shows how to build with maven:

http://doc.akka.io/docs/akka/snapshot/intro/getting-started.html

But...WTF...the pom.xml they ship with doesn't include a connection to their repo! Have to add this:
Next we try to run some akka bins:

ghendrey-mbp:akka-2.3.11 ghendrey$ ./bin/akka-cluster
-bash: ./bin/akka-cluster: /bin/bash^M: bad interpreter: No such file or directory

(yup, every bin script is borked with windows line returns and won't run)

LOL.

So how about we pop off the stack and try another approach: http://doc.akka.io/docs/akka/2.0.1/intro/getting-started-first-java.html
...Follow the instructions...

ghendrey-mbp:NetBeansProjects ghendrey$ cd akka/akka-tutorials/akka-tutorial-first
-bash: cd: akka/akka-tutorials/akka-tutorial-first: No such file or directory
 
So after fixing the line breaks, fixing pom.xml, finding the right docs, finally hello world runs. Yeah. Now I try a more complex example.
Does it work?? NOPE. Exception spray:

 
 

Netbeans for c++

I have abandoned Eclipse for C++ development on Mac. Its "indexing" of large code bases was insanely slow (hours!!), and every now and then it would maddeningly begin redlining things again. I spent too many hours on it. It also can't debug with the current version of GDB. I spent so many hours in hell compiling ancient versions of GDB. Ultimately I abandoned ship. The Eclipse UI was also vastly inferior to that of Netbeans. Just one of the irks is that the dropdown for open files doesn't display the files alphabetically. That's absurd. And the SWT look and feel, despite all the claims, isn't any faster than the Netbeans Swing-based UI. Does it look better? Nope. The icons in Eclipse are outdated and the whole thing feels old school.

So... more or less happily back to Netbeans. It has only one problem for C++ development. For some reason all methods on STL classes, such as vector::push_back, cannot be resolved and are redlined. I am willing to live with this since call graphs etc. are all very fast. The same debugging issue exists on Netbeans as on Eclipse, but on Netbeans I found this workaround: if I run GDB server, then using a Netbeans plugin I can connect to GDB server and debug. The core problem you hit in both NB and Eclipse is that the debugger will freeze on the first line you break at (not for any trivial hello world, of course, but on every real project where I actually need it to work; lots of other people have run into it).

I speculate that it is some kind of race condition. By using GDB server, the connection to the breakpoint avoids the race.

Wednesday, November 19, 2014

Getting rid of redlines in eclipse c++ with Makefile project

I created a new Eclipse project for C++, using an existing Makefile. The project sources built without error when I hit build (it ran the Makefile, no errors). But Eclipse listed hundreds of errors in its "errors" panel and my code was covered in redlines. None of the standard suggestions worked, such as adding /usr/include to Paths and Symbols. Finally I discovered the Preprocessor Include Paths dialog. Click on the Providers tabby-thing, and select the radio buttons shown below.
The other thing I realized is that if you have a large code base, the Eclipse code indexing simply craps out, usually with an OutOfMemoryException. We have 1500 C++ files in our code base and thousands of headers. The indexer would hang Eclipse every time. Therefore, under Paths And Symbols > Source Location, I select only the current source directory I am working on, vs. selecting the entire source root. No more redlines. Yeah.

Tuesday, May 06, 2014

A new post after a long hiatus

It's been years since I posted on this blog. The reason I stopped is that Google was erroneously flagging this blog as a security risk. This happened because I was posting cross-domain AJAX code and widgets directly into my blog and sidebar, using nextdb.net. Even after methodically removing them all (which ruined some cool elements of my blog), Google still refused to correct the classification of this blog. This really irked me and I said "to heck with blogger.com". After many years I took another look at my dormant blog, and I see Google is no longer flagging it. Well, maybe I will start posting again here.

Wednesday, August 10, 2011

welcome...to...big...data


The following is excerpted from the hbase-user@apache.org mailing list...an instant classic!
=================================================================
Mongodb does an excellent job at single node scalability - they use mmap and many smart things and really kick ass ... ON A SINGLE NODE.

That single node must have raid (raid it going out of fashion btw), and you wont be able to scale without resorting to:
- replication (complex setup!)
- sharding

mongo claims to help on the last item, but it is still a risk point.

For really large data that must span multiple machines, there is no "clustered sql" type solution that isnt (a) borked in various ways (Oracle RAC I'm looking at you) or (b) stupid expensive (Oracle RAC, STILL looking at you)

Tools like HBase give you scalability at the cost of features (no automated secondary indexing, no query language).

Welcome... to... big... data.

-ryan

On Thu, Aug 11, 2011 at 12:44 AM, Edward Capriolo <> wrote:
> On Wed, Aug 10, 2011 at 4:26 PM, Li Pi <> wrote:
>> You'll have to build your own secondary indexes for now.
>> 
>> On Wed, Aug 10, 2011 at 1:15 PM, Laurent Hatier
>>
>> >wrote:
>> 
>> > Yes, i have heard this index but is it available on hbase 0.90.3 ?
>> >
>> > 2011/8/10 Chris Tarnas <>
>> >
>> > > Hi Laurent,
>> > >
>> > > Without more details on your schema and how you are finding that
>> > > number
>> > in
>> > > your table it is impossible to fully answer the question. I
>> > > suspect
>> what
>> > you
>> > > are seeing is mongo's native support for secondary indexes. If
>> > > you were
>> > to
>> > > add secondary indexes in HBase then retrieving that row should be
>> > > on
>> the
>> > > order of 3-30ms. If that is you main query method then you could
>> > reorganize
>> > > your table to make that long number your row key, then you would
>> > > get
>> even
>> > > faster reads.
>> > >
>> > > -chris
>> > >
>> > >
>> > > On Aug 10, 2011, at 10:02 AM, Laurent Hatier wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > I would like to know why MongoDB is faster than HBase to select
>> items.
>> > > > I explain my case :
>> > > > I've inserted 4'000'000 lines into HBase and MongoDB and i must
>> > calculate
>> > > > the geolocation with the IP. I calculate a Long number with the
>> > > > IP
>> and
>> > i
>> > > go
>> > > > to find it into the 4'000'000 lines.
>> > > > it's take 5 ms to select the right row with Mongo instead of
>> > > > HBase
>> > takes
>> > > 5
>> > > > seconds.
>> > > > I think that the reason is the method : cur.limit(1) with
>> > > > MongoDB but
>> > is
>> > > > there no function like this with HBase ?
>> > > >
>> > > > --
>> > > > Laurent HATIER
>> > > > Étudiant en 2e année du Cycle Ingénieur à l'EISTI
>> > >
>> > >
>> >
>> >
>> > --
>> > Laurent HATIER
>> > Étudiant en 2e année du Cycle Ingénieur à l'EISTI
>> >
>> 

Sunday, January 23, 2011

HBase 0.89 Bulk Import capability

I recently posted about my experience tuning HBase for mass inserts. When I learned that HBase actually supports a way to *avoid* the insert altogether, I got really excited. As the Joshua Program says, "the only way to win the game is not to play"! What better way to speed up inserts than not to do them at all! You can't get much faster than zero. So how do you get data into HBase if you don't do inserts? Aha! You can actually generate HFileOutputFormat files that essentially *are* database tables. And believe me, compared to running Puts from your reducer it is fast, and you don't have the "attack my database and maybe crash it" issue. To give you an idea of how fast it is, I had a job whose reducers ran for 45 minutes doing Puts. The bulk loader took 19 seconds to load the files, and it only took a couple minutes to generate the files from the reducer. So the whole bulk load process, end to end, was maybe 2 minutes.

The main "gotchas" along the way: The documentation isn't entirely clear about the fact that if you use HFileOutputFormat.configureIncrementalLoad, that your reducer is silently replaced by an internal HBase class that insures total ordering. Secondly, your mapper must output either Put or KeyValue. This is not a usable way to actually write mapreduce jobs (since you lose the ability to pass domain objects between map and reduce). Therefore, I found it best to alter my existing mapreduce jobs to produce a SequenceFileOutputFormat, in which I wrote Put from my reducer as the output class (since it doesn't interfere with your job logic/classes at all to change only the *final* output class to a Put). In fact, the class I wrote to generate the HFileOutputFormat will actually accept sequence files as input if they contain either a Put, or a single column value. I included support for a single column value just 'cause I had a couple jobs lying around that produced only a single column, and would have been a hassle to convert them to producing Puts. Then my generic bulk loader mapreduce job can be run on any of my output sequence files.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BulkLoader extends Configured implements Tool{


    public static class BulkLoaderMapper extends Mapper<WritableComparable, Writable, ImmutableBytesWritable, Put> {

        protected long version;
        protected String column;
        protected String family;

        @Override
        protected void setup(Mapper.Context context) throws IOException, InterruptedException {
            this.version = context.getConfiguration().getLong("version", System.currentTimeMillis());
            this.column = context.getConfiguration().get("column");
            this.family= context.getConfiguration().get("family");
        }


        private byte[] marshal(Writable o) throws IOException{
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            o.write(dos);
            return bos.toByteArray();
        }

        protected byte[] getHBaseRowKeyBytes(Writable k) throws IOException{
            return marshal(k);
        }

        protected byte[] getHBaseFamilyNameBytes(){
            return Bytes.toBytes(family);
        }

        protected byte[] getHBaseColumnNameBytes(){
            return Bytes.toBytes(column);
        }

        protected byte[] getHBaseValueBytes(Writable v) throws IOException{
            return marshal(v);
        }

        @Override
        protected void map(WritableComparable key, Writable val, Context context) throws IOException, InterruptedException {
            //KeyValue kv = new KeyValue(getHBaseRowKeyBytes(key), getHBaseFamilyNameBytes(), getHBaseColumnNameBytes(), version, getHBaseValueBytes());
            if(context.getConfiguration().get("input-format", "column").equals("put")){ //the serialized file actually is full of puts
                System.out.println(key.toString());
                Put p = (Put) val;
                //I am baffled as to why I cannot use getHBaseRowKeyBytes(key) here. It seems to not produce the correct key.
                //Therefore I m using p.getRow() to get the rowkey bytes.
                context.write(new ImmutableBytesWritable(p.getRow()), p);
            }else{
                Put put = new Put(getHBaseRowKeyBytes(key));
                put.add(getHBaseFamilyNameBytes(), getHBaseColumnNameBytes(), version, getHBaseValueBytes(val));
                context.write(new ImmutableBytesWritable(getHBaseRowKeyBytes(key)), put);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new BulkLoader(), args);
        System.exit(exitCode);

    }

    @Override
    public int run(String[] otherArgs) throws Exception {

        Configuration config = (Configuration) getConf();//new Configuration();
        Job job = new Job(config);

        //job.setOutputKeyClass(ImmutableBytesWritable.class);
        //job.setOutputValueClass(Put.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setMapperClass(BulkLoaderMapper.class);

        job.setJarByClass(BulkLoader.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        System.out.println("sequencefile-input: " + config.get("sequencefile-input"));
        System.out.println("output-path: " + config.get("output-path"));
        System.out.println("bulk-load-table: " + config.get("bulk-load-table"));
        System.out.println("family: " + config.get("family"));
        System.out.println("column: " + config.get("column"));
        System.out.println("input-format: " + config.get("input-format", "column"));
        SequenceFileInputFormat.addInputPath(job, new Path(config.get("sequencefile-input")));
        job.setOutputFormatClass(HFileOutputFormat.class);
        Configuration hConfig = HBaseConfiguration.create(config);
        hConfig.setLong("version", System.currentTimeMillis());
        hConfig.set("hbase.zookeeper.quorum", config.get("zk", "zk1.x.com, zk2.x.xom"));
        job.setJobName("Bulk Loading table: " + hConfig.get("bulk-load-table","YOU HAVE NOT SET bulk-load-table PARAMETER"));
        HFileOutputFormat.setOutputPath(job, new Path(config.get("output-path")));
        HFileOutputFormat.configureIncrementalLoad(job, new HTable(hConfig, config.get("bulk-load-table")));


        job.waitForCompletion(true);

        return 0; //normal exit

    }
}

Finally, the 0.89 jar seemed to have some packaging problems. I had to create a "lib" directory (used by Hadoop in job jars) and rejar the 0.89 jar with guava.jar and perhaps a couple other missing dependencies. With that done I was able to run the bulk loader with the completebulkload option, using this ant command:

Saturday, January 08, 2011

Facebook announces websearch to rival Google's



Admittedly I've made up the headline "Facebook announces websearch to rival Google's", but is it really such an outlandish proposition? It's certainly attention-grabbing, and why not? Google's web search has basically remained the same product for 10 years. Any engineer at Google would debate this, but I doubt anyone would debate that the engineers at Facebook have achieved a stunning mastery over big data sets and management of the social network.
Based on the considerable "Big Data" and real-time data expertise at Facebook, would anyone really be surprised if such a headline surfaced? I'd find it quite a believable story: "Facebook announces intention to best Google at web search." It's hard to overstate the earthquake that such an announcement would send rumbling through Silicon Valley, and the entire world. In fact, what better way for Facebook to spend their $500 million in Goldman Sachs cash than acquiring promising engineers and technologies to directly attack Google's biggest Web cash cows: search and maps. Bloodying the nose of the competition by competing with their core value has always been Google's way: if an operating system is of value, Google builds an OS. If mobile computing is of value, Google builds a mobile OS, and its own phone hardware. The day is going to come when a Google rival will compete with Google on Google's home turf: web search.

Tuesday, January 04, 2011

Minhashing is reaaally cool

What is the Jaccard Coefficient? Answer: one of the most important concepts in information retrieval and the theory of sets. The Jaccard Coefficient is very simple. It is the measure of the fractional similarity of two sets. Let's take a simple example: {A,B,C,D,E,F} vs. {B,E,F}. The fractional similarity of these sets is 3/6. Easily you can see that 3 elements are shared (the intersection is {B,E,F}), whereas the total number of items in both sets (the union) is 6 elements ({A,B,C,D,E,F}).

So without any mumbo-jumbo, we can see the Jaccard Coefficient (JC) is just the ratio of sameness (intersection) over the total amount of stuff present. Technically it's the cardinality (count) of elements in the intersection over cardinality (count) of elements in the union. So while the JC between {A,B,C,D,E,F} and {B,E,F} is 50%, we can see that the JC between {A,B,C,D,E,F,G,H,I} and {B,E,F} is only 33.33%. This illustrates the importance of the total amount of stuff present. The more irrelevant (non-intersection) stuff is present, the lower the Jaccard Coefficient.
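In code the definition is a couple of lines (a quick sketch with java.util sets):

import java.util.HashSet;
import java.util.Set;

public class Jaccard {
    // |intersection| / |union|, e.g. {A,B,C,D,E,F} vs {B,E,F} -> 3/6 = 0.5
    static <T> double coefficient(Set<T> a, Set<T> b) {
        Set<T> intersection = new HashSet<T>(a);
        intersection.retainAll(b);
        Set<T> union = new HashSet<T>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 1.0; // treat two empty sets as identical
        }
        return (double) intersection.size() / union.size();
    }
}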


Now pretend we have two bags. Bag 1 contains Set 1, and Bag 2 contains Set 2. I tell you the JC between these two sets is 95%. I reach into Bag 1 and I pull out the element "M". What is the probability that the element "M" is also in Bag 2? It's 95%. That's what the Jaccard Coefficient tells me. But now here is something quite amazing. What is the probability that the lowest symbol (lowest according to some ordering, like a hashcode) in Bag 1 is the *same* letter as the lowest letter in Bag 2? Again, it's 95%. Why? Some letter has to be the lowest one in a bag. Since only 5% of the letters in the other bag ought to be different, there is a 95% chance that the lowest symbol in the second bag falls in the intersection of the sets, and is consequently the same.



Now it really gets interesting. If I group sets by a key, where the key is the lowest element of the set, then the probability that two sets get grouped by the same key is again the Jaccard Coefficient. That's cool. That's BADASS. Why? It's an almost shockingly simple way to create clusters of similar sets.

I find a lot of convoluted explanations of Minhashing, but you really don't need to know more than this guy's explanation. Here's a short piece of code that, given a sentence, splits the sentence into words, then builds N-grams out of the words (the N-grams are the set elements). Surrounding a word with ^ and $ is a well-known way of demarcating N-grams that are beginnings and endings of words. A SHA-1 cryptographic hash function is a nice way to assign a random but consistent ordering to any possible N-gram we might come up with. As suggested by this guy, rather than clustering on the *single* lowest element, we can tighten the clusters up by making the cluster key the numMins smallest elements, concatenated. I use this in a mapreduce job, where I write the minhash out as the key from the mapper, and the reducer collects all the sentences with the same minHash key, thereby clustering similar sentences. And lo and behold... it works.

      protected String minHash(String sentence, int nGramLength, int numMins) throws Exception{
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            PriorityQueue<Long> queue = new PriorityQueue<Long>();
            for(String part : sentence.split("\\s")){
                for(String gram : getNGrams("^"+part+"$", nGramLength)){
                    md.reset();
                    //take the first 8 bytes of the SHA-1 digest as a long: a random but consistent ordering for any n-gram
                    queue.add(ByteBuffer.wrap(md.digest(gram.getBytes("UTF-8"))).getLong());
                }
            }
            //concatenate the numMins smallest hashes to form the cluster key
            StringBuilder minHash = new StringBuilder();
            while(null != queue.peek() && numMins > 0){
                minHash.append(queue.remove());
                --numMins;
            }
            return minHash.toString();
        }
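And roughly how I drive it from the mapper (a sketch with hypothetical names; it assumes minHash() and getNGrams() above are methods of the mapper class, and that the reducer simply collects all values sharing a key):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MinHashMapper extends Mapper<LongWritable, Text, Text, Text> {
    // ... minHash() and getNGrams() from above live in this class ...
    @Override
    protected void map(LongWritable offset, Text line, Context context) throws IOException, InterruptedException {
        try {
            String sentence = line.toString();
            // 4-char n-grams, concatenate the 2 smallest hashes as the cluster key;
            // the reducer then sees all sentences with the same minhash key together
            context.write(new Text(minHash(sentence, 4, 2)), new Text(sentence));
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}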

Sunday, December 26, 2010

Tuning HBase and Mapreduce for Mass Inserts

For several months my colleagues and I have been earning our battle scars as we run progressively larger bulk inserts into HBase. I'm going to try to recall as many of the lessons learned as possible. At the time of this writing HBase 0.20.6 is the latest production release:

  1. Use HBase 0.20.6. We had been running 0.20.3, which suffered from a hideous deadlock bug that would cause clients to simply stall and do nothing until they were killed by the Hadoop task tracker. Even worse, because they were killed and restarted, this caused a vicious cycle of reducers stalling, restarting, stalling, etc. All the while, threads and client connections were eaten up in HBase.
  2. Use huge numbers of mappers and huge numbers of reducers so that your data sets are broken into bite-sized pieces. To do this, set mapred.tasktracker.[map|reduce].tasks.maximum (the max number that will be run by a single task tracker at once) to a value correlated to the number of cores on each box (we'll call this "C") and the number of boxes "B". Then set mapred.[map|reduce].tasks (the total number of mappers or reducers that will get run) equal to a*C*B. "a" is then a free parameter which controls how big your bite-sized batches of data will be. As a rule of thumb, if you are reading data out of HBase via TableMapper, I try to make a*C*B between 1000 and 10000, but it all depends on how much processing you need to do on each chunk. (See the sketch after this list.)
  3. This is an addendum to rule (2). All sorts of bad things happen when your mappers or reducers bite off too much data, and run for too long. Just a few items on the buffet of errors and warnings that you'll chase if your mappers or reducers try to handle too much data:
    • HBase client leases expiring
    • Chance of an OOME
    • Failure to report status for more than 600 seconds causes the task to get shot in the head
    • HBase client slows down on writes (anecdotal: I can't prove this, but I know that if you want to avoid potential slowdowns, then following rule 2, in my experience, helps a lot)
  4. Use fine-grained status. Make calls to progress(), setStatus(msg), and use counters to track what your job is doing. Nothing is more annoying than not being able to tell what the hell your job is doing when it has been running for 5 hours. Rule 2 helps here too. The more mappers and reducers you have, the more continuous and fine-grained your status reporting will be, as each task completes with its data load into HBase.
  5. Don't screw around too much with the default settings. Ryan Rawson has put out a good set of defaults in this PPT. You can turn these dials ad infinitum but I can pretty much promise you that there is no magic set of settings. This can be a massive time sink. Stick with Ryan's recommendations.
  6. Watch out for problems with your hostnames. Let's say your hostname is "s1.x.com". If you execute "hostname" and all you get back is "s1" or "localhost" you are in for problems. Check your /etc/hosts and make sure that aliases like "s1" don't come before the fully qualified hostname.
  7. Watch out for misconfigured NICs. Problems with vanishing packets sent me chasing my tail for weeks. This was the classic buffet of errors, with everything seemingly leading to no resolution.
  8. Disable the Write Ahead Log from your client code. Use an appropriately sized write buffer. There's a point of diminishing returns here: exceeding the size of the server-side buffers won't do you any good.
  9. Don't try to get too smart. Most bulk HBase data-producing M/R jobs are going to do some processing, then write the data from the reducer. Since the reducers all receive keys in the same order, this causes all the reducers to attack the same HBase region simultaneously. We had this "great idea" that if we reversed the keys that we wrote out of our mapper, then un-reversed them in the reducer, our reducers would be randomly writing to different region servers, not hitting a single region in lock step. Now, I have some theories on why this seemingly innocuous approach repeatably destroyed our entire HBase database. I won't wax philosophical here, but one thing is certain: any table created via batch inserts of randomized keys got totally hosed. Scans became dirt slow and compactions ran constantly, even days after the table was created. None of these problems made a whole lot of sense, which is why it took 3-4 weeks of debugging for us to back this "key randomizing" out of our code. The hosed tables actually had to be dropped for the problem, and the ensuing chaos, to totally abate.
  10. Tail all your logs and scan them assiduously for errors and warnings. I use VNCserver with different desktops for tails of all the logs so that I never have to "re-setup" all my tailf commands when I log in and out of the server.
  11. Disable speculative execution. You don't want two reducers trying to insert the exact same content into the database.
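To make rules 2 and 11 concrete, here's a sketch of the driver-side knobs (the numbers are hypothetical; plug in your own a, C, and B, and remember that mapred.tasktracker.*.tasks.maximum itself is set per node in mapred-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobTuning {
    static void tune(Job job) {
        Configuration conf = job.getConfiguration();

        int cores = 8;   // C: cores per box
        int boxes = 20;  // B: number of task tracker boxes
        int a = 10;      // free parameter: bigger a => smaller, bite-sized chunks per task

        // rule 2: lots of reducers, so no single task bites off too much data
        job.setNumReduceTasks(a * cores * boxes);            // 10 * 8 * 20 = 1600
        conf.setInt("mapred.map.tasks", a * cores * boxes);  // a hint only; input splits ultimately decide

        // rule 11: don't let two speculative attempts of the same task insert the same rows
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    }
}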




Monday, November 29, 2010

Black Friday Buys


Picked up a wireless access point for 10 bucks on buy.com. Placed it at the highest point in the house, and set it up to pull its own IP address from DHCP off my existing wireless access point, so it's basically a range extender. For a $10 "TRENDnet", I was really impressed at how easy it was to set up. I tried the same thing with a Linksys several months back. The Linksys would work on and off and just periodically die, and it cost something like $50. Anyway, I'm posting from my living room and receiving a good signal, so definitely worth the $10.

Saturday, November 27, 2010

NoSQL: Quite Possibly The Stupidest Catch Phrase Ever?

A recent article on Gigaom is titled "Why cloud computing sells and NoSQL is fading".  I'd like to parlay that discussion into one of "why NoSQL is quite possibly the stupidest catch phrase ever". Here's where I'm coming from. Clearly the "NoSQL" moniker arose from a desire to describe systems that avoided the scalability limits of relational databases. But from the get-go, the term is just...STUPID. First of all, SQL is a query language, and there are relational systems that don't use SQL. So, "NoSQL" isn't even a good way to say "non-relational". Secondly, defining systems in terms of what they *don't* do means that dozens of systems that shouldn't get grouped together, do get grouped together. Corn: NoSQL! Windbreakers: NoSQL!

Adding to the confusion was this classic: MapReduce: A Major Step Backwards. It's amazing that this was written by Michael Stonebraker, but this just goes to show you that even a genius can write a ridiculous article when he's writing for a PR blog. Then there was the 180-degree follow-up (oops, how stupid do we look?): Reaffirming Our Commitment to Hadoop and Mapreduce.

Honestly, the entire NoSQL thing is, to me, a mark of short-sighted PR as well as a failure to recognize that SQL databases can and must peacefully co-exist with systems designed to tackle other aspects of scalability and data processing. When Hadoop and Mapreduce get thrown in with "NoSQL" it especially drives me nuts, because Hadoop and Mapreduce are not even data retrieval or management systems! C'mon folks, Hadoop is for data processing. And there are lots of reasons to take the output and put it in a relational database.

So, I was happy to see the Gigaom article. The NoSQL emperor has no clothes. It's a stupid catch phrase, and a clear indicator that the person using it is clueless.




Sunday, October 24, 2010

SPIRE 2010 Los Cabos

I recently attended the String Processing and Information Retrieval conference (SPIRE 2010). In terms of content, this conference far exceeded my expectations. Excellent presentations on the latest research. Stuff I'd *never* even heard of: Wavelet Trees for example. Very friendly presenters, who enthusiastically spent time with me between presentations to thoroughly teach some of the concepts.

Yahoo! Research presents Time Explorer

Microsoft Research Keynote
No complaints about the venue either. Met so many cool and interesting researchers in the field of IR (which is basically "search").

Java SimpleDateFormat: simply a waste of Time

I've now sunk probably 8 hours into trying to make a minimally robust date parser that can handle dates in a variety of formats. Sadly, I'm afraid this is the best I've been able to do, given the fragility of the Java SimpleDateFormat class. I've composed a little framework in two pieces. The first piece is a properties file that defines the formats that the user will be allowed to enter dates in. The second piece is the code that jumps through an absurd series of hoops in order to work around bugs and shortcomings in SimpleDateFormat. If anyone knows a better way, that's been *tested*, please, for the love of GOD, post a comment here.

DateFormats.properties

format1=yyyy-MM-dd HH:mm:ss.SSS z
format2=yyyy-MM-dd HH:mm:ss.SSS
format3=yyyy-MM-dd HH:mm:ss
format4=yyyy-MM-dd HH:mm
format5=E, MMM dd, yyyy
format6=MMM dd yyyy
format7=MMM dd yyyy HH:mm
format8=MMM dd, yyyy
format9=MMM. dd, yyyy
format10=MM/dd/yy
format11=MM/dd/yyyy
format12=M/d/yy
format13=yyyy-MM-dd
format14=yyyy-MM-dd, E
format15=yyyy MMM. dd, HHh 1mmm ss's'
format16=yyyy/M/dd/HH:mm:ss
format17=yyyy-M-dd'T'HH:mm:ss


Notice in the formats above that all months and days are defined with two format chars, e.g. "MM" or "dd". Why not "M" or "d"? SimpleDateFormat format strings are very rigid. We want the user to be able to enter 10/1/07 or 10/01/07. But I don't want to clutter the properties file with single and double format chars. For this reason, the code automatically pads any single digit with a zero. So "1" in isolation is converted to "01", and if the user enters "Jul 7, 2001" the string is automatically zero-padded to "Jul 07, 2001". This allows the properties file to include only the double-digit variants of date numbers without having to account for every variation in which a number could be either single or double digit. With regard to the code below, I haven't taken time to extract it as a library, so please forgive the unexplained parameters such as "id" and "paramName". This code converts the time provided by the user into Greenwich Mean Time.
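One reason SimpleDateFormat needs so much babysitting, and why the code below re-formats ("regurgitates") every parsed date and compares it against the input: DateFormat.parse(String) happily ignores trailing text it couldn't match, even with lenient parsing turned off. A quick illustration:

import java.text.SimpleDateFormat;

public class TrailingGarbage {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        f.setLenient(false);
        // No ParseException here: the ".000 PST" suffix is silently dropped,
        // so the timezone the user typed never makes it into the Date.
        System.out.println(f.parse("2010-09-23 12:22:23.000 PST"));
    }
}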

Java Source Fragments

private static final Pattern ISOLATED_DIGIT = Pattern.compile("(^|\\D|\\s)(\\d)(\\D|$|\\s)"); //a digit can be surrounded by non-digits, but also by the beginning or end of the line, or whitespace


public static Object parseDate(String dateString, String id, String paramName) {
        if(null == dateString){
            return dateString;
        }
        if(dateString.equalsIgnoreCase("NOW")){
            //default time stamp is Zulu time (time at prime meridian, Greenwich)
            long currentTime = System.currentTimeMillis();
            Timestamp ts = new java.sql.Timestamp(currentTime-TimeZone.getDefault().getOffset(currentTime));
            return ts;
        }else{
            dateString = zeroPadIsolatedDigits(dateString);
            if(log.isDebugEnabled())log.debug("input zero-padded to " + dateString);
            for(Map.Entry entry:DATE_FORMATS.entrySet()){
                String key = entry.getKey().toString();
                DateFormat dateFormat = new SimpleDateFormat(entry.getValue().toString());
                dateFormat.setLenient(false);
                try{
                    java.util.Date date = dateFormat.parse(dateString);
                    
                    //I have found bugs in the parse method, in which strings that *don't* actually match the format template
                    //get parsed without Exception, thus truncating trailing parts of the date (like the timezone). For this reason
                    //I am forced to regurgitate the date string, and compare it to the input, and if they don't match, skip forward in the loop
                    String regurgitated = dateFormat.format(date);//...and no, we can't just compare the regurgitated string to the input, b/c PST can get adjusted to PDT
                    if(regurgitated.length() != dateString.length()){ //compare their lengths to make sure nothing got truncated
                        if(log.isDebugEnabled())log.debug(dateString + " was parsed with format " + entry.getValue() + " but was apparently parsed incorrectly as " + regurgitated + " length difference was "+(regurgitated.length()-dateString.length()));
                        continue;
                    }else{
                        //the length of the regurgitated string matches the length of the input.
                        //We still need to eat the regurgitated string, and compare the resulting  ms since epoch to the Date we originally parsed to make sure we didn't
                        //encounter a SimpleDateFormat parsing bug.
                         //Example: 2010-9-23 12:22:23.000 PST gets regurgitated as 2010-09-23 13:22:23.000 PDT, which is different text, but same ms since epoch
                        //So the above illustrates why we cannot just compare the regurgitated text to the input dateString, because Java may decide on an equivalent but
                        //textually different representation from the input (PST is the same as PDT in the half of the year when daylight savings time isn't in effect).
                        java.util.Date reparsed = dateFormat.parse(regurgitated);
                        if(reparsed.getTime() != date.getTime()){
                            if(log.isDebugEnabled())log.debug(dateString+" produces different ms since epoch than  " +regurgitated );
                            continue;
                        }
                    }
                    if(log.isDebugEnabled())log.debug("handled date" + dateString +" using format template: " + entry.getValue().toString() + " which regurgitated " + dateString);
                    TimeZone timeZone = dateFormat.getTimeZone();
                    //if(log.isDebugEnabled())log.debug(timeZone);
                    //convert to GMT time and account for timezone
                    return new java.sql.Timestamp(date.getTime()-timeZone.getOffset(date.getTime()));
                }catch(ParseException pe){
                    if(log.isDebugEnabled()){
                        log.debug(pe.getMessage());
                        log.debug(dateString + "couldn't be parsed with format " + entry.getValue());
                    }
                }
            }
            throw Util.newApplicationException(id, "date "+ dateString+" could not be parsed.","DataFormatException", 130, paramName);
        }
    }

     private static String zeroPadIsolatedDigits(String s) {
        //Java date formatting (with the two-char patterns like "MM") will prepend leading zeros to isolated digits. For this reason, I prepend a zero to the isolated digits before parsing the string, so that the regurgitation step
        //produces a string identical to the input
        boolean done = false;
        //this is an obscure case in which the matching regions overlap, therefore m.find() only finds one of the overlapping instances.
        //For instance, 100-1-2: find() won't find "-2 " because it overlaps "-1-". Consequently we have to repeatedly call find() on a new matcher after each replacement has been made
        while (!done) {
            Matcher m = ISOLATED_DIGIT.matcher(s);
            StringBuffer buf = new StringBuffer();
            if (m.find()) {
                if(log.isDebugEnabled())log.debug("found isolated digit:" + m.group(2));
                m.appendReplacement(buf, m.group(1) + "0" + m.group(2) + m.group(3)); //need to stuff back in the stuff before the isolated digit, then 0 (zero-pad), then the digit, then stuff after)
            }else {
                done = true;
            }
            m.appendTail(buf); //OK, all isolated digits have been replaced by leading-zero padded digits
            s = buf.toString(); //replace input with zero-padded input
        }
        return s;
    }
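Usage is then a single call (the id and paramName arguments here are hypothetical values; the method accepts anything matching DateFormats.properties, or the literal "NOW"):

// e.g. "7/4/11", "Jul 4, 2011", "2011-07-04 10:30", ... all come back as a GMT java.sql.Timestamp
Object when = parseDate("Jul 4, 2011", "request-42", "startDate");
Object now  = parseDate("NOW", "request-42", "startDate");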