when I'm not skateboarding: Minhashing is reaaally cool

What is the Jaccard Coefficient? Answer: one of the most important concepts in information retrieval and theory of sets. The Jaccard Coeffieicnt is very simple. It is the measure of the fractional similarity of two sets. Let's take a simple example {A,B,C,D,E,F} vs. {B,E,F}. Fractional similarity of these sets is 3/6. Easily you can see that 3 elements are shared (the intersection is {B,E,F}), whereas the total number of items in both sets (the union) is 6 elements ({A,B,C,D,E,F}).

So without any mumbo-jumbo, we can see the Jaccard Coefficient (JC) is just the ratio of sameness (intersection) over the total amount of stuff present. Technically it's the cardinality (count) of elements in the intersection over cardinality (count) of elements in the union. So while the JC between {A,B,C,D,E,F} and {B,E,F} is 50%, we can see that the JC between {A,B,C,D,E,F,G,H,I} and {B,E,F} is only 33.33%. This illustrates the importance of the total amount of stuff present. The more irrelevant (non-intersection) stuff is present, the lower the Jaccard Coefficient.

Now pretend we have two bags. Bag 1 contains Set 1, and Bag 2 contains Set 2. I tell you the JC between these two sets is 95%. I reach into Bag 1 and I pull out the element "M". What is the probability that the element "M" is also in Bag 2? It's 95%. That's what the Jaccard Coefficient tells me. But now here is something quite amazing. What is the probability that the lowest symbol (lowest according to some ordering, like a hashcode) in Bag 1 is the *same* letter as the lowest letter in Bag 2. Again, it's 95%. Why? Some letter has to be the lowest one in a bag. Since only 5% of the letters in the other bag ought to be different, then there is a 95% chance that the lowest symbol in the second bag falls in the intersection of the sets, and is consequently the same.

Now it really gets interesting. If I group sets by a key, where the key is the lowest element of the set, then the probability that two sets get grouped by the same key is again the Jaccard Coefficient. That's cool. That's BADASS. Why? It's an almost shockingly simple way to create clusters of similar sets.

I find a lot of convoluted explanations of Minhashing, but you really don't need to know more than this guy's explanation. Here's a short piece of code, that given a sentence, splits the sentence into words, then builds N-grams out of the words (the N-grams are the set elements). Surrounding a word with ^ and $ is a well-known way for demarcating N-grams that are beginning and endings of words. A SHA-1 cryotographic hash function is a nice way to assign a random, but consistent ordering to any possible N-gram we might come up with. As suggested by this guy, rather than clustering on the *single* lowest element, we can tighten the clusters up by making the cluster key the numMins smallest elements, concatenated. I use this in a mapreduce job, where I write the minhash out as the key from the mapper, and the reducer collects all the sentences with the same minHash key, thereby clustering similar sentences. And lo and behold...it works.

      protected String minHash(String sentence, int nGramLength, int numMins) throws Exception{
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            PriorityQueue queue = new PriorityQueue();
            for(String part :sentence.split("\\s")){
                for(String gram:getNGrams("^"+part+"$",nGramLength)){
                    md.reset();
                    queue.add(ByteBuffer.wrap(md.digest(gram.getBytes("UTF-8"))).getLong());
                }
            }            
            StringBuilder minHash = new StringBuilder();
            while(null != queue.peek() && numMins > 0){
                minHash.append(queue.remove());
                --numMins;
            }
            return minHash.toString();
        }.

4 comments:

Jean-Pierre11:47 PM
Hi Geoffrey.

I wonder how the implementation of getNGrams(String part, int nGramLength) looks like since N-grams are the set elements and the parameter part is just a word here....

Cheers jp
Geoffrey Hendrey7:00 AM
2grams of "automatic": AU UT TO OM MA AT TI IC
So you can see that grams come from characters of individual words. Hope that helps.
Anonymous2:01 PM
This is incorrect, "What is the probability that the element "M" is also in Bag 2? It's 95%."
Geoffrey Hendrey10:07 AM
Help me understand why you say this is incorrect. By definition, two sets with jaccard coefficient of 0.95 have a 95% probability of intersection for any given element. Therefore, we can restate my assertion as "given that set A (bag 1) contains element x, what is the probability that set B (bag 2) also contains element x?", which can simply be restated as "what is the probability of intersection between set A and set B", which can in turn be restated as "what is the jaccard coefficient between set A and set B.".

when I'm not skateboarding

Tuesday, January 04, 2011

Minhashing is reaaally cool

4 comments:

Followers

Blog Archive

About Me

when I'm not skateboarding

Tuesday, January 04, 2011

Minhashing is reaaally cool

4 comments:

Subscribe To

Followers

Blog Archive

About Me