mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From 刘鎏 <liuliu....@gmail.com>
Subject Re: MinHash implementation
Date Tue, 16 Aug 2011 06:23:40 GMT
I think, if your input vector is a set, the ele.get() should be used,
instead, if your input vector is a sparse vector, the ele.index() would be
used.

Pls correct me if I'm wrong.

 for (int i = 0; i < numHashFunctions; i++) {
     for (Vector.Element ele : featureVector) {

/// Shouldn't the following line say ele.index();
       int value = (int) ele.get();

       bytesToHash[0] = (byte) (value >> 24);
       bytesToHash[1] = (byte) (value >> 16);
       bytesToHash[2] = (byte) (value >> 8);
       bytesToHash[3] = (byte) value;
       int hashIndex = hashFunction[i].hash(bytesToHash);
       if (minHashValues[i] > hashIndex) {
         minHashValues[i] = hashIndex;
       }
     }
   }

On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <dscheffy@gmail.com> wrote:

> I'm having a little trouble understanding Mahout's minhash implementation.
>
> Please correct me if I'm wrong, but the general intent of minhash is to
> evaluate the similarity of two sparse feature vectors based on the features
> they have in common, not necessarily the value of those features (as the
> values are often 1 or 0 and 0 values simply aren't tracked in the sparse
> vector).  So given a space of 10 dimensions, if Jack had features 4 and 6
> and Jill had features 5 and 6, Jacks vector would look something like {4:1,
> 6:1} and Jill's would like like {5:1, 6:1}.  Since they have 1/3 total
> features in common, their Jaccard coefficient is 1/3.  Also, given K random
> hash functions, we would expect about a third of them to return a minimum
> value for each of the three keys 4, 5 and 6 and thus about a third of them
> would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
> third that return a minimum hash value for the key 6).  That's my basic
> English explanation of the purpose of minhash -- again, somebody please
> correct me if I'm wrong.
>
> Given that understanding, can somebody explain why Mahout's minhash
> implentation is hasing the values from the feature vectors rather than the
> keys?
>
> See the following code from MinHashMapper.java
>
>    for (int i = 0; i < numHashFunctions; i++) {
>      for (Vector.Element ele : featureVector) {
>
> /// Shouldn't the following line say ele.index();
>        int value = (int) ele.get();
>
>        bytesToHash[0] = (byte) (value >> 24);
>        bytesToHash[1] = (byte) (value >> 16);
>        bytesToHash[2] = (byte) (value >> 8);
>        bytesToHash[3] = (byte) value;
>        int hashIndex = hashFunction[i].hash(bytesToHash);
>        if (minHashValues[i] > hashIndex) {
>          minHashValues[i] = hashIndex;
>        }
>      }
>    }
>
> The code in TestMinHashClustering also seems to written with the
> expectation
> that minhash should be hashing the values rather than the keys.  Am I
> reading something wrong here?  Is this the intended use?  Are we supposed
> to
> be putting the feature ids into the double value fields of the feature
> vectors?
>
> Thanks,
> Jeff
>



--

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message