Some Mahout people are not sure that the MinHash implementation is
right. You should try your own version of how the algorithm should work.
On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <chszwu@yahoo.com> wrote:
> Sorry, forget about my previous comment about 2.
> 
> For general purposes, we might sometimes need to hash on the value, so we
> might need an extra parameter, hashOnKey?
>
> 
> That is not correct: the hash key selection is based on the mapper's input
> values. If the mapper input value is a vector, then we need to select by the
> vector's key or value (using value() or index()), not by the mapper key or
> mapper value.
>
> The extra parameter might look like hashKeyOnMapperKey (or hashKeyOnMapperValue).
>
>
>
> Sam
>
>
> ________________________________
> From: sam wu <swu5530@gmail.com>
> To: user@mahout.apache.org
> Sent: Friday, June 15, 2012 5:38 PM
> Subject: MinHash implementation thoughts
>
> I am playing with MinHashMapper, and have the following two-penny thoughts:
>
> 1. Efficiency
>
> In map(..), around line 85:
> 
> for (int i = 0; i < numHashFunctions; i++) {
>   for (Vector.Element ele : featureVector) {   // *
>     int value = (int) ele.get();               // **
>     bytesToHash[0] = (byte) (value >> 24);
>     .....
>   }
> }
> 
> The featureVector (which can be a random sparse vector) iterator * goes
> through all elements (including the zeros). When I test with the ASF email
> example (from Grant's ...), each email has ~30000 elements. By using
> iterateNonZero(), I cut that from 30000 down to around 50 operations (both
> the int-to-byte[] conversion and the hash computation) for some emails.
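> A minimal, self-contained sketch of this efficiency point. It uses a plain
> HashMap as a toy stand-in for Mahout's sparse Vector, and the class and
> method names here are my own, not Mahout's:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseIterationSketch {

    // Dense-style loop over a toy sparse vector (index -> value, zeros not
    // stored): one hash operation per position, whether it is zero or not.
    static int denseOps(Map<Integer, Double> sparse, int cardinality) {
        int ops = 0;
        for (int i = 0; i < cardinality; i++) {
            sparse.getOrDefault(i, 0.0); // this value would be hashed
            ops++;
        }
        return ops;
    }

    // Non-zero-only loop, analogous in spirit to Vector.iterateNonZero():
    // only the stored (non-zero) entries get converted and hashed.
    static int sparseOps(Map<Integer, Double> sparse) {
        int ops = 0;
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            ops++;
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<Integer, Double> sparse = new HashMap<>();
        sparse.put(3, 1.5);
        sparse.put(40, 0.7);
        sparse.put(29999, 2.0);
        System.out.println(denseOps(sparse, 30000)); // 30000
        System.out.println(sparseOps(sparse));       // 3
    }
}
```

> Iterating only the stored entries is the same idea as switching the mapper
> loop from the full iterator to the non-zero iterator.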
>
> 2. Which field to hash
>
> We read from a sequence file, so we have a key and a value. If we read from
> a tf-idf doc sequenceFile, then the key will be the termId and the value
> will be the tf-idf weight. What should we hash for MinHash? I think we need
> to hash on the termId instead of the tf-idf weight (which is meaningless for
> doc MinHash?), so we need to get
>
> value = ele.index() **
>
> When I test the current code, I get almost all values = 0.
>
> For general purposes, we might sometimes need to hash on the value, so we
> might need an extra parameter, hashOnKey?
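> A quick sketch of why hashing on the value gives "almost all value = 0" for
> tf-idf input (hashInput and the hashOnValue flag are hypothetical names of
> mine, not anything in Mahout):

```java
public class HashFieldSketch {

    // Hypothetical selector: hash the term id (index) or the weight (value).
    static int hashInput(int termId, double tfidf, boolean hashOnValue) {
        return hashOnValue ? (int) tfidf : termId;
    }

    public static void main(String[] args) {
        // tf-idf weights are usually small doubles, so casting them to int
        // collapses most of them to 0 -- hence "almost all value = 0".
        System.out.println(hashInput(1234, 0.37, true));  // 0
        System.out.println(hashInput(1234, 0.37, false)); // 1234
    }
}
```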
>
> 3. How to compute the clusterID?
>
> I thought LSH normally has r (rows in a band) and b (number of bands):
>
> numHashFunctions = r * b
>
> #clusterIds = b
>
> so we can easily compute the candidate probability 1 - (1 - s^r)^b
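> That probability is the standard LSH S-curve; a small sketch to evaluate it
> (candidateProb is just an illustrative helper of mine):

```java
public class LshProbability {

    // P(two docs with Jaccard similarity s agree on all r rows of at least
    // one of the b bands) = 1 - (1 - s^r)^b.
    static double candidateProb(double s, int r, int b) {
        return 1.0 - Math.pow(1.0 - Math.pow(s, r), b);
    }

    public static void main(String[] args) {
        // e.g. r = 5 rows per band, b = 20 bands, so numHashFunctions = 100
        System.out.println(candidateProb(0.8, 5, 20)); // ~0.9996
        System.out.println(candidateProb(0.2, 5, 20)); // ~0.0064
    }
}
```

> Similar pairs almost always become candidates, dissimilar ones almost never
> do, which is the whole point of choosing r and b explicitly.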
>
> for (int i = 0; i < b; i++) {
>   for (int j = 0; j < r; j++) {
>     clusterIdBuilder.append(minHashValues[i * r + j]).append('-'); // some separator
>   }
>   ....
> }
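> A self-contained version of that loop (clusterIds is my own helper name; note
> the flat index i * r + j for band i of width r, and '-' is just an assumed
> separator):

```java
public class ClusterIdSketch {

    // Build one cluster id per band from the flat minhash array;
    // assumes minHashValues.length == r * b.
    static String[] clusterIds(int[] minHashValues, int r, int b) {
        String[] ids = new String[b];
        for (int i = 0; i < b; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < r; j++) {
                // band i spans positions [i * r, i * r + r)
                sb.append(minHashValues[i * r + j]).append('-');
            }
            ids[i] = sb.toString();
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] mh = {1, 2, 3, 4, 5, 6};        // r = 3 rows, b = 2 bands
        String[] ids = clusterIds(mh, 3, 2);
        System.out.println(ids[0]); // 1-2-3-
        System.out.println(ids[1]); // 4-5-6-
    }
}
```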
>
> the current implementation works in some way, but ??
>
>
> It is late on a Friday afternoon,
>
> hope I am still making sense.
>
>
> BR
>
> Sam

Lance Norskog
goksron@gmail.com
