hadoop-mapreduce-user mailing list archives

From Chris Mawata <chris.maw...@gmail.com>
Subject Re: grouping similar items together
Date Fri, 20 Jun 2014 22:09:55 GMT
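1. We can't see your reduce algorithm, so we can't tell you why the 'group'
you think should work is not working.
2. The relation you have is not transitive, so you will not have equivalence
classes: A can be within distance 2 of B and B within distance 2 of C while A
and C are up to 4 bits apart.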
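
To make point 2 concrete, here is a minimal sketch. It is not the original poster's code: the class name and the sample values a, b, c are made up for illustration, and hammingDist is assumed to count differing bits, which the original post does not show.

```java
class HammingTransitivityDemo {

    // Assumed definition: number of bit positions in which a and b differ.
    static int hammingDist(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // The similarity relation from the question: distance of at most 2.
    static boolean similar(long a, long b) {
        return hammingDist(a, b) <= 2;
    }

    // Same logic as the compareTo in the question, applied to plain longs.
    static int compare(long a, long b) {
        if (similar(a, b)) {
            return 0;
        }
        return Long.compare(a, b);
    }

    public static void main(String[] args) {
        // Hypothetical values chosen so that the chain a ~ b ~ c breaks.
        long a = 0b0000L;
        long b = 0b0011L; // differs from a in 2 bits
        long c = 0b1111L; // differs from b in 2 bits, from a in 4 bits

        System.out.println(similar(a, b)); // true
        System.out.println(similar(b, c)); // true
        System.out.println(similar(a, c)); // false

        // compare says a "equals" b and b "equals" c, but a does not "equal" c.
        System.out.println(compare(a, b) == 0); // true
        System.out.println(compare(b, c) == 0); // true
        System.out.println(compare(a, c) == 0); // false

        // The two distinct values from the question are exactly 2 bits apart.
        System.out.println(hammingDist(69215512L, 69215568L)); // 2
    }
}
```

Because compare reports a equal to b and b equal to c but a different from c, it does not define a consistent ordering. Sort-based grouping will likely depend on which pairs of keys happen to be compared and in what order, so similar keys can land in different groups.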
Chris
On Jun 20, 2014 2:51 PM, "parnab kumar" <parnab.2007@gmail.com> wrote:

> Hi,
>
>     I have a set of hashes. Each hash is a 32-bit long integer. Two hashes
> are similar if their Hamming distance is less than or equal to 2.
>
> I need to group together hashes that are mutually similar to one another,
> i.e. each line of the output file should contain mutually similar keys.
>
> I implemented a custom Writable, and its compareTo method looks as
> follows:
>
> public int compareTo(Object o) {
>     Long thisHash = this.hash;
>     Long thatHash = ((DocumentHash) o).hash;
>     // Treat hashes within Hamming distance 2 as equal
>     if (hammingDist(thisHash, thatHash) <= 2) {
>         return 0;
>     }
>     return thisHash.compareTo(thatHash);
> }
>
>
> In the map function I emit the custom Writable as the key, and in the
> reduce phase I group by the keys.
>
> I checked the output file and exhaustively tested the hashes manually, and
> found that most hashes on each line are mutually similar. However, I found
> that some hashes, even though they are similar to a group, are not in that
> group's line in the output.
>
> For example, consider the following hashes:
>
> HASH1 = 69215512
> HASH2 = 69215512
> HASH3 = 69215512
> HASH4 = 69215568
>
> All 4 hashes above are mutually similar and are within distance 2 of
> each other. Still, in the output file I found two separate records, where
> HASH1 and HASH2 occur on one line and HASH3 and HASH4 occur on another,
> as follows:
>
> HASH4    HASH3
> HASH1    HASH2
>
>
> Can someone explain why this happens?
>
>
> Thanks,
> Parnab.
>
>
>
