hadoop-hdfs-user mailing list archives

From parnab kumar <parnab.2...@gmail.com>
Subject grouping similar items together
Date Fri, 20 Jun 2014 18:51:12 GMT

I have a set of hashes. Each hash is a 32-bit value stored as a Long. Two
hashes are similar if their Hamming distance is less than or equal to 2.
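The similarity test described above can be sketched in Java as follows. This is only an illustration: the class name HammingSketch and the helper similar are hypothetical, and hammingDist here is an assumed implementation of the helper the message calls later, not the poster's actual code. The two values in main are taken from the example further down in this message.

```java
public class HammingSketch {
    // Hamming distance: the number of bit positions in which a and b differ.
    // XOR leaves a 1 in every differing position; bitCount counts those 1s.
    static int hammingDist(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Hypothetical helper: "similar" as defined in the message (distance <= 2).
    static boolean similar(long a, long b) {
        return hammingDist(a, b) <= 2;
    }

    public static void main(String[] args) {
        // Two of the example hashes from this message.
        System.out.println(hammingDist(69215512L, 69215568L)); // prints 2
        System.out.println(similar(69215512L, 69215568L));     // prints true
    }
}
```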

I need to group together hashes that are mutually similar to one another,
i.e. each line of the output file should contain mutually similar keys.

I implemented a custom Writable, and its compareTo method looks as
follows:

    public int compareTo(Object o) {
        Long thisHash = this.hash;
        Long thatHash = ((DocumentHash) o).hash;
        // Treat hashes within Hamming distance 2 as equal keys.
        if (hammingDist(thisHash, thatHash) <= 2) {
            return 0;
        }
        return thisHash.compareTo(thatHash);
    }

In the map function I emit the custom Writable as the key, and in the
reduce function I group by the keys.
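One thing that may be worth checking with a comparison like this is whether it satisfies the java.lang.Comparable contract, which requires that x.compareTo(y) == 0 implies sgn(x.compareTo(z)) == sgn(y.compareTo(z)) for all z. The sketch below applies the same logic to plain long values. The class name CompareContractCheck, the compare helper, the hammingDist implementation, and the three test values are all assumptions chosen for illustration; they are not taken from the poster's code or data.

```java
public class CompareContractCheck {
    // Assumed implementation of the hammingDist helper used in the message.
    static int hammingDist(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Same logic as the compareTo in the message, on plain long values.
    static int compare(long thisHash, long thatHash) {
        if (hammingDist(thisHash, thatHash) <= 2) {
            return 0;
        }
        return Long.compare(thisHash, thatHash);
    }

    public static void main(String[] args) {
        // Illustrative values: a and b differ in 2 bits, b and c differ in
        // 2 bits, but a and c differ in 4 bits.
        long a = 0b0000L, b = 0b0011L, c = 0b1111L;
        System.out.println(compare(a, b)); // prints 0
        System.out.println(compare(b, c)); // prints 0
        System.out.println(compare(a, c)); // prints -1
    }
}
```

Here a compares equal to b and b compares equal to c, yet a does not compare equal to c, so for these values the comparison does not behave like a consistent ordering.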

I checked the output file, tested the hashes exhaustively by hand, and
found that in most lines the hashes are mutually similar. However, I found
that some hashes, even though they are similar to a group, are not in the
same line as that group.

For example, consider the following hashes:

HASH1 = 69215512
HASH2 = 69215512
HASH3 = 69215512
HASH4 = 69215568

All four hashes above are mutually similar, within a distance of 2 of
each other. Still, in the output file I found two separate records, with
HASH1 and HASH2 on one line and HASH3 and HASH4 on another line,
as follows:


Can someone explain why this happens?

