hadoop-hdfs-user mailing list archives

From Stanley Shi <s...@gopivotal.com>
Subject Re: grouping similar items together
Date Tue, 24 Jun 2014 06:17:11 GMT
The "similar" relation is not transitive: if a is similar to b and b is
similar to c, a may still not be similar to c.
How do you form the groups in that case?
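
To make this concrete, here is a minimal sketch. The class name and the
three values a, b, c are made up for illustration; only the threshold of 2
and the two hash values in the last line come from the quoted message below.

```java
public class HammingTransitivityDemo {
    // Number of bit positions in which the two values differ.
    static int hammingDist(long x, long y) {
        return Long.bitCount(x ^ y);
    }

    // "Similar" as defined in the question: Hamming distance <= 2.
    static boolean similar(long x, long y) {
        return hammingDist(x, y) <= 2;
    }

    public static void main(String[] args) {
        // Hypothetical values chosen so that each neighbour differs by 2 bits.
        long a = 0b0000L, b = 0b0011L, c = 0b1111L;
        System.out.println("a ~ b: " + similar(a, b)); // distance 2
        System.out.println("b ~ c: " + similar(b, c)); // distance 2
        System.out.println("a ~ c: " + similar(a, c)); // distance 4
        // The two distinct hash values from the question.
        System.out.println("dist: " + hammingDist(69215512L, 69215568L));
    }
}
```

With a compareTo that returns 0 whenever two hashes are similar, a equals b
and b equals c, yet a compares as less than c. The Comparable contract
requires that x.compareTo(y) == 0 implies x and y compare the same way
against every other z, so a comparison like this can give the sort an
inconsistent order.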

Regards,
*Stanley Shi,*



On Sat, Jun 21, 2014 at 2:51 AM, parnab kumar <parnab.2007@gmail.com> wrote:

> Hi,
>
>     I have a set of hashes. Each hash is a 32-bit long integer. Two hashes
> are similar if their Hamming distance is less than or equal to 2.
>
> I need to group together hashes that are mutually similar to one another,
> i.e. each line of the output file should contain mutually similar keys.
>
> I implemented a custom Writable, and its compareTo method looks as
> follows:
>
> public int compareTo(Object o) {
>     Long thisHash = this.hash;
>     Long thatHash = ((DocumentHash) o).hash;
>     // Treat hashes within Hamming distance 2 as equal.
>     if (hammingDist(thisHash, thatHash) <= 2) {
>         return 0;
>     }
>     return thisHash.compareTo(thatHash);
> }
>
>
> In the map function I emit the custom Writable as the key, and in the
> reduce phase I group by key.
>
> I checked the output file and exhaustively tested the hashes manually, and
> found that most hashes in each line are mutually similar. However, I found
> that some hashes, even though they are similar to a group, do not appear in
> that group's output line.
>
> For example: consider the following hashes :
>
> HASH1 = 69215512
> HASH2 = 69215512
> HASH3 = 69215512
> HASH4 = 69215568
>
> All four hashes above are mutually similar and within distance 2 of each
> other. Still, in the output file I found two separate records, with HASH1
> and HASH2 on one line and HASH3 and HASH4 on another, as follows:
>
> HASH4    HASH3
> HASH1    HASH2
>
>
> Can someone explain why this happens?
>
>
> Thanks,
> Parnab.
>
>
>
