hadoop-common-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject RE: How to design the mapper and reducer for the following problem
Date Sun, 16 Jun 2013 19:40:01 GMT
On further thought, it would be simpler to augment Reducer1 to use disk when the tuples for
a hash do not fit into memory.  The nested loop over the disk file is a sequential scan and
will be fast.  Then you can avoid the distributed join.
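
Something like this untested sketch is what I mean -- the class name and the temp-file
handling are made up for illustration:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Disk-backed Reducer1: assumes Mapper1 emits (HASH, DOCID) as Text/Text.
public class DiskBackedReducer1 extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text hash, Iterable<Text> docIds, Context ctx)
      throws IOException, InterruptedException {
    // The values iterator is single-pass, so copy the DOCIDs for this
    // hash to a local temp file that the inner loop can re-read.
    File spill = File.createTempFile("hashgroup", ".ids");
    try (PrintWriter w = new PrintWriter(new FileWriter(spill))) {
      for (Text id : docIds) {
        w.println(id.toString());
      }
    }
    // Nested loop: both passes are plain sequential scans of the file.
    try (BufferedReader outer = new BufferedReader(new FileReader(spill))) {
      String a;
      while ((a = outer.readLine()) != null) {
        try (BufferedReader inner = new BufferedReader(new FileReader(spill))) {
          String b;
          while ((b = inner.readLine()) != null) {
            if (a.compareTo(b) < 0) {   // emit each pair exactly once
              ctx.write(new Text(a + "\t" + b), hash);
            }
          }
        }
      }
    } finally {
      spill.delete();
    }
  }
}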
john

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Sunday, June 16, 2013 1:25 PM
To: user@hadoop.apache.org
Subject: RE: How to design the mapper and reducer for the following problem

You basically have a "record similarity scoring and linking" problem -- common in data-quality
software like ours.  This can be thought of as computing the cross-product of all records,
counting the number of hash keys each pair has in common, and then outputting the pairs that
exceed a threshold.  This is very slow for large data because the intermediate data set -- or
at least the number of comparisons -- grows as N-squared.

If you have assurance that the frequency of a given HASH value is low, such that all instances
of records containing a given hash key can fit into memory, it can be done as follows:

1) Mapper1 outputs four tuples per input record, with the hash as key: {HASH1, DOCID},
{HASH2, DOCID}, {HASH3, DOCID}, {HASH4, DOCID}

2) Reducer1 loads all tuples with the same HASH into memory.

3) Reducer1 outputs all tuples { DOCID1, DOCID2, HASH } that share the hash key (nested
loop, but only output where DOCID1 < DOCID2).

4) Mapper2 loads the tuples from Reducer1 and treats { DOCID1, DOCID2 } as the key.

5) Reducer2 counts {DOCID1, DOCID2} instances and outputs the DOCID pairs whose counts
exceed the threshold (see the sketch after this list).
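
Roughly, in code, the two jobs could look like the following untested sketch (the class
names and the "min.shared.hashes" property are made up for illustration):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SharedHashPairs {

  // Job 1, Mapper1: "DOCID HASH1 HASH2 HASH3 HASH4" -> (HASHi, DOCID).
  public static class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().trim().split("\\s+");
      for (int i = 1; i < f.length; i++) {
        ctx.write(new Text(f[i]), new Text(f[0]));
      }
    }
  }

  // Job 1, Reducer1: assumes all DOCIDs for one HASH fit in memory;
  // nested loop emits each pair sharing the hash exactly once.
  public static class Reducer1 extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hash, Iterable<Text> docIds, Context ctx)
        throws IOException, InterruptedException {
      List<String> ids = new ArrayList<>();
      for (Text id : docIds) {
        ids.add(id.toString());
      }
      for (String a : ids) {
        for (String b : ids) {
          if (a.compareTo(b) < 0) {                 // DOCID1 < DOCID2
            ctx.write(new Text(a + "\t" + b), hash);
          }
        }
      }
    }
  }

  // Job 2, Mapper2: re-key Job 1's "DOCID1 <tab> DOCID2 <tab> HASH"
  // lines on the {DOCID1, DOCID2} pair, with a count of 1 each.
  public static class Mapper2
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      ctx.write(new Text(f[0] + "\t" + f[1]), ONE);
    }
  }

  // Job 2, Reducer2: count shared hashes per pair and keep pairs at or
  // above the threshold.
  public static class Reducer2
      extends Reducer<Text, IntWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> ones, Context ctx)
        throws IOException, InterruptedException {
      int threshold = ctx.getConfiguration().getInt("min.shared.hashes", 2);
      int count = 0;
      for (IntWritable one : ones) {
        count += one.get();
      }
      if (count >= threshold) {
        ctx.write(pair, NullWritable.get());
      }
    }
  }
}

A word-count-style combiner on Job 2 (summing the 1s, without the threshold test) would cut
the shuffle volume.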

If you have no such assurance, make Mapper1 a map-only job and replace Reducer1 with a new
job that joins by HASH.  Joins are not standardized in MR, but can be done with MultipleInputs,
and of course Pig has this built in.  Searching on "Hadoop join" will give you some ideas of
how to implement it in straight MR; a skeleton is sketched below.
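
For a starting point, bare-bones MultipleInputs wiring for a reduce-side join could look like
this untested sketch (paths, tags, and class names are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Reduce-side join on HASH: each input gets its own mapper, which tags
// its DOCIDs so the reducer can tell the two sides apart.
public class HashJoin {

  public static class SideAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");   // HASH <tab> DOCID
      ctx.write(new Text(f[0]), new Text("A:" + f[1]));
    }
  }

  public static class SideBMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      ctx.write(new Text(f[0]), new Text("B:" + f[1]));
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hash, Iterable<Text> tagged, Context ctx)
        throws IOException, InterruptedException {
      // Buffer each side by its tag, then cross them.  A secondary sort
      // that forces one side to arrive first would let the other stream.
      List<String> a = new ArrayList<>();
      List<String> b = new ArrayList<>();
      for (Text t : tagged) {
        String v = t.toString();
        if (v.startsWith("A:")) a.add(v.substring(2));
        else b.add(v.substring(2));
      }
      for (String x : a) {
        for (String y : b) {
          ctx.write(new Text(x + "\t" + y), hash);
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hash join");
    job.setJarByClass(HashJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, SideAMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, SideBMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}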

John


From: parnab kumar [mailto:parnab.2007@gmail.com]
Sent: Friday, June 14, 2013 8:06 AM
To: user@hadoop.apache.org
Subject: How to design the mapper and reducer for the following problem

An input file where each line corresponds to a document.  Each document is identified by some
fingerprints.  For example, a line in the input file is of the following form:

input:
---------------------
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The output of the MapReduce job should contain the pairs of DOCIDs that share at least a
threshold number of HASHes in common.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5
