hadoop-mapreduce-user mailing list archives

From Sanjay Subramanian <Sanjay.Subraman...@wizecommerce.com>
Subject Re: How to design the mapper and reducer for the following problem
Date Fri, 14 Jun 2013 16:15:46 GMT
Hi

My quick and dirty non-optimized solution would be as follows

MAPPER
=======
OUTPUT from Mapper
    <Key = Sorted List {HASH1,HASH2,HASH3,HASH4} >      <Value = DOCID1~HASH1 HASH2 HASH3 HASH4>
    <Key = Sorted List {HASH1,HASH2,HASH3,HASH4} >      <Value = DOCID2~HASH5 HASH3 HASH1 HASH4>

REDUCER
========
Iterate over keys
For a key = (say) {HASH1,HASH2,HASH3,HASH4}
     Format the collection of values into some StringBuilder kind of class

Output
KEY = {DOCID1 DOCID2}  value = null
KEY = {DOCID3 DOCID5} value = null
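
For illustration only, here is a minimal Python sketch of the scheme outlined above, run in memory rather than on Hadoop. The function names (mapper, reducer, run) are hypothetical, the dictionary stands in for the shuffle phase, and the value carries only the DOCID rather than the full DOCID~hashes string.

```python
from collections import defaultdict


def mapper(line):
    """Emit (sorted hash list, doc id) for one input line."""
    doc_id, *hashes = line.split()
    yield ",".join(sorted(hashes)), doc_id


def reducer(key, doc_ids):
    """Emit the doc ids that share one key, if there is more than one."""
    docs = sorted(doc_ids)
    if len(docs) > 1:
        yield " ".join(docs)


def run(lines):
    # Stand-in for the shuffle phase: group mapper values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Call the reducer once per key, in sorted key order.
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]
```

In this sketch two documents land in the same reducer call only when their hash sets are identical, so DOCID1 and DOCID2 from the sample input (three of four hashes in common) would not be paired by it.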

Hope I have understood your problem correctly. If not, sorry about that.

sanjay

From: parnab kumar <parnab.2007@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Friday, June 14, 2013 7:06 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: How to design the mapper and reducer for the following problem

I have an input file where each line corresponds to a document. Each document is identified by
some fingerprints (hashes). For example, a line in the input file is of the following form:

input:
---------------------
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The MapReduce job should output each pair of DOCIDs that share at least a threshold number
of hashes in common.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5
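
One common way to approach the threshold requirement stated above is an inverted index over hashes followed by a pair count. This is a different technique from the sorted-list key in the reply, and it is only a sketch under assumptions: it runs in memory rather than on Hadoop, the names (map_hash_to_doc, reduce_to_pairs, similar_pairs) are hypothetical, and "threshold" is read as "at least this many shared hashes".

```python
from collections import defaultdict
from itertools import combinations


def map_hash_to_doc(line):
    """Stage 1 map: emit (hash, doc id) for every distinct hash on the line."""
    doc_id, *hashes = line.split()
    for h in set(hashes):
        yield h, doc_id


def reduce_to_pairs(doc_ids):
    """Stage 1 reduce: for one hash, emit every pair of docs containing it."""
    for a, b in combinations(sorted(set(doc_ids)), 2):
        yield (a, b), 1


def similar_pairs(lines, threshold):
    # Stand-in for the first shuffle: group doc ids by hash.
    by_hash = defaultdict(list)
    for line in lines:
        for h, doc_id in map_hash_to_doc(line):
            by_hash[h].append(doc_id)
    # Stage 2: sum the per-hash pair counts, keyed by doc pair.
    shared = defaultdict(int)
    for doc_ids in by_hash.values():
        for pair, one in reduce_to_pairs(doc_ids):
            shared[pair] += one
    # Keep the pairs that share at least `threshold` hashes.
    return sorted(f"{a} {b}" for (a, b), n in shared.items() if n >= threshold)
```

With the two sample lines and a threshold of 3, this returns the single pair DOCID1 DOCID2. A hash that appears in very many documents generates a quadratic number of pairs in stage 1, so a real job would likely need to cap or filter such hashes.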

