From: John Lilley <john.lilley@redpoint.net>
To: user@hadoop.apache.org
Subject: RE: How to design the mapper and reducer for the following problem
Date: Sun, 16 Jun 2013 19:25:20 +0000

You basically have a "record similarity scoring and linking" problem -- common in data-quality software like ours. It can be thought of as computing the cross-product of all records, counting the number of hash keys each pair has in common, and then outputting the pairs that exceed a threshold. Done naively, this is very slow for large data because the intermediate data set, or at least the number of iterations, is N-squared in the number of records.

If you have assurance that the frequency of any given HASH value is low, such that all records containing a given hash key can fit into memory, it can be done as follows (a rough sketch of the two jobs appears after the list):

1) Mapper1 outputs four tuples per input record, with the hash as the key: {HASH1, DOCID}, {HASH2, DOCID}, {HASH3, DOCID}, {HASH4, DOCID}
2) Reducer1 loads all tuples with the same HASH into memory.
3) Reducer1 outputs all tuples { DOCID1, DOCID2, HASH } that share that hash key (a nested loop, but only outputting pairs where DOCID1 < DOCID2).
4) Mapper2 loads the tuples from Reducer1 and treats { DOCID1, DOCID2 } as the key.
5) Reducer2 counts the instances of each {DOCID1, DOCID2} and outputs the DOCID pairs whose count exceeds the threshold.
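For concreteness, here is a minimal, untested sketch of those two jobs against the org.apache.hadoop.mapreduce API. The class names and the "similarity.threshold" configuration key are made up for illustration, and it assumes whitespace-delimited input lines of the form "DOCID HASH1 HASH2 HASH3 HASH4":

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SharedHashPairs {

  // Job 1, Mapper1: emit one {HASH, DOCID} tuple per hash on the input line.
  public static class HashMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().trim().split("\\s+");
      for (int i = 1; i < f.length; i++) {
        ctx.write(new Text(f[i]), new Text(f[0]));          // key = HASHi, value = DOCID
      }
    }
  }

  // Job 1, Reducer1: buffer every DOCID that carries this hash (assumed to fit
  // in memory), then emit each ordered DOCID pair once for this hash.
  public static class PairReducer extends Reducer<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void reduce(Text hash, Iterable<Text> docs, Context ctx)
        throws IOException, InterruptedException {
      List<String> ids = new ArrayList<String>();
      for (Text d : docs) ids.add(d.toString());            // copy; Hadoop reuses the Text object
      for (int i = 0; i < ids.size(); i++) {
        for (int j = i + 1; j < ids.size(); j++) {
          String a = ids.get(i), b = ids.get(j);
          if (a.compareTo(b) > 0) { String t = a; a = b; b = t; }  // keep DOCID1 < DOCID2
          ctx.write(new Text(a + "\t" + b), ONE);
        }
      }
    }
  }

  // Job 2, Mapper2: re-key Job 1's text output ("DOCID1 <tab> DOCID2 <tab> 1")
  // by the {DOCID1, DOCID2} pair.
  public static class PairCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      ctx.write(new Text(f[0] + "\t" + f[1]), ONE);
    }
  }

  // Job 2, Reducer2: count how many hashes each pair shares and keep the pairs
  // that meet the (made-up) "similarity.threshold" setting.
  public static class ThresholdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int shared = 0;
      for (IntWritable c : counts) shared += c.get();
      int threshold = ctx.getConfiguration().getInt("similarity.threshold", 2);
      if (shared >= threshold) {
        ctx.write(pair, new IntWritable(shared));
      }
    }
  }
}

A driver would run these as two chained Jobs, pointing the second job's input path at the first job's output directory; the final output is one "DOCID1 <tab> DOCID2 <tab> shared-hash-count" line per pair at or above the threshold.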
If you have no such assurance, make Mapper1 a map-only job, and replace Reducer1 with a new job that joins by HASH. Joins are not standardized in MR but can be done with MultipleInputs, and of course Pig has this built in. Searching on "Hadoop join" will give you some ideas of how to implement it in straight MR; a rough skeleton is below.
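If it helps, a bare-bones reduce-side join with MultipleInputs looks roughly like the following (Hadoop 2.x-style API; the class names are made up, and it assumes both inputs are tab-separated "HASH <tab> DOCID" files, e.g. the output of the map-only job above):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HashJoin {

  // Each input gets its own mapper; both key on HASH and tag the value with
  // its side ("L" or "R") so the reducer can tell the two inputs apart.
  public static class LeftMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");             // HASH <tab> DOCID
      ctx.write(new Text(f[0]), new Text("L\t" + f[1]));
    }
  }

  public static class RightMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      ctx.write(new Text(f[0]), new Text("R\t" + f[1]));
    }
  }

  // The reducer receives all tagged values for one HASH and pairs the two sides.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hash, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      List<String> left = new ArrayList<String>();
      List<String> right = new ArrayList<String>();
      for (Text v : vals) {
        String[] f = v.toString().split("\t", 2);
        (f[0].equals("L") ? left : right).add(f[1]);
      }
      for (String l : left)
        for (String r : right)
          ctx.write(new Text(l + "\t" + r), hash);           // DOCID1 <tab> DOCID2 -> HASH
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "join by hash");
    job.setJarByClass(HashJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, LeftMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RightMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that this naive version still buffers the per-key value lists in the reducer, so by itself it does not escape the memory assumption; the usual fixes are a secondary sort so that one side streams, or letting Pig/Hive pick a join strategy for you.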

John
From: parnab kumar [mailto:parnab.2007@gmail.com]
Sent: Friday, June 14, 2013 8:06 AM
To: user@hadoop.apache.org
Subject: How to design the mapper and reducer for the following problem

An input file where each line corresponds to a document. Each document is identified by some fingerprints. For example, a line in the input file is of the following form:

input:
---------------------
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The output of the mapreduce job should write the pairs of DOCIDs which share a threshold number of HASHes in common.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5