From: John Lilley <john.lilley@redpoint.net>
To: user@hadoop.apache.org
Subject: RE: Binary Search in map reduce
Date: Tue, 8 Jan 2013 00:11:37 +0000
Let's call these "the graph" and "the changes".

Will both the graph and the changes fit into memory?
Yes -> You do not have a Hadoop-scale problem. Just write some code using a HashTable or Dictionary.

Will the graph fit into memory once it is partitioned amongst all of the nodes?
Yes -> You can get away without a join. Partition the graph and the changes as below, but instead of doing a join on each partition, stream the changes against the graph partition in memory, using a HashTable for the graph partition.

Otherwise, you can do this in a few steps. Realize that you are doing a parallel join. A parallel join can be done in Hadoop by a simple modulo of the keys of the graph and the changes. So first, create a couple of MR jobs just to partition "the graph" and "the changes" into N buckets using (key % N). I *think* this is pretty straightforward, because if your mapper adds new_key = (key % N) to the tuple and you use N reducers, you get this behavior automatically (is it really that simple? someone with more MR expertise please correct me...). Once the graph and the changes are partitioned, run another MR job to (1) join each graph partition file to the corresponding changes partition file, (2) process the changes into the graph, and (3) write out the resulting graph. This part is not a parallel join; it is a bunch of independent simple joins. Finally, merge the resulting graphs together.

You may find that it isn't even this easy.
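Outside of Hadoop, the overall flow — partition both datasets by key % N, hash-join each partition pair in memory, then merge — can be sketched in plain Python. This is only an illustration of the logic, not MR code; the merge rule (appending change values to the graph's value list) is an assumption about what "process the changes into the graph" means here:

```python
from collections import defaultdict

def partition(records, n):
    """Bucket (key, values) pairs by key % n -- the role that the
    mapper's new_key = key % n plus n reducers plays in the MR jobs."""
    buckets = defaultdict(list)
    for key, values in records:
        buckets[key % n].append((key, values))
    return buckets

def join_partition(graph_part, changes_part):
    """Simple hash join of one graph bucket against the matching
    changes bucket: build a HashTable on the graph side, then stream
    the changes against it (add to existing keys, create missing ones)."""
    table = {}
    for key, values in graph_part:
        table[key] = list(values)
    for key, values in changes_part:
        table.setdefault(key, []).extend(values)  # add/modify or create
    return table

def parallel_join(graph, changes, n=4):
    """Partition both sides, run the independent per-bucket joins,
    then merge the per-bucket results into one graph."""
    g, c = partition(graph, n), partition(changes, n)
    merged = {}
    for bucket in range(n):  # each iteration is an independent simple join
        merged.update(join_partition(g.get(bucket, []), c.get(bucket, [])))
    return merged
```

Because a key lands in exactly one bucket on each side, the per-bucket joins never see overlapping keys, which is what makes that middle step embarrassingly parallel.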
If nothing fits into memory and you must perform a non-trivial graph traversal for each change record, you have something much harder to do.

FYI, top Google results for joins in Hadoop here: https://www.google.com/search?q=joins+in+hadoop&aq=f&oq=joins+in+hadoop&aqs=chrome.0.57j60l2j0l2j62.670&sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8

john

From: jamal sasha [mailto:jamalshasha@gmail.com]
Sent: Monday, January 07, 2013 4:43 PM
To: user@hadoop.apache.org
Subject: Re: Binary Search in map reduce

Hi
Thanks for the reply. So here is the intent.
I process some data, and the output of that processing is this set of JSON documents outputting {key: [values]} (this is essentially a form of graph where each entry is an edge).
Now.. I process a different set of data, and the idea is to modify the existing document based on this new data.
If the key is present, then add/modify values.
Else... create a new key: [values] JSON object and save.

So, the first step is checking whether the key is present or not..
So that's why I thought of doing the binary search.
Any suggestions?
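The add-or-create rule described in the question reduces to a dictionary upsert rather than a binary search once the graph partition sits in a HashTable — a minimal sketch (the function name and in-memory dict representation are illustrative assumptions):

```python
def apply_change(graph, key, values):
    """Apply one change record to an in-memory graph partition.

    If the key is present, extend its value list (add/modify);
    otherwise create a new key: [values] entry. Either way the
    presence check is a single hash lookup, not a search.
    """
    if key in graph:
        graph[key].extend(values)   # add/modify existing values
    else:
        graph[key] = list(values)   # create new key: [values] entry
    return graph
```

Streaming every change record through `apply_change` against the matching graph partition is exactly the no-join variant described above for the case where each partition fits in memory.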