From: Mahesh Balija <balijamahesh.mca@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 8 Jan 2013 08:56:42 +0530
Subject: Re: Binary Search in map reduce

Hi Jamal,

Another simple approach, if your data is too huge to fit into memory, is to use the MultipleInputs mechanism: your MR job has two mappers, one emitting records from "the graph" file and the other from the changes file.

Your reducer will then aggregate the records that share the same key (the graph key and the changes key). In order to know which file a record was emitted from, use the graph key as the output key in both mappers, but emit a MapWritable as the mapper value, where the key in the MapWritable is a constant (say 1 -> the graph and 2 -> changes) and the value is the actual value.

Now the only thing left is to append your changes to the actual key and emit the final result.

Best,
Mahesh Balija,
Calsoft Labs.

On Tue, Jan 8, 2013 at 5:47 AM, jamal sasha wrote:
> awesome.
> thanks
>
> On Mon, Jan 7, 2013 at 4:11 PM, John Lilley wrote:
>
>> Let's call these "the graph" and "the changes".
>>
>> Will both the graph and the changes fit into memory?
>>
>> Yes -> You do not have a Hadoop-scale problem.
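A minimal local sketch (in Python, just to illustrate the data flow, not runnable Hadoop code) of the tagged two-mapper join described above; the sample graph/changes data and function names are invented, and in the real job the tag would be the MapWritable key:

```python
from collections import defaultdict

# Hypothetical inputs: "the graph" as key -> [values], plus a changes
# file keyed the same way. Both are illustrative, not from the thread.
graph = {"a": ["b", "c"], "d": ["e"]}
changes = {"a": ["f"], "x": ["y"]}

def map_tagged(records, tag):
    # Each of the two mappers emits (key, (tag, value)); tag 1 = graph,
    # tag 2 = changes, mirroring the MapWritable constant.
    for key, values in records.items():
        for v in values:
            yield key, (tag, v)

def shuffle(pairs):
    # Stand-in for the MR shuffle: group all tagged values by key.
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups

def reduce_merge(groups):
    # The reducer sees every tagged value for a key; sorting by tag puts
    # the graph's values first, then appends the changes.
    out = {}
    for key, tagged in groups.items():
        out[key] = [v for tag, v in sorted(tagged)]
    return out

pairs = list(map_tagged(graph, 1)) + list(map_tagged(changes, 2))
result = reduce_merge(shuffle(pairs))
```

With the sample data, key "a" ends up with its graph values followed by the appended change, and the change-only key "x" becomes a new entry.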
Just write some code
>> using a HashTable or Dictionary.
>>
>> Will the graph fit into memory once it is partitioned amongst all of the
>> nodes?
>>
>> Yes -> You can get away without a join. Partition the graph and the
>> changes like below, but instead of doing a join on each partition, stream
>> the changes against the graph partition in memory, using a HashTable for
>> the graph partition.
>>
>> Otherwise, you can do this in a few steps. Realize that you are doing a
>> parallel join. A parallel join can be done in Hadoop by a simple modulo of
>> the keys of the graph and the changes. So first, create a couple of MR
>> jobs just to partition "the graph" and "the changes" into N buckets using
>> (key % N). I *think* this is pretty straightforward, because if your
>> mapper adds new_key = (key % N) to the tuple and you use N reducers, you get
>> this behavior automatically (is it really that simple? someone with more MR
>> expertise please correct me...). Once the graph and the changes are
>> partitioned, run another MR job to (1) join each graph partition file to
>> the corresponding changes partition file, (2) process the changes into the
>> graph, and (3) write out the resulting graph. This part is not a parallel join;
>> it is a bunch of independent simple joins. Finally, merge the resulting
>> graphs together.
>>
>> You may find that it isn't even this easy.
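The partition-then-join scheme John describes can be simulated locally like this (Python, integer keys and N = 4 are invented for illustration; in Hadoop the bucketing would come from N reducers and the per-bucket join from an in-memory HashTable):

```python
from collections import defaultdict

N = 4  # number of buckets / reducers (illustrative)

# Hypothetical integer-keyed graph and changes.
graph = {10: ["a"], 11: ["b"], 22: ["c"]}
changes = {10: ["d"], 7: ["e"]}

def partition(records, n):
    # First pass: tag each record with new_key = key % n; with n reducers
    # each bucket would land in its own partition file.
    buckets = defaultdict(dict)
    for key, values in records.items():
        buckets[key % n][key] = values
    return buckets

def join_bucket(graph_part, changes_part):
    # Per-partition step: load the graph bucket into a hash table and
    # stream the matching changes bucket against it (an independent
    # simple join, not a parallel one).
    merged = {k: list(v) for k, v in graph_part.items()}
    for key, values in changes_part.items():
        merged.setdefault(key, []).extend(values)
    return merged

graph_parts = partition(graph, N)
change_parts = partition(changes, N)

result = {}
for b in range(N):
    result.update(join_bucket(graph_parts.get(b, {}), change_parts.get(b, {})))
```

Because key 10 and key 22 share bucket 10 % 4 == 22 % 4 == 2, they are joined against the changes for that bucket only; keys never cross buckets, which is what makes the final merge a simple concatenation.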
>> If nothing fits into memory
>> and you must perform a non-trivial graph traversal for each change record,
>> you have something much harder to do.
>>
>> FYI, top Google results for joins in Hadoop here:
>> https://www.google.com/search?q=joins+in+hadoop&aq=f&oq=joins+in+hadoop&aqs=chrome.0.57j60l2j0l2j62.670&sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8
>>
>> john
>>
>> *From:* jamal sasha [mailto:jamalshasha@gmail.com]
>> *Sent:* Monday, January 07, 2013 4:43 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Binary Search in map reduce
>>
>> Hi,
>>
>> Thanks for the reply. So here is the intent.
>>
>> I process some data, and the output of that processing is a set of JSON
>> documents of the form {key: [values]}. (This is essentially a form of graph
>> where each entry is an edge.)
>>
>> Now I process a different set of data, and the idea is to modify the
>> existing documents based on this new data:
>>
>> If the key is present, then add/modify values.
>> Else, create a new {key: [values]} JSON object and save it.
>>
>> So the first step is checking whether the key is present or not.
>> That's why I thought of doing the binary search.
>>
>> Any suggestions?
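For reference, the upsert jamal describes (add/modify values when the key exists, else create a new {key: [values]} object) needs no binary search once the documents are in a hash table; a small Python sketch with invented sample data:

```python
import json

# Hypothetical stored output: one {key: [values]} JSON document per line.
existing = ['{"a": ["b"]}', '{"c": ["d"]}']
# Hypothetical new edges produced by the second processing step.
new_edges = {"a": ["e"], "x": ["y"]}

def apply_changes(lines, updates):
    # Load the documents into a dict, then upsert: append values when
    # the key is present, otherwise create a fresh key -> [values] entry.
    docs = {}
    for line in lines:
        docs.update(json.loads(line))
    for key, values in updates.items():
        docs.setdefault(key, []).extend(values)
    # Re-serialize one document per key, sorted for a stable output order.
    return [json.dumps({k: docs[k]}) for k in sorted(docs)]

updated = apply_changes(existing, new_edges)
```

Here "a" gains the value "e" and "x" appears as a brand-new document, which is exactly the present/absent branch the binary search was meant to decide.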

=A0=A0=A0=A0=A0=A0 Another simple approach if your data is= too huge and cannot fit into memory would be just to use the MultipleInput= s mechanism.
=A0=A0=A0=A0=A0=A0 Where your MR job will have two mappers = one emitting the records from "the graph" file and other from cha= nges file.
=A0=A0=A0=A0=A0=A0
=A0=A0=A0=A0=A0=A0 Any how your reducer will aggrega= te the records based on the same key (the graph key and changes key).
= =A0=A0=A0=A0=A0=A0 In order you to know which record is been emitted from w= hich file you can use key as the graph key for both the mappers but MapWrit= able as your value in mapper where the key in the mapwritable will be some = constant say 1 -> the graph and 2 -> changes and value will be the ac= tual value.

=A0=A0=A0=A0=A0=A0 Now the only thing left for you is to append your ch= anges to the actual key and emit the final result.

Best,
Mahesh B= alija,
Calsoft Labs.

On Tue, Jan 8, 20= 13 at 5:47 AM, jamal sasha <jamalshasha@gmail.com> wrote= :
awesome.
thanks

=
On Mon, Jan 7, 2013 at 4:11 PM, John Lilley <john.lilley@redpoint= .net> wrote:

Let=92s call these =93the= graph=94 and =93the changes=94.

=A0<= /p>

Will both the graph and t= he changes fit into memory?

Yes -> You do not have= a Hadoop-scale problem.=A0 Just write some code using HashTable or Diction= ary.

=A0<= /p>

Will the graph fit into m= emory once it is partitioned amongst all of the nodes?=

Yes -> You can get awa= y without a join.=A0 Partition the graph and the changes like below, but in= stead of doing a join on each partition, stream the changes against the graph partition in memory, using a HashTable for the graph partition.<= u>

=A0<= /p>

Otherwise, you can do thi= s in a few steps.=A0 Realize that you are doing a parallel join.=A0 A paral= lel join can be done in hadoop by a simple modulo of the keys of the graph and the changes.=A0 So first, create a couple of MR jobs just= to partition =93the graph=94 and =93the changes=94 into N buckets using (k= ey%N).=A0 I *think* this is pretty straightforward because if your m= apper adds new_key=3D(key%N) to the tuple and you use N reducers you get this behavior automatically (is it really that = simple? someone with more MR expertise please correct me=85).=A0=A0 Once th= e graph and the changes are partitioned, run another MR job to (1) join eac= h graph partition file to the corresponding changes partition file (2) process the changes into the graph (3) write ou= t the resulting graph.=A0 This part is not a parallel join; it is a bunch o= f independent simple joins.=A0 Finally, merge the resulting graphs together= .=A0

=A0<= /p>

You may find that it isn= =92t even this easy.=A0 If nothing fits into memory and you must perform a = non-trivial graph traversal for each change record, you have something must harder to do.

=A0<= /p>

FYI top google results fo= r joins in Hadoop here: https://www.google.com/search?q=3Djoins+in+hadoop&aq=3Df&oq=3Djoins= +in+hadoop&aqs=3Dchrome.0.57j60l2j0l2j62.670&sugexp=3Dchrome,mod=3D= 14&sourceid=3Dchrome&ie=3DUTF-8

=A0<= /p>

john=

=A0<= /p>

From: jamal sa= sha [mailto:jama= lshasha@gmail.com]
Sent: Monday, January 07, 2013 4:43 PM
To: user= @hadoop.apache.org
Subject: Re: Binary Search in map reduce

=A0

Hi

=A0Thanks for the reply. So here is the intent.

I process some data and output of that processing is= this set of json documents outputting {key:[values]} =A0(This is essential= ly a form of graph where each entry is an edge)

Now.. I process a different set of data and the idea= is to modify the existing document based on this new data.

If the key is present then add/modify values.=

Else... create new key:[values] json object and save= .

=A0

So, the first step is checking whether the key is pr= esent or not..

So thats why I thought of doing the binary search.

Any suggestions?

=A0

=A0



--f46d043bd8febc322804d2be8314--