From: John Lilley <john.lilley@redpoint.net>
To: user@hadoop.apache.org
Subject: RE: Binary Search in map reduce
Date: Tue, 8 Jan 2013 00:11:37 +0000
Let's call these "the graph" and "the changes".

Will both the graph and the changes fit into memory?
Yes -> You do not have a Hadoop-scale problem. Just write some code using a HashTable or Dictionary.

Will the graph fit into memory once it is partitioned amongst all of the nodes?
Yes -> You can get away without a join. Partition the graph and the changes as below, but instead of doing a join on each partition, stream the changes against the graph partition in memory, using a HashTable for the graph partition.

Otherwise, you can do this in a few steps. Realize that you are doing a parallel join. A parallel join can be done in Hadoop by a simple modulo of the keys of the graph and the changes. So first, create a couple of MR jobs just to partition "the graph" and "the changes" into N buckets using (key % N). I *think* this is pretty straightforward, because if your mapper adds new_key = (key % N) to the tuple and you use N reducers, you get this behavior automatically (is it really that simple? someone with more MR expertise please correct me...). Once the graph and the changes are partitioned, run another MR job to (1) join each graph partition file to the corresponding changes partition file, (2) process the changes into the graph, and (3) write out the resulting graph. This part is not a parallel join; it is a bunch of independent simple joins. Finally, merge the resulting graphs together.

You may find that it isn't even this easy.
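Outside of Hadoop, the overall flow — partition both datasets by key % N, hash-join each partition pair in memory, then merge — can be sketched in plain Python. This is only an illustration of the logic, not MR code; the merge rule (appending change values to the graph's value list) is an assumption about what "process the changes into the graph" means here:

```python
from collections import defaultdict

def partition(records, n):
    """Bucket (key, values) pairs by key % n -- the role that the
    mapper's new_key = key % n plus n reducers plays in the MR jobs."""
    buckets = defaultdict(list)
    for key, values in records:
        buckets[key % n].append((key, values))
    return buckets

def join_partition(graph_part, changes_part):
    """Simple hash join of one graph bucket against the matching
    changes bucket: build a HashTable on the graph side, then stream
    the changes against it (add to existing keys, create missing ones)."""
    table = {}
    for key, values in graph_part:
        table[key] = list(values)
    for key, values in changes_part:
        table.setdefault(key, []).extend(values)  # add/modify or create
    return table

def parallel_join(graph, changes, n=4):
    """Partition both sides, run the independent per-bucket joins,
    then merge the per-bucket results into one graph."""
    g, c = partition(graph, n), partition(changes, n)
    merged = {}
    for bucket in range(n):  # each iteration is an independent simple join
        merged.update(join_partition(g.get(bucket, []), c.get(bucket, [])))
    return merged
```

Because a key lands in exactly one bucket on each side, the per-bucket joins never see overlapping keys, which is what makes that middle step embarrassingly parallel.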
If nothing fits into memory and you must perform a non-trivial graph traversal for each change record, you have something much harder to do.

FYI, top Google results for joins in Hadoop here: https://www.google.com/search?q=joins+in+hadoop&aq=f&oq=joins+in+hadoop&aqs=chrome.0.57j60l2j0l2j62.670&sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8

john

From: jamal sasha [mailto:jamalshasha@gmail.com]
Sent: Monday, January 07, 2013 4:43 PM
To: user@hadoop.apache.org
Subject: Re: Binary Search in map reduce

Hi
Thanks for the reply. So here is the intent.
I process some data, and the output of that processing is this set of JSON documents outputting {key: [values]} (this is essentially a form of graph where each entry is an edge).
Now.. I process a different set of data, and the idea is to modify the existing document based on this new data.
If the key is present, then add/modify values.
Else... create a new key: [values] JSON object and save.

So, the first step is checking whether the key is present or not..
So that's why I thought of doing the binary search.
Any suggestions?
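The add-or-create rule described in the question reduces to a dictionary upsert rather than a binary search once the graph partition sits in a HashTable — a minimal sketch (the function name and in-memory dict representation are illustrative assumptions):

```python
def apply_change(graph, key, values):
    """Apply one change record to an in-memory graph partition.

    If the key is present, extend its value list (add/modify);
    otherwise create a new key: [values] entry. Either way the
    presence check is a single hash lookup, not a search.
    """
    if key in graph:
        graph[key].extend(values)   # add/modify existing values
    else:
        graph[key] = list(values)   # create new key: [values] entry
    return graph
```

Streaming every change record through `apply_change` against the matching graph partition is exactly the no-join variant described above for the case where each partition fits in memory.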