Date: Wed, 26 Mar 2008 12:40:48 -0600
From: "Theodore Van Rooy"
To: core-user@hadoop.apache.org
Subject: Re: Hadoop: Multiple map reduce or some better way

In my experience the advice above is good: the less reading and writing you have to do at each step, the better. While you could do map | reduce | map | reduce as you are proposing, perhaps you could try several maps in a row, i.e.

map | (no reduce) -> map | reduce

Also, if you consider how Hadoop Streaming works, you might just write a script in Python (or whatever) that does

stdin | map (everything you want to do, in one script) | reduce (aggregate the results of the map script)

Because your data set is spread out over some number of blocks, you may be able to gain more parallel speedup by simply doing everything you want in one step and then aggregating it with a reduce. Though this somewhat sidesteps the MapReduce paradigm, it achieves the benefit of using Hadoop to handle the distribution of the tasks and the pieces of the file.
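To make that concrete, here is a rough sketch of what the "one combined map script" idea could look like with Hadoop Streaming in Python. The stop-word list, the use of the map_input_file environment variable as a document id, and the output format are assumptions for illustration, not something taken from the earlier posts.

#!/usr/bin/env python
# index_mapper.py -- sketch of a single streaming mapper that tokenizes,
# drops stop words, and emits word<TAB>docid pairs in one pass over stdin.
import os
import sys

STOP_WORDS = set(["a", "an", "and", "of", "the", "to"])  # assumed sample list

# Streaming exposes job properties as environment variables; map_input_file
# carries the input file name and stands in for a document id here.
doc_id = os.environ.get("map_input_file", "unknown")

for line in sys.stdin:
    for word in line.strip().lower().split():
        word = word.strip(".,;:!?\"'()")
        if word and word not in STOP_WORDS:
            print("%s\t%s" % (word, doc_id))

#!/usr/bin/env python
# index_reducer.py -- sketch of the aggregating reducer. Streaming hands it
# the mapper output sorted by key, so it can group consecutive lines by word
# and emit word<TAB>comma-separated-doc-list.
import sys

current_word, docs = None, []
for line in sys.stdin:
    word, _, doc_id = line.rstrip("\n").partition("\t")
    if word != current_word and current_word is not None:
        print("%s\t%s" % (current_word, ",".join(sorted(set(docs)))))
        docs = []
    current_word = word
    docs.append(doc_id)
if current_word is not None:
    print("%s\t%s" % (current_word, ",".join(sorted(set(docs)))))

You would launch it with something like: hadoop jar hadoop-streaming.jar -input docs/ -output index/ -mapper index_mapper.py -reducer index_reducer.py -file index_mapper.py -file index_reducer.py (the paths here are assumptions as well).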
On Wed, Mar 26, 2008 at 12:19 PM, Arun C Murthy wrote:
>
> On Mar 26, 2008, at 11:05 AM, Arun C Murthy wrote:
>
> > On Mar 26, 2008, at 9:39 AM, Aayush Garg wrote:
> >
> >> Hi,
> >> I am developing a simple inverted index program with Hadoop. My map
> >> function has the output:
> >>
> >> and the reducer has:
> >>
> >> Now I want to use one more MapReduce pass to remove stop words and
> >> scrub words from this output. Also, in the next stage I would like to
> >> have a short summary associated with every word. How should I design
> >> my program from this stage? I mean, how would I apply multiple
> >> MapReduce passes to this? What would be the better way to perform
> >> this?
> >>
> >
> > In general you are better off with a smaller number of Map-Reduce
> > jobs ... less I/O works better.
> >
>
> I forgot to add that you can use the APIs in JobClient and JobControl
> to chain jobs together ...
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#JobControl
>
> Arun
>
> > Use the DistributedCache if you can and fix your first Map to not
> > emit the stop words at all. Use the combiner to crunch down the
> > amount of intermediate map outputs, etc.
> >
> > Something useful to look at:
> > http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
> >
> > Arun
> >
> >> Thanks,
> >>
> >> Regards,
> >> -
> >> Aayush Garg,
> >> Phone: +41 76 482 240
> >

-- 
Theodore Van Rooy
http://greentheo.scroggles.com
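For anyone following the JobClient/JobControl links in Arun's reply: those APIs chain jobs from a Java driver. On the streaming side, a rough equivalent is a small driver script that simply runs one streaming job after another; the jar location, HDFS paths, and script names below are assumptions for illustration, not part of the thread.

#!/usr/bin/env python
# chain_jobs.py -- sketch of chaining two streaming jobs, as a streaming-side
# stand-in for the JobClient/JobControl approach mentioned above. The second
# job can only start after the first finishes, since it reads its output.
import subprocess
import sys

STREAMING_JAR = "contrib/streaming/hadoop-streaming.jar"  # assumed location

def run_streaming(input_path, output_path, mapper, reducer, files=()):
    cmd = ["hadoop", "jar", STREAMING_JAR,
           "-input", input_path, "-output", output_path,
           "-mapper", mapper, "-reducer", reducer]
    for f in files:
        cmd += ["-file", f]  # ship local scripts alongside the job
    if subprocess.call(cmd) != 0:
        sys.exit("job failed: %s -> %s" % (input_path, output_path))

# Job 1: build the raw index (mapper drops stop words, reducer aggregates).
run_streaming("docs/", "index-raw/", "index_mapper.py", "index_reducer.py",
              files=["index_mapper.py", "index_reducer.py"])

# Job 2: read job 1's output and attach a short summary to every word.
run_streaming("index-raw/", "index-final/",
              "summary_mapper.py", "summary_reducer.py",
              files=["summary_mapper.py", "summary_reducer.py"])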