From common-dev-return-72264-apmail-hadoop-common-dev-archive=hadoop.apache.org@hadoop.apache.org Tue Jul 07 08:14:10 2009 Return-Path: Delivered-To: apmail-hadoop-common-dev-archive@www.apache.org Received: (qmail 39579 invoked from network); 7 Jul 2009 08:14:09 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 7 Jul 2009 08:14:09 -0000 Received: (qmail 43156 invoked by uid 500); 7 Jul 2009 08:14:18 -0000 Delivered-To: apmail-hadoop-common-dev-archive@hadoop.apache.org Received: (qmail 43056 invoked by uid 500); 7 Jul 2009 08:14:18 -0000 Mailing-List: contact common-dev-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-dev@hadoop.apache.org Delivered-To: mailing list common-dev@hadoop.apache.org Received: (qmail 43045 invoked by uid 99); 7 Jul 2009 08:14:18 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 07 Jul 2009 08:14:18 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jason.hadoop@gmail.com designates 209.85.212.175 as permitted sender) Received: from [209.85.212.175] (HELO mail-vw0-f175.google.com) (209.85.212.175) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 07 Jul 2009 08:14:07 +0000 Received: by vwj5 with SMTP id 5so577556vwj.5 for ; Tue, 07 Jul 2009 01:13:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=3S2/72pW3YiH5C63yQCa42sOqrL7XQ+P/ARPw/wOyjA=; b=OUcSNFBE77U/ZQ6hKZqKFoiFufaGVxlrHu1UY4Otcncxj1WVfjxpgOV8uHUDG0X6dQ uSvAgGFLlcG8nZ8b1lSvpfqOHyfaMunA8F4eM1Td7Gk+85q5HInJoGcR0Tb2dBBgJFQG Ba2j6mgo0YXi/d/srfwdSwo4//cKRA3M/hL5Q= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=Bd7IFKTsQiJ5cfd9c40S8AXReR+Hp5FppZ1WlwqvK97UwUgK5in8PVCYDPxENYVDT1 UGQqDjRIJ40UxgQ8PmwVJ47FRi91XPVxPUr/NXLlAcrjk2U4rbBVt2MvfSEBIveW66AQ 8FDjwD0e0aprbqnDHeNuEJJDrY2D3IX9jiYjw= MIME-Version: 1.0 Received: by 10.220.93.65 with SMTP id u1mr11658581vcm.59.1246954426235; Tue, 07 Jul 2009 01:13:46 -0700 (PDT) In-Reply-To: <4A52F989.3020607@cloudera.com> References: <24360327.post@talk.nabble.com> <4A52F989.3020607@cloudera.com> Date: Tue, 7 Jul 2009 01:13:46 -0700 Message-ID: <314098690907070113ga2172f6v3a3c21c7475b78d1@mail.gmail.com> Subject: Re: Need help understanding the source From: jason hadoop To: common-dev@hadoop.apache.org Content-Type: multipart/alternative; boundary=0016364ed386a07a7e046e19338c X-Virus-Checked: Checked by ClamAV on apache.org --0016364ed386a07a7e046e19338c Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit If your constraints are loose enough, you could consider using the chain mapping that became available in 19, and have multiple mappers for your jobs. The extra mappers only receive the output of the prior map in the chain and if I remember correctly, the combiner is run at the end of the chain of mappers, when a reduce is scheduled. The other alternative you may try is simply to write your map outputs to HDFS [ie: setNumReduces(0)], and have a consumer pick up the map outputs as they appear. If the life of the files is short and you can withstand data loss, you may turn down the replication factor, to speed the writes. The Map/Reduce framework is carefully constructed to provide a completely sorted input set to the reducer, as that is part of the fundamental contract. On Tue, Jul 7, 2009 at 12:30 AM, Amr Awadallah wrote: > To add to Todd/Ted's wise words, the Hadoop (and MapReduce) architects > didn't impose this limitation just for fun, it is very core to enabling > Hadoop to be as reliable as it is. If the reducer starts processing mapper > output immediately and a specific mapper fails then the reducer would have > to know how to undo the specific pieces of work related to the failed > mapper, not trivial at all. That said, the combiners do achieve a bit of > that for you, as they start working immediately on the map out, but on a > per-mapper basis (not global), so easy to handle failure in that case (you > just redo that mapper and the combining for it). > > -- amr > > > Ted Dunning wrote: > >> I would consider this to be a very delicate optimization with little >> utility >> in the real world. It is very, very rare to reliably know how many >> records >> the reducer will see. Getting this wrong would be a disaster. Getting it >> right would be very difficult in almost all cases. >> >> Moreover, this assumption is baked all through the map-reduce design and >> thus doing a change to allow reduce to go ahead is likely to be really >> tricky (not that I know this for a fact). >> >> >> On Mon, Jul 6, 2009 at 11:14 AM, Naresh Rapolu < >> nareshreddy.rapolu@gmail.com >> >> >>> wrote: >>> >>> >> >> >> >>> My aim is to make the reduce move ahead with reduction as and when it >>> gets >>> the data required, instead of waiting for all the maps to complete. If >>> it >>> knows how many records it needs and compares it with number of records it >>> has got until now, it can move on once they become equal without waiting >>> for all the maps to finish. >>> >>> >>> >> >> >> > -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals --0016364ed386a07a7e046e19338c--