Mailing-List: contact hadoop-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: local policy)
DomainKey-Signature: a=rsa-sha1; s=serpent; d=yahoo-inc.com; c=nofws; q=dns;
	h=mime-version:in-reply-to:references:content-type:message-id:
	content-transfer-encoding:from:subject:date:to:x-mailer;
	b=INDxe73rv47kCmg7BzoZqRqvJ+7CYXBrj9vThBMQ6VNo1OzpnML9f2zflyVvRdUm
Mime-Version: 1.0 (Apple Message framework v752.3)
In-Reply-To: <C2F4B5C5.1A57E%tdunning@veoh.com>
References: <C2F4B5C5.1A57E%tdunning@veoh.com>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Message-Id: <BB83A4DE-0CAC-43E0-8F5D-BA33C62808BD@yahoo-inc.com>
Content-Transfer-Encoding: 7bit
From: Eric Baldeschwieler <eric14@yahoo-inc.com>
Subject: Re: Poly-reduce?
Date: Fri, 24 Aug 2007 16:31:35 -0700
To: <hadoop-user@lucene.apache.org>

especially at scale!

And we are testing on >1000 node clusters with long jobs.  We see  
lots of failures per job.

On Aug 24, 2007, at 4:20 PM, Ted Dunning wrote:

>
>
>
> On 8/24/07 12:11 PM, "Doug Cutting" <cutting@apache.org> wrote:
>
> >
> >> Using the same logic, streaming reduce outputs to
> >> the next map and reduce steps (before the first reduce is complete)
> >> should also provide speedup.
> >
> > Perhaps, but the bookeeping required in the jobtracker might be  
> onerous.
> >   The failure modes are more complex, complicating recovery.
>
> Frankly, I find Doug's arguments about reliability fairly compelling.
>
> Map-reduce^n is not the same, nor is it entirely analogous to pipe- 
> style
> programming.  It feels the same, but there are very important  
> differences
> that I wasn't thinking of when I made this suggestion.  The most  
> important
> is the issue of reliability.  In a large cluster, failure is a  
> continuous
> process, not an isolated event.  As such, the problems of having to  
> roll
> back an entire program due to node failure are not something that  
> can be
> treated as unusual.  That makes Doug's comments about risk more on- 
> point
> than the potential gains.  It is very easy to imagine scenarios  
> where the
> possibility of program roll-back results in very large average run- 
> times
> while the chained reduce results in only incremental savings.  This  
> isn't a
> good thing to bet on.
>
>