hadoop-common-dev mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Date Mon, 03 Apr 2006 15:51:04 GMT
An observation...  this whole thread is about limits caused by type  
safety.  Interestingly, the other implementation of map-reduce does  
not support types at all.  Everything is a string.

So I agree that our departure from the paper is the problem.  ;-)

I'm comfortable letting this lie for a while.  But I predict we've  
not heard the last of it.

On Apr 2, 2006, at 10:29 PM, Doug Cutting wrote:

> Runping Qi wrote:
>> The argument for using local combiners is interesting. To me, the combiner
>> class is just another layer of transformer.  It does not mean that the
>> combiner class has to be the same as the reducer class. The only criterion
>> is that they meet the associative rule: let L1, L2, ..., Ln and
>> K1, K2, ..., Km be two partitions of S; then
>>     Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
>>     Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km)))
>> are the same.
>> A special (maybe very common) scenario is that the combiner and the
>> reducer are the same class and the reduce function is associative.
>> However, this need not be the case in general. And the class of the
>> reduce outputs need not be the same as that of the combiner outputs, if
>> the combiner and the reducer are not the same class.
> This indeed may be an intriguing generalization of the MapReduce
> model.  But it does add more possible failure modes.  At present we
> have far too few unit tests for the existing, simpler MapReduce
> model, and the platform is still shaky.  Thus I am reluctant to
> spend a lot of time extending the model in ways that are not
> absolutely essential.
> My goal is for Hadoop to be widely used.  I do not feel that the  
> power of the MapReduce model is currently a primary bottleneck to  
> wider adoption.  The larger issues we face are performance,  
> reliability, scalability and documentation.
> If I am to commit a patch, then I must feel that I can support and  
> maintain it, that it fits within my priorities.  Otherwise, if it  
> causes problems that I don't have time to attend to (even if this  
> only means reviewing and testing fixes submitted by others) then  
> the quality of the system will decrease, a direction we must avoid.
> Currently we have just four committers on Hadoop.  For Mike and  
> Andrzej, Nutch is a secondary effort.  Owen has been voted in as a  
> Hadoop committer, but his paperwork is not yet complete.  So I am  
> the bottleneck.  I spend a lot of time on annoying yet critical  
> issues like making sure that recent extensions to Hadoop don't  
> break Nutch running in pseudo-distributed mode on Windows.
> I don't particularly like things this way, but that's where we are  
> right now.  The best way to get out of here is for folks who'd like  
> to be committers to submit high-quality, well-documented,
> well-formatted, non-disruptive, unit-test-bearing patches that are easy
> for me to apply and make Hadoop easier to use and more reliable,  
> thus earning points towards becoming committers.  If we have more  
> committers then we should be able to advance with confidence on  
> more fronts in parallel.
> Doug
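
Runping's partition condition quoted above can be tried out on a small example. What follows is only a sketch, not Hadoop code: the names `combiner`, `reducer`, and `run`, and the sample data, are hypothetical. It assumes a combiner that emits (sum, count) pairs and a reducer that emits a mean, so the combiner and the reducer are different functions with different output types, yet the result is the same for two different partitions of S.

```python
from typing import List, Tuple


def combiner(values: List[int]) -> Tuple[int, int]:
    """Hypothetical combiner: collapse one partition into a (sum, count) pair."""
    return (sum(values), len(values))


def reducer(partials: List[Tuple[int, int]]) -> float:
    """Hypothetical reducer: merge (sum, count) pairs and emit the mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def run(partitions: List[List[int]]) -> float:
    """Compute Reduce(list(Combiner(P1), ..., Combiner(Pn)))."""
    return reducer([combiner(p) for p in partitions])


# Two different partitions of the same set S = [3, 5, 8, 13, 21].
partition_l = [[3, 5], [8, 13, 21]]
partition_k = [[3], [5, 8], [13, 21]]

result_l = run(partition_l)
result_k = run(partition_k)

# The rule holds for this pair of partitions if both results agree.
print(result_l == result_k)
```

Using the mean itself as the combiner would fail this check, since a mean of per-partition means depends on how S is split.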
