hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Date Mon, 03 Apr 2006 05:29:09 GMT
Runping Qi wrote:
> The argument of using local combiners is interesting. To me, combiner class
> is just another layer of transformer.  It does not mean that the combiner
> class has to be the same as the reducer class. The only criterion is that
> they satisfy the associative rule:
> 	Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S, then
> 	Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
> 	Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
> same.
> A special (maybe very common) scenario is that the combiner and reducer are
> the same class and the reduce function is associative. However, this need
> not be the case in general. And the class of the reduce outputs need not be
> the same as that of the combiner outputs, if the combiner and the reducer
> are not the same class.
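
To make the quoted condition concrete, here is a minimal sketch in plain Python. It is not Hadoop code, and the names combiner, reducer, partition_l and partition_k are hypothetical, chosen only for illustration. It computes a mean: the combiner emits (sum, count) pairs while the reducer emits a single number, so the two are different functions with different output types, yet the result does not depend on how S is partitioned.

```python
from typing import List, Tuple


def combiner(values: List[float]) -> Tuple[float, int]:
    """Collapse one partition's values into a (sum, count) partial result."""
    return (sum(values), len(values))


def reducer(partials: List[Tuple[float, int]]) -> float:
    """Merge the (sum, count) partials from all partitions into a mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# S is the full list of values for one key (made-up sample data).
S = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Two different partitions of the same S.
partition_l = [S[:2], S[2:]]
partition_k = [S[:1], S[1:4], S[4:]]

result_l = reducer([combiner(p) for p in partition_l])
result_k = reducer([combiner(p) for p in partition_k])

# Both partitions must yield the same final value.
print(result_l == result_k)
```

Using the mean itself as the combiner would break the condition, since a mean of partial means depends on the partition sizes.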

This indeed may be an intriguing generalization of the MapReduce 
model.  But it does add more possible failure modes.  At present we have 
far too few unit tests for the existing, simpler MapReduce model, and 
the platform is still shaky.  Thus I am reluctant to spend a lot of 
time extending the model in ways that are not absolutely essential.

My goal is for Hadoop to be widely used.  I do not feel that the power 
of the MapReduce model is currently a primary bottleneck to wider 
adoption.  The larger issues we face are performance, reliability, 
scalability and documentation.

If I am to commit a patch, then I must feel that I can support and 
maintain it, and that it fits within my priorities.  Otherwise, if it 
causes problems that I don't have time to attend to (even if this only 
means reviewing and testing fixes submitted by others) then the quality 
of the system will decrease, a direction we must avoid.

Currently we have just four committers on Hadoop.  For Mike and Andrzej, 
Nutch is a secondary effort.  Owen has been voted in as a Hadoop 
committer, but his paperwork is not yet complete.  So I am the 
bottleneck.  I spend a lot of time on annoying yet critical issues like 
making sure that recent extensions to Hadoop don't break Nutch running 
in pseudo-distributed mode on Windows.

I don't particularly like things this way, but that's where we are right 
now.  The best way to get out of here is for folks who'd like to be 
committers to submit high-quality, well-documented, well-formatted, 
non-disruptive, unit-test-bearing patches that are easy for me to apply 
and make Hadoop easier to use and more reliable, thus earning points 
towards becoming committers.  If we have more committers then we should 
be able to advance with confidence on more fronts in parallel.

