hadoop-general mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: Introducing Cloud MapReduce
Date Wed, 02 Dec 2009 07:39:52 GMT

On Nov 27, 2009, at 9:41 AM, Bruce Snyder wrote:

> 1) It is faster than other implementations (e.g., 60 times faster than
> Hadoop in one case. Speedup depends on the application and data.).

Based on your report, it looks like your comparison was done using
1.2GB of data in 92,000 files. With the default configuration, you'll
end up with 92,000 maps, each of them processing a very small input.
You need to use either MultiFileInputFormat or Hadoop Archives to
bundle your tiny files into reasonably sized inputs. For input data
that small, you should not have more than 8 maps unless each map is
doing a lot of CPU work.
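
To make that concrete, here is a minimal sketch of the
MultiFileInputFormat route against the old org.apache.hadoop.mapred
API; the WholeSmallFileInputFormat and WholeFileReader names are mine,
and it reads each small file whole as a single record keyed by its path:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.MultiFileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Packs many small files into each split; each record is one whole
// file. Illustrative names, not Hadoop classes.
public class WholeSmallFileInputFormat extends MultiFileInputFormat<Text, Text> {

  public RecordReader<Text, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    return new WholeFileReader((MultiFileSplit) split, job);
  }

  static class WholeFileReader implements RecordReader<Text, Text> {
    private final MultiFileSplit split;
    private final JobConf job;
    private int index = 0;

    WholeFileReader(MultiFileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(Text key, Text value) throws IOException {
      if (index >= split.getNumPaths()) return false;
      Path path = split.getPath(index);
      // Files are assumed tiny, so reading each one whole is fine.
      byte[] contents = new byte[(int) split.getLength(index)];
      FSDataInputStream in = path.getFileSystem(job).open(path);
      try {
        in.readFully(0, contents);
      } finally {
        in.close();
      }
      key.set(path.toString());
      value.set(contents, 0, contents.length);
      index++;
      return true;
    }

    public Text createKey() { return new Text(); }
    public Text createValue() { return new Text(); }
    // Simplification: report the file index rather than a byte offset.
    public long getPos() { return index; }
    public void close() { }
    public float getProgress() {
      return split.getNumPaths() == 0 ? 1.0f : (float) index / split.getNumPaths();
    }
  }
}

Registering it with conf.setInputFormat(WholeSmallFileInputFormat.class)
and setting conf.setNumMapTasks(8) gives the input format the hint to
pack the 92,000 files into roughly 8 splits.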

In your reduce push iterator model, you give up a lot of determinism
in your application. Many applications that run on Hadoop depend on
the reduces getting keys in sorted order; getting keys in a random
order won't work for them. Furthermore, I don't see any way that you
can scale up your approach: you effectively need to hold open the
reduce state for every key you've seen, and you can't close any of
them until the input is exhausted. That will quickly exhaust memory.
So the push model will work for something like word count, but fail
on large problems.
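
To make the memory point concrete, here is a toy comparison in plain
Java (the KV record type and the summing reduce are just stand-ins)
of the state each model has to hold:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class KV {
  final String key;
  final long value;
  KV(String key, long value) { this.key = key; this.value = value; }
}

class ReduceModels {
  // Pull model: the framework hands over keys in sorted order, so the
  // reduce holds state for exactly one key at a time.
  static void pull(Iterator<KV> sortedByKey) {
    String current = null;
    long sum = 0;
    while (sortedByKey.hasNext()) {
      KV kv = sortedByKey.next();
      if (!kv.key.equals(current)) {
        if (current != null) emit(current, sum);  // this key is finished
        current = kv.key;
        sum = 0;
      }
      sum += kv.value;
    }
    if (current != null) emit(current, sum);
  }

  // Push model: records arrive in arbitrary key order, so every
  // distinct key's state stays open until the whole input is consumed.
  static void push(Iterator<KV> arbitraryOrder) {
    Map<String, Long> open = new HashMap<String, Long>(); // grows with #keys
    while (arbitraryOrder.hasNext()) {
      KV kv = arbitraryOrder.next();
      Long sum = open.get(kv.key);
      open.put(kv.key, (sum == null ? 0L : sum.longValue()) + kv.value);
    }
    for (Map.Entry<String, Long> e : open.entrySet()) {
      emit(e.getKey(), e.getValue().longValue());
    }
  }

  static void emit(String key, long value) {
    System.out.println(key + "\t" + value);
  }
}

In the pull version one accumulator is live at any moment; in the push
version the open map grows with the number of distinct keys, which is
exactly what exhausts memory on large key spaces.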

When you add types in, you should add them as:

K1,V1 -> map -> K2,V2 -> reduce -> K3,V3

with the combiner taking:

K2,V2 -> combiner -> K2,V2

Of course, the types are often not primitives, and using something
like Avro for serialization is a good thing.
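
Spelled out in the old org.apache.hadoop.mapred interfaces, with
word-count types as a stand-in (K1,V1 = LongWritable,Text and
K2,V2 = K3,V3 = Text,IntWritable):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// K1,V1 -> K2,V2
class WordCountMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  public void map(LongWritable k1, Text v1,
      OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    for (String word : v1.toString().split("\\s+")) {
      if (word.length() == 0) continue;
      out.collect(new Text(word), ONE);
    }
  }
}

// K2,V2 -> K3,V3; since K3,V3 == K2,V2 here, the same class also
// satisfies the combiner's K2,V2 -> K2,V2 contract.
class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text k2, Iterator<IntWritable> v2s,
      OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (v2s.hasNext()) {
      sum += v2s.next().get();
    }
    out.collect(k2, new IntWritable(sum));
  }
}

Because K3,V3 matches K2,V2 in this example, WordCountReduce can be
registered as both the combiner and the reducer
(conf.setCombinerClass and conf.setReducerClass).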

-- Owen
