hadoop-general mailing list archives

From <huan....@accenture.com>
Subject RE: Introducing Cloud MapReduce
Date Thu, 03 Dec 2009 17:59:09 GMT

Great questions. You are hitting some key points; answers are inline.

> > 1) It is faster than other implementations (e.g., 60 times faster
> than
> > Hadoop in one case. Speedup depends on the application and data.).
> Based on your report, it looks like your comparison was done using
> 1.2GB of data in 92,000 files. If you took the default configuration,
> you'll end up with 92,000 maps, each of them processing a very small
> input. You either need to use MultiFileInputFormat or use Hadoop
> Archives to bundle your tiny files into reasonable size. For input
> data that small you should not have more than 8 maps unless each map
> is doing a lot of CPU work.

We used the default configuration because we were not sure how to optimize it. We can certainly
rerun the test with your suggestion. The point is not the 60x speedup itself, but a potential
bottleneck in the master/slave architecture. When you scale up the number of slave nodes and the
number of tasks, you will run into the same problem even with a larger chunk size. Lacking access
to a large cluster, we are not able to run an experiment showing that the master node will choke
at some point; our test is essentially a scaled-down version of the same large-scale problem.
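For readers following the small-files point: the fix Owen suggests amounts to packing many tiny inputs into a few large splits. The sketch below is a hypothetical illustration of that packing (it is not Hadoop's actual MultiFileInputFormat code, and the class name is made up); it greedily bins file sizes into splits of roughly a target byte size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of bundling many tiny files into a few large splits,
// the effect MultiFileInputFormat or Hadoop Archives would achieve for the
// 92,000-file case. Not actual Hadoop code.
public class SplitBundler {
    // Greedily packs file sizes (in bytes) into splits of roughly targetBytes each.
    public static List<List<Long>> bundle(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // close the current split once adding another file would overflow it
            if (currentBytes + size > targetBytes && !current.isEmpty()) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }
}
```

With 92,000 files of ~13.7 KB each (about 1.2 GB total) and a 160 MB target, this yields 8 splits instead of 92,000 maps.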

> In your reduce push iterator model, you give up a lot of determinism
> in your application. Many applications that run in Hadoop depend on
> the reduces getting keys in sorted order. Getting keys in a random
> order won't work. 

Sorting at the end after the values have been reduced is much cheaper than sorting the whole
set of intermediate key-value pairs. We will add a function in our implementation to sort
at the end, for those applications requiring ordered output. 
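To make the cost argument concrete, here is a minimal word-count sketch of the "sort at the end" idea (illustrative only, not Cloud MapReduce's actual implementation): the reducer aggregates with an unordered hash map, so only the distinct reduced keys are sorted, not every intermediate key-value pair.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of sorting after reduction (hypothetical, not Cloud MapReduce code):
// keys may arrive in any order (the push model), aggregation is done in a
// hash map, and the final sort touches only |distinct keys| entries rather
// than |all intermediate pairs|.
public class PostSort {
    public static List<Map.Entry<String, Integer>> wordCountSorted(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();   // unordered arrival is fine here
        for (String w : words) counts.merge(w, 1, Integer::sum);
        List<Map.Entry<String, Integer>> out = new ArrayList<>(counts.entrySet());
        out.sort(Map.Entry.comparingByKey());            // sort reduced output only
        return out;
    }
}
```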

> Furthermore, I don't see any way that you can scale
> up your approach. You effectively need to hold open all of the reduce
> states for the keys you've seen and can't close them until it is done.
> Again, that will quickly exhaust memory. So the push model will work
> for something like word count, but fail on large problems.

Our strategy is to use a large number of partitions to limit the number of keys in each partition.
A large number of partitions does not overload our system, so this is easy for us. As I said
in a related thread answering Todd's question, I believe the state held per key is small for most
applications. We would love to be proved wrong, so that we have a real motivation to refine the
architecture to handle that case.
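The many-partitions strategy can be sketched as ordinary hash partitioning (the same idea as Hadoop's default HashPartitioner; the class below is illustrative, not from either codebase): with P partitions, the expected number of distinct keys, and hence open reduce states, any one reducer must hold is roughly |keys| / P.

```java
// Illustrative hash partitioner (same idea as Hadoop's default
// HashPartitioner, not actual Cloud MapReduce code). Spreading keys over
// many partitions bounds the per-reducer memory for open reduce states.
public class KeyPartitioner {
    public static int partition(String key, int numPartitions) {
        // mask to a non-negative value before taking the modulus
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```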

> When you add types in, you should add them in as:
> K1, V1 -> map -> K2,V2 -> reduce -> K3,V3
> with the combiner taking:
> K2,V2 -> combiner -> K2,V2
> Of course the types are often not primitives and using something like
> Avro is a good thing.

Thanks for clarifying. Adding types is on the roadmap now.
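For concreteness, the type flow Owen describes can be written out as a word count (names and signatures below are illustrative, not an actual Cloud MapReduce API): here K1=Long (offset), V1=String (line), K2=String, V2=Integer, K3=String, V3=Integer, and the combiner keeps the same (K2, V2) types on both sides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative word count following K1,V1 -> map -> K2,V2 -> reduce -> K3,V3,
// with a combiner of type K2,V2 -> K2,V2. Not an actual Cloud MapReduce API.
public class TypedWordCount {
    // map: (offset, line) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        return out;
    }

    // combiner: (word, partial counts) -> (word, partial sum); types unchanged
    static Map.Entry<String, Integer> combine(String word, List<Integer> counts) {
        return Map.entry(word, counts.stream().mapToInt(Integer::intValue).sum());
    }

    // reduce: (word, all counts) -> (word, total)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        return Map.entry(word, counts.stream().mapToInt(Integer::intValue).sum());
    }
}
```

In practice the keys and values would be serialized types rather than Java primitives, which is where something like Avro comes in.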



