hadoop-common-user mailing list archives

From brien colwell <xcolw...@gmail.com>
Subject Re: io performance
Date Mon, 30 Nov 2009 22:22:49 GMT

This may be of help: in my experience there is no single bottleneck.
Even your tuple representation can disproportionately impact
performance. Overly granular tuples that repeat redundant values slow
down the shuffle, which performs at least two serialize/deserialize
operations per tuple. Benchmarks on my internal cluster show HDFS seek
times around 25 ms, higher than on a typical physical disk, so I'm under
the impression that the number of input/output files affects IO
performance more than it would on a typical physical disk.
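To make the tuple-granularity point concrete, here is a minimal, Hadoop-free sketch (plain java.io, with a hypothetical key/value layout I made up for illustration) showing why repeating the key for every value inflates the bytes the shuffle must serialize and deserialize, compared with packing all values for a key into one record:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TupleGranularity {

    // Granular layout: one record per (key, value) pair, so the key
    // is re-serialized once for every value.
    static int granularBytes(String key, int[] values) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int v : values) {
            out.writeUTF(key);   // redundant key copy per tuple
            out.writeInt(v);
        }
        return buf.size();
    }

    // Packed layout: the key is serialized once, followed by a count
    // and all of its values.
    static int packedBytes(String key, int[] values) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(key);
        out.writeInt(values.length);
        for (int v : values) {
            out.writeInt(v);
        }
        return buf.size();
    }

    public static void main(String[] args) throws IOException {
        int[] counts = new int[100];
        String key = "http://example.com/some/long/path";
        System.out.println("granular: " + granularBytes(key, counts) + " bytes");
        System.out.println("packed:   " + packedBytes(key, counts) + " bytes");
    }
}
```

Since every byte here is written once by the map side and read back at least once by the reduce side, the gap between the two layouts is roughly doubled in actual shuffle work.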

A paper from Berkeley EECS about pipelining between Reduce and Map may
be of interest.

Gang Luo wrote:
> Hi all,
> I have some questions about IO performance in Hadoop. What I want to know is which
of the three phases (mapper input, shuffle, and reducer output) is the bottleneck.
> First, compared with the shuffle phase (mapper output, data transfer over the network, and
the reducers writing to local disk), does it take more time to read the input into the mappers?
My concern is that reading a large table also requires transferring different splits to different
mappers over the network. Is Hadoop smart enough to make input consume less time than the shuffle?
> Second, since the final result is stored in HDFS, it must be replicated into several copies
(by default, 3), which means transferring more than one copy of the result over the network.
If we run another MapReduce job that depends on this result, I think the latency from
replicating the result is non-trivial, right? How does that cost compare with mapper
input and the shuffle?
> Thanks.
> Gang Luo
> ---------
> Department of Computer Science
> Duke University
> (919)316-0993
> gang.luo@duke.edu
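On the replication question above, a rough back-of-the-envelope sketch may help. When the writer runs on a DataNode (as a reducer normally does), HDFS typically places the first replica locally, so a block of output crosses the network roughly (replication - 1) times through the write pipeline. This is an estimate, not a measurement:

```java
public class ReplicationCost {

    // Estimated bytes crossing the network to write `outputBytes` of
    // reducer output: the first replica is usually local, the remaining
    // (replication - 1) replicas travel through the write pipeline.
    static long networkBytes(long outputBytes, int replication) {
        return outputBytes * Math.max(replication - 1, 0);
    }

    public static void main(String[] args) {
        long tenGb = 10L * (1L << 30);
        System.out.println("replication=3: " + networkBytes(tenGb, 3) + " bytes over network");
        System.out.println("replication=1: " + networkBytes(tenGb, 1) + " bytes over network");
    }
}
```

So with the default factor of 3, reducer output generates about twice the job's output size in network traffic, which is why that replication latency is indeed non-trivial for a downstream job that waits on the result.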
