hadoop-hdfs-dev mailing list archives

From Andrew Wang <andrew.w...@cloudera.com>
Subject Re: Understanding Network Utilization of TeraSort
Date Tue, 13 Jan 2015 06:00:31 GMT
Hi Eitan,

Is it possible you have speculative execution enabled? Check that the
number of tasks actually run matches your expectations.

You could also try running the same measurements for TeraGen with different
replication factors, for another comparison point.
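
As a reference point for that comparison, here is a rough sketch of the
replication-only baseline from the quoted message below. It assumes the first
replica of each block is written to the local DataNode, so only the remaining
(replication - 1) copies cross the network. The function names are
illustrative, and the 1-replica figure ("under 1GB") is approximated as the
range 0-1 GB.

```python
# Sketch: network traffic expected from HDFS replication alone.
# Assumption: the first replica is local, so only (replication - 1)
# copies of the output cross the network.

DATA_GB = 100  # size of the sorted output, from the quoted message

# Observed totals from the quoted message, in GB, as (low, high) ranges.
# "Under 1GB" for one replica is approximated here as (0, 1).
OBSERVED_GB = {1: (0, 1), 2: (117, 117), 3: (225, 230)}


def replication_traffic_gb(data_gb, replication):
    """Traffic generated by writing the non-local replicas."""
    return data_gb * (replication - 1)


def unexplained_gb(data_gb, replication, observed):
    """Observed traffic minus the replication-only baseline, as (low, high)."""
    low, high = observed
    baseline = replication_traffic_gb(data_gb, replication)
    return (low - baseline, high - baseline)


for r in sorted(OBSERVED_GB):
    base = replication_traffic_gb(DATA_GB, r)
    extra = unexplained_gb(DATA_GB, r, OBSERVED_GB[r])
    print(f"replication={r}: baseline={base} GB, "
          f"unexplained={extra[0]}-{extra[1]} GB")
```

The unexplained remainder (about 17 GB and 25-30 GB) is the part worth
comparing against task counts and the TeraGen measurements.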


On Fri, Jan 9, 2015 at 6:32 PM, Eitan Rosenfeld <eitan27@gmail.com> wrote:

> My goal is to see how the performance and network utilization of TeraSort
> is affected by varying the replication factor from 1-3 on my 16-node
> cluster. (I have modified TeraSort such that it uses my system's
> replication factor.) I am sorting 100GB.
> In particular, I am confused by the network utilization. With 1 replica,
> the network utilization is under 1GB. With 2 replicas, it is about 117GB.
> And with 3 replicas, it is about 225-230GB.
> I understand that just replicating the 100GB of sorted data causes 100GB
> and 200GB of network traffic in the 2 and 3 replica configurations,
> respectively. However, what accounts for the extra 17GB and 25-30GB in the
> 2 and 3 replica configs? And what accounts for the minimal network usage in
> the 1 replica configuration?
> Note that the data is generated with TeraGen using the same replication
> factor with which it is later sorted.
> Thank you,
> Eitan Rosenfeld
