hadoop-hdfs-dev mailing list archives

From Andrew Wang <andrew.w...@cloudera.com>
Subject Re: Why do reads take as long as replicated writes?
Date Tue, 04 Nov 2014 23:42:23 GMT
I would advise against using TestDFSIO; try TeraGen and TeraValidate
instead. IIRC TestDFSIO doesn't actually schedule for task locality,
so it's not very good if you have a cluster bigger than your replication
factor. You might be network bound as you try to read more files.
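
A rough sketch of the locality argument above, under assumptions that are not
from this thread: each read task runs on a uniformly random node, and a block's
replicas sit on distinct nodes. The function name local_read_fraction is
hypothetical, chosen only for illustration. Under those assumptions, the share
of reads served from a local replica is roughly replication / nodes, so it
shrinks as the cluster grows past the replication factor.

```python
def local_read_fraction(num_nodes: int, replication: int) -> float:
    """Expected fraction of reads served by a replica on the reader's own node.

    Assumes (for illustration only) that the reading task lands on a uniformly
    random node and that replicas of a block live on distinct nodes.
    """
    if num_nodes <= 0 or replication <= 0:
        raise ValueError("num_nodes and replication must be positive")
    # A block cannot have more distinct replica locations than there are nodes.
    return min(replication, num_nodes) / num_nodes


if __name__ == "__main__":
    # Replication factor 2, as in the benchmark described in the quoted message.
    for nodes in (2, 4, 8, 16):
        frac = local_read_fraction(nodes, replication=2)
        print(f"{nodes:2d} nodes: {frac:.1%} local, {1 - frac:.1%} over the network")
```

This is only a back-of-the-envelope model; real placement depends on where the
writer ran and on how the scheduler assigns tasks.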

Best,
Andrew

On Tue, Nov 4, 2014 at 6:19 AM, Eitan Rosenfeld <eitan27@gmail.com> wrote:

> I am benchmarking my cluster of 16 nodes (all in one rack) with TestDFSIO on
> Hadoop 1.0.4.  For simplicity, I turned off speculative task execution and set
> the max map and reduce tasks to 1.
>
> With a replication factor of 2, writing 1 file of 5GB takes twice as long as
> reading 1 file. This result seems to make sense since the replication results
> in twice the I/O in the cluster versus the read. However, as I scale up the
> number of 5GB files from 1 to 64 files, reading ultimately takes as long as
> writing. In particular, I see this result when writing and reading 64
> such files.
>
> What could cause read performance to degrade faster than write performance
> as the number of files increases?
>
> The full results (number of 5GB files, ratio of write time to read
> time) are below:
>  1,  2.02
>  2,  1.87
>  4,  1.73
>  8,  1.54
> 16,  1.37
> 32,  1.29
> 64,  1.01
>
> Thank you,
>
> Eitan
>
