hadoop-hdfs-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: balancing and replication in HDFS
Date Fri, 25 Feb 2011 23:12:32 GMT
Hi Jeff,
The output of terasort has replication level 1 by default. This is so
it goes faster with the default settings and makes for more impressive
benchmark results :)
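
A minimal sketch of raising the replication of output that has already been written. It assumes the /user/hadoop/tera_out5 path from the quoted message and a target of 2 to match the cluster setting. The command is only assembled and printed here; run it on a node that has the hadoop client.

```shell
# Assumption: a target of 2 matches the replication level set on the cluster.
TARGET_REP=2
OUT_DIR=/user/hadoop/tera_out5   # path taken from the quoted "fs -du" output

# -R applies the change to everything under the directory;
# -w waits until the extra replicas have actually been created.
SETREP_CMD="hadoop fs -setrep -R -w $TARGET_REP $OUT_DIR"

# Print for review before running it by hand.
echo "$SETREP_CMD"
```

Once the command completes, total disk usage for that directory should roughly double, since each block then has two copies.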
The reason you see it all on one machine is probably that you're
running with one reducer. Try configuring your terasort to use more
reduce tasks and you should see the load (and space usage) even out.
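
One way to try this, as a sketch: pass the reducer count as a generic -D option ahead of the input and output paths. The count of 8 and the output directory name tera_out5_r8 are assumptions for illustration, not values from this thread. The command is only assembled and printed here.

```shell
# Assumptions (not from the thread): 8 reduce tasks, and a fresh output
# directory so the existing tera_out5 is left untouched.
REDUCES=8
OUT_DIR=tera_out5_r8
EXAMPLES_JAR='/usr/lib/hadoop/hadoop-*-examples.jar'   # same glob as the quoted commands

# Generic options such as -D go before the job's own arguments.
# mapred.reduce.tasks is the 0.20-era name for the reducer count.
SORT_CMD="hadoop jar $EXAMPLES_JAR terasort -Dmapred.reduce.tasks=$REDUCES tera_in5 $OUT_DIR"

# Print for review; on a cluster node, run it with: eval "$SORT_CMD"
echo "$SORT_CMD"
```

With several reducers each one writes its own part file, so the first replica of each lands on whichever node ran that reducer rather than all on one machine.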

On Fri, Feb 25, 2011 at 2:52 PM, Jeffrey Buell <jbuell@vmware.com> wrote:
> I'm a newbie to hadoop and HDFS.  I'm seeing odd behavior in HDFS that I
> hope somebody can clear up for me.  I'm running hadoop version
> 0.20.1+169.127 from the cloudera distro on 4 identical nodes, each with
> 4 cpus and 100GB disk space.  Replication is set to 2.
> I run:
> hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar teragen 50000000 tera_in5
> This produces the expected 10GB of data on disk (5GB * 2 copies).  But the
> data is spread very unevenly across the nodes, ranging from 1.7 to 3.2 GB
> on each node.  Then I sort the data:
> hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar terasort tera_in5 tera_out5
> It finishes successfully, and HDFS recognizes the right amount of data:
> $ hadoop fs -du /user/hadoop/
> Found 2 items
> 5000023410  hdfs://namd-1/user/hadoop/tera_in5
> 5000170993  hdfs://namd-1/user/hadoop/tera_out5
> However all the new data is on one node (apparently randomly chosen), and
> the total disk usage is only 15GB, which means that the output data is not
> replicated.  For nearly all the elapsed time of the sort, the other 3 nodes
> are idle.  Some of the output data is in dfs/data/current, but a lot is in
> one of 64 new subdirs (dfs/data/current/subdir0 through subdir63).
> Why is all this happening?  Am I missing some tunables that make HDFS do
> the right balance and replication?
> Thanks,
> Jeff

Todd Lipcon
Software Engineer, Cloudera
