hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: bandwidth (Was: Re: Running on multiple CPU's)
Date Mon, 16 Apr 2007 17:49:19 GMT
jafarim wrote:
> On linux and jvm6 with normal IDE disks and a giga ethernet switch with
> corresponding NIC and with hadoop 0.9.11's HDFS. We wrote a C program by
> using the native libs provided in the package but then we tested again with
> distcp. The scenario was as follows:
> We ran the test on a cluster with 1 node, then we added the nodes one by one
> until reaching 5 nodes. Same test with samba saturated the link with only
> one node.

How big were the files you were copying?  The distcp task uses mapreduce 
to copy each file as a separate task.  Each task launches in a new JVM, 
and the tasktrackers only poll for new tasks every few seconds.  So, 
with smaller files distcp would not be able to saturate a gigabit switch. 
Ideally each file should take 10 seconds or more to copy.  With a 
gigabit switch, this means a minimum file size of roughly 1GB.
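
As a rough check on that figure, here is a minimal sketch of the arithmetic in Python. The function names and the 5-second per-task overhead are assumptions for illustration only; the message itself gives just the 10-second target and the gigabit link speed.

```python
def min_file_size_bytes(link_bits_per_sec: float, min_copy_seconds: float) -> float:
    """Bytes a link can carry in min_copy_seconds at full line rate."""
    return link_bits_per_sec / 8 * min_copy_seconds


def effective_throughput(file_bytes: float, link_bits_per_sec: float,
                         overhead_seconds: float) -> float:
    """Average bytes/sec for one copy task, counting fixed startup overhead
    (JVM launch, tasktracker polling) during which no data moves."""
    link_bytes_per_sec = link_bits_per_sec / 8
    transfer_seconds = file_bytes / link_bytes_per_sec
    return file_bytes / (transfer_seconds + overhead_seconds)


GIGABIT = 1e9  # bits per second, ignoring protocol overhead

# A copy that lasts 10 seconds at line rate moves 1.25e9 bytes, i.e. about 1GB.
size = min_file_size_bytes(GIGABIT, 10)

# Hypothetical 5-second startup overhead: compare a small and a large file.
small = effective_throughput(10e6, GIGABIT, overhead_seconds=5)
large = effective_throughput(size, GIGABIT, overhead_seconds=5)
print(f"min size: {size:.3g} bytes")
print(f"10MB file: {small / 1e6:.1f} MB/s, {size / 1e9:.2f}GB file: {large / 1e6:.1f} MB/s")
```

With a fixed per-task overhead, a small file spends most of its task time not transferring, which is why many small files cannot fill the link.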

You could also try the single-threaded 'bin/hadoop dfs -put'.

A comparison with Samba is not entirely fair, since HDFS provides 
different features.  For example, HDFS normally replicates data on three 
nodes, so writes consume two or three times the bandwidth (depending 
on whether the source node is a datanode with space available).
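
The "two or three times" figure can be sketched as follows. This is a simplified model that assumes one replica is written locally, without crossing the network, when the source is a datanode with space; the function and parameter names are hypothetical.

```python
def network_copies(replication: int = 3,
                   source_is_datanode_with_space: bool = False) -> int:
    """Number of times each written byte crosses the network.

    Simplified model: if the writer is itself a datanode with free space,
    one replica is stored locally and needs no network transfer.
    """
    local_replicas = 1 if source_is_datanode_with_space else 0
    return max(replication - local_replicas, 0)


def bytes_on_wire(file_bytes: int, replication: int = 3,
                  source_is_datanode_with_space: bool = False) -> int:
    """Total network traffic generated by writing one file."""
    return file_bytes * network_copies(replication, source_is_datanode_with_space)


one_gb = 10**9
print(bytes_on_wire(one_gb, source_is_datanode_with_space=True))   # 2x the file size
print(bytes_on_wire(one_gb, source_is_datanode_with_space=False))  # 3x the file size
```

A single-copy file server such as Samba moves each byte across the network once, so the raw throughput numbers are not directly comparable.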

Finally, 0.9 is a pretty old release.  Hadoop's performance and 
reliability have improved substantially in the past few months.

Doug
