hadoop-common-user mailing list archives

From Raj Vishwanathan <rajv...@yahoo.com>
Subject Re: data distribution in HDFS
Date Mon, 02 Apr 2012 17:28:36 GMT
Stijn,

The first replica of each block is always stored on the local node, provided the client
writing the data runs on a datanode. Assuming that you have a replication factor of 3, the
node that generates the data will get about 10GB of data, and the other 20GB will be
distributed among the other nodes.
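
The arithmetic above can be checked with a rough sketch. It assumes a replication factor of 3, that all blocks are written from a single datanode (which keeps one replica of each block), that the remaining replicas spread evenly over the other four nodes, and that df and dfsadmin report sizes in units of 2^30 bytes. The function name expected_usage is made up for this illustration.

```python
# Back-of-the-envelope check of the numbers in this thread.
ROWS = 100 * 1000 * 1000   # row count passed to teragen in the original post
ROW_BYTES = 100            # teragen rows are 100 bytes (0.1kB) each
REPLICATION = 3            # assumption: replication factor of 3
NODES = 5                  # datanodes in the test setup
GIB = 1024 ** 3            # assumption: tools report in units of 2^30 bytes


def expected_usage(rows, row_bytes, replication, nodes):
    """Return (writer_bytes, per_other_node_bytes).

    Assumes every block is written from one datanode, which keeps one
    replica locally, and that the remaining replicas are spread evenly
    over the other nodes.
    """
    logical = rows * row_bytes
    writer = logical
    others = logical * (replication - 1) / (nodes - 1)
    return writer, others


writer, others = expected_usage(ROWS, ROW_BYTES, REPLICATION, NODES)
print(f"writer node:     {writer / GIB:.1f} GiB")   # writer node:     9.3 GiB
print(f"each other node: {others / GIB:.1f} GiB")   # each other node: 4.7 GiB
```

Under these assumptions the estimate lands close to the reported 9.4GB on the combined node and 4.2-4.8GB on the other four.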

Raj

>________________________________
> From: Stijn De Weirdt <stijn.deweirdt@ugent.be>
>To: common-user@hadoop.apache.org 
>Sent: Monday, April 2, 2012 9:54 AM
>Subject: data distribution in HDFS
> 
>hi all,
>
>i've just started to play around with hdfs+mapred. i'm currently playing with teragen/sort/validate
>to see if i understand it all.
>
>the test setup involves 5 nodes that are all tasktracker and datanode; one of them is also
>jobtracker and namenode on top of that. (this one node is running both the namenode
>hadoop process and the datanode process.)
>
>when i do the teragen run, the data is not distributed equally over all nodes. the
>node that is also namenode gets a bigger portion of all the data (as seen by df on the
>nodes and by using dfsadmin -report).
>i also get this distribution when i run the TestDFSIO write test (50 files of 1GB).
>
>
>i use the basic command line teragen $((100*1000*1000)) /benchmarks/teragen, so i expect
>100M*0.1kB = 10GB of data. (if i add up the volumes in use by hdfs, it's actually quite a bit
>more.)
>4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in use, so this one
>datanode is seen as 2 nodes.
>
>when i do ls on the filesystem, i see that teragen created 250MB files; the current hdfs
>blocksize is 64MB.
>
>is there a reason why one datanode is preferred over the others?
>it is annoying since the terasort output behaves the same, and i can't use the full hdfs
>space for testing that way. also, since more IO comes to this one node, the performance isn't
>really balanced.
>
>many thanks,
>
>stijn