hadoop-common-user mailing list archives

From Hari Sreekumar <hsreeku...@clickable.com>
Subject Some basic questions on replication
Date Thu, 04 Nov 2010 17:54:46 GMT

I have some pretty basic questions on replication that I am not very clear
about, even after reading the online docs.

1. My understanding is that a replication factor of x means any block of data
in HDFS will be available, given enough time, at x different nodes. I am
unsure whether it means x nodes, x locations, or x disks. E.g., if I have
replication set to 3 on a single-node setup with one physical disk, will we
have the same data at 3 locations on the hard drive? What if I have 3 disks
on the node?

2. If I have a 3-node setup with replication set to 1, and I upload a 3 GB
file into it, does that mean approximately 1 GB of the file will be
available on each node?
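
To make the arithmetic behind question 2 concrete, here is a rough sketch in plain Python (not the Hadoop API). It assumes a 64 MB block size and, more importantly, assumes that block replicas are spread perfectly evenly across nodes, which is exactly the premise the question is asking about. The function names are hypothetical and only for illustration.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2

def block_count(file_bytes, block_bytes=64 * MIB):
    """Blocks needed to hold file_bytes (the last block may be partial).

    The 64 MB block size is an assumption, not taken from the cluster.
    """
    return math.ceil(file_bytes / block_bytes)

def even_spread(blocks, nodes, replication):
    """Average replicas per node IF placement were perfectly even.

    This is an idealization; real placement may be skewed.
    """
    return blocks * replication / nodes

blocks = block_count(3 * GIB)
per_node_r1 = even_spread(blocks, nodes=3, replication=1)
per_node_r3 = even_spread(blocks, nodes=3, replication=3)
print(blocks, per_node_r1, per_node_r3)
```

Under these assumptions, replication=1 leaves each node with about a third of the blocks, while replication=3 on 3 nodes leaves every node with a copy of every block.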

3. If I run a mapreduce job on the 3 GB file above, will there be any data
transfer between the nodes for the map phase? The optimizer will try to
assign tasks in a way that each node uses the locally available data, so
each node will run the map function based on locally available data, right?
In the reduce phase, of course, the map outputs will be shuffled.

I am asking these questions because in a recent test MapReduce job I ran on
a 2.13 GB file (3-node cluster), the job completed in 40 s with
replication=3, but took 1 min 45 s with replication=1. What could be the
reason for this? Can network latency be a reason? The job is simply an
aggregation where map returns IntWritable(1)s and reduce just sums them up.
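
For reference, the shape of the job can be sketched as follows. This is a plain-Python stand-in, not the actual Hadoop code: the key extraction (first tab-separated field) and the function names are hypothetical, and the in-memory grouping only imitates the shuffle.

```python
from collections import defaultdict

def map_fn(record):
    # Emit (key, 1) per record; taking the first tab-separated
    # field as the key is an assumption for illustration.
    key = record.split("\t", 1)[0]
    yield key, 1

def reduce_fn(key, values):
    # Sum the 1s emitted for this key.
    return key, sum(values)

def run_job(records):
    # Group map outputs by key (stand-in for the shuffle), then reduce.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())
```

For example, `run_job(["a\tx", "b\ty", "a\tz"])` counts two records for `a` and one for `b`.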

Thanks in advance,
