hadoop-common-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: Some basic questions on replication
Date Thu, 04 Nov 2010 20:33:18 GMT
Hello again,

On Fri, Nov 5, 2010 at 12:52 AM, Hari Sreekumar
<hsreekumar@clickable.com> wrote:
> Hi Harsh,
>             Thanks for the reply. So if I have a 2048 MB file with 64 MB
> block size (32 blocks) with replication 3, then I'll have 96 blocks of the
> file on HDFS, with no two similar blocks being on the same datanode. Also,
> if I change the dfs.replication property, does it affect files already in
> HDFS or is it valid only for new files that will be uploaded into HDFS? Is
> there a way to rebalance the cluster based on the new replication factor?

Replication is tracked per file. Restarting DataNodes with a different
replication factor value will not affect existing files; the new default
applies only to files written afterwards.
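You can check what replication factor an existing file actually has with
fsck (the path below is just a hypothetical example):

```shell
# List the file's blocks along with their current replication
# and locations. /user/hari/data.bin is a hypothetical path.
hadoop fsck /user/hari/data.bin -files -blocks -locations
```

Each block in the output is reported with its replication count, so you
can see that old files keep their original factor.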

There's a "setrep" command you can use to change the replication factor
of existing files. See:
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#setrep
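For example, to set everything under a directory to replication 3 and
wait for the change to take effect (the path is hypothetical):

```shell
# -R applies the new factor recursively; -w blocks until the
# NameNode reports the target replication has been reached.
# /user/hari/data is a hypothetical path.
hadoop fs -setrep -w 3 -R /user/hari/data
```

The -w flag can take a while on large directories, since the NameNode
has to schedule and complete the extra copies (or deletions) first.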

> And if I have replication set to 3, do all the 3 disk writes happen
> simultaneously or is there some background process which does the
> replication? If not, then increasing replication would lead to more writes
> and thus reduce performance of any write-intensive job, am I right?

They do not happen simultaneously. The NameNode schedules block
replication work for DataNodes every nth interval, set by
"dfs.namenode.replication.interval" (3 seconds by default). I believe
the load on the nodes is also considered when assigning replication
work to DataNodes.
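As a sketch, that interval would be tuned in hdfs-site.xml like the
snippet below (the property name is the one given above; do verify it
against the docs for your Hadoop version before relying on it):

```xml
<property>
  <!-- Seconds between NameNode replication work computations -->
  <name>dfs.namenode.replication.interval</name>
  <value>3</value>
</property>
```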

I haven't seen increasing/decreasing replication factors affect the
write performance of MapReduce jobs, but yes, I suppose it could, in
the case of decommissioning a node and/or re-balancing, lower the
network transfer rates that the jobs may depend on.

Harsh J
