hadoop-mapreduce-user mailing list archives

From Andrey Elenskiy <andrey.elens...@arista.com>
Subject Confusion between dfs.replication and dfs.namenode.replication.min options in hdfs-site.xml
Date Wed, 01 Feb 2017 21:37:09 GMT
Hello,

I use hadoop 2.7.3 non-HA setup with hbase 1.2.3 on top of it.

I'm trying to understand these options in hdfs-site.xml:

dfs.replication (value: 3)
    Default block replication. The actual number of replications can be
    specified when the file is created. The default is used if replication is
    not specified in create time.

dfs.namenode.replication.min (value: 1)
    Minimal block replication.

What I'm trying to do is to make sure that on write we always end up with 2
replicas minimum. In other words, a write should fail if we don't end up
with 2 replicas of each block.

As I understand it, on write, hadoop creates a write pipeline of datanodes
where each datanode writes to the next one. Here's a diagram from Cloudera:
[image: Inline image 1]
Is it correct to say that the dfs.namenode.replication.min option controls
how many datanodes in the pipeline must have COMPLETEd the block before the
write is considered successful and success is acked to the client? And does
the dfs.replication option mean that we eventually want this many replicas
of each block, but that they need not all exist at write time and can be
created asynchronously later by the Namenode?

So, essentially, if I want a guarantee that I have one backup of each
block at all times, I need to set dfs.namenode.replication.min=2. And,
if I want to make sure that I won't go into safemode on startup too often,
I should set dfs.replication=3 to tolerate the loss of one replica.
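
To make the question concrete, here is a sketch of the hdfs-site.xml
fragment I have in mind. It is untested, and it assumes my reading of the
two options above is right, which is exactly what I'm asking about:

```xml
<configuration>
  <!-- Target number of replicas per block (assumed to be reached
       eventually, not necessarily at write time). -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Minimal block replication. Assumption: raising this from 1 to 2
       makes a write fail unless 2 replicas of each block exist. -->
  <property>
    <name>dfs.namenode.replication.min</name>
    <value>2</value>
  </property>
</configuration>
```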
