hadoop-hdfs-issues mailing list archives

From "Yanbo Liang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3772) HDFS NN will hang in safe mode and never come out if we change the dfs.namenode.replication.min bigger.
Date Wed, 08 Aug 2012 13:50:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431115#comment-13431115 ]

Yanbo Liang commented on HDFS-3772:
-----------------------------------

After surveying the source code, I found that once the minimum replication is modified, the HDFS
persistent storage retains no information about the former minimum replication, so we cannot
recover it after a restart. We therefore need to think outside the box.
I found that blockThreshold (dfs.namenode.safemode.threshold-pct) specifies the percentage
of blocks that should satisfy the minimal replication requirement defined by dfs.namenode.replication.min.
I think we can change the semantics of this parameter to mean the percentage of blocks that satisfy
the actual replication factor of each file. We would then compare the replication the NN has
received with each file's actual replication factor, and increment blockSafe when they are equal.
Does anyone have opinions on this? If this approach is acceptable, I will fix it.
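As a rough sketch of the proposed semantics (the class and member names below are
illustrative only, not the actual FSNamesystem fields):

```java
// Illustrative sketch: count a block as "safe" once the number of replicas
// the NN has received equals the file's own expected replication factor,
// rather than the cluster-wide dfs.namenode.replication.min. Names here
// (SafeModeTracker, expectedReplication) are hypothetical.
public class SafeModeTracker {
    private long blockSafe = 0;          // blocks considered safely replicated
    private final long blockThreshold;   // blocks needed to leave safe mode

    public SafeModeTracker(long blockThreshold) {
        this.blockThreshold = blockThreshold;
    }

    // Proposed change: compare the reported replica count against the
    // file's actual replication factor instead of safeReplication, so a
    // raised minimum replication cannot make blocks permanently "unsafe".
    public synchronized void incrementSafeBlockCount(short reportedReplicas,
                                                     short expectedReplication) {
        if (reportedReplicas == expectedReplication) {
            blockSafe++;
        }
    }

    public synchronized boolean canLeaveSafeMode() {
        return blockSafe >= blockThreshold;
    }

    public synchronized long getBlockSafe() {
        return blockSafe;
    }
}
```

Under these semantics, a file written with replication 1 still counts toward the
threshold after the administrator raises dfs.namenode.replication.min, because the NN
has received every replica the file actually has.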
                
> HDFS NN will hang in safe mode and never come out if we change the dfs.namenode.replication.min bigger.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3772
>                 URL: https://issues.apache.org/jira/browse/HDFS-3772
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.0-alpha
>            Reporter: Yanbo Liang
>
> If the NN restarts with a new minimum replication (dfs.namenode.replication.min), any
files created with the old replication count are expected to be bumped up to the new minimum
automatically upon restart. However, in reality, if the NN restarts with a new minimum
replication that is bigger than the old one, the NN will hang in safemode and never come
out.
> The corresponding test case passes only because we are missing some test coverage. This
was discussed in HDFS-3734.
> If the NN receives enough reported blocks satisfying the new minimum replication, it
exits safe mode. However, if we change to a bigger minimum replication, there will not be
enough blocks satisfying the new limit.
> Look at the code segment in FSNamesystem.java:
> private synchronized void incrementSafeBlockCount(short replication) {
>     if (replication == safeReplication) {
>         this.blockSafe++;
>         checkMode();
>     }
> }
> The DNs report blocks to the NN, and if a block's replication is equal to safeReplication
(which is assigned from the new minimum replication), we increment blockSafe. But if we change
to a bigger minimum replication, all the blocks whose replication is lower than it can never
satisfy this equality, even though the NN has actually received complete block information.
As a result, blockSafe does not increase as usual and never reaches the amount needed to exit
safe mode, so the NN hangs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
