hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4832) Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave
Date Thu, 06 Jun 2013 21:34:21 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677537#comment-13677537 ]

Kihwal Lee commented on HDFS-4832:
----------------------------------

+1 to the approach. This patch stops both the generation of new work and the sending of remaining work.
Since the replication queues are kept updated in manual safe mode, it is okay to skip reinitializing
them when exiting manual safe mode. HA is fine with this change: when the SBN transitions
to active, the queues are cleared and initialized unless the NN is in startup safe mode,
in which case the repl queues are initialized later, when exiting safe mode.
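
The queue handling described above can be pictured with a small toy model. This is only a sketch under assumptions: the class, enum, and method names are hypothetical and do not correspond to the actual NameNode code or to the attached patch. It only models the three behaviours mentioned in the comment (no work sent while in safe mode, no reinitialization when leaving manual safe mode, deferred initialization when the NN is still in startup safe mode).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model only; names are hypothetical and do not mirror the real NameNode classes. */
public class SafeModeSketch {
    enum SafeMode { OFF, STARTUP, MANUAL }

    private SafeMode mode = SafeMode.STARTUP;
    private boolean queuesInitialized = false;
    private int initCount = 0;
    private final Queue<String> replQueue = new ArrayDeque<>();

    /** Clears and (re)builds the replication queue, counting how often this happens. */
    private void initializeReplQueues() {
        replQueue.clear();
        queuesInitialized = true;
        initCount++;
    }

    /** Once initialized, the queue keeps being updated, including in manual safe mode. */
    public void reportUnderReplicated(String blockId) {
        if (queuesInitialized) {
            replQueue.add(blockId);
        }
    }

    /** No replication work is sent while in any safe mode; the queue is left intact. */
    public List<String> dispatchWork() {
        List<String> sent = new ArrayList<>();
        if (mode != SafeMode.OFF) {
            return sent;
        }
        while (!replQueue.isEmpty()) {
            sent.add(replQueue.poll());
        }
        return sent;
    }

    public void enterManualSafeMode() {
        mode = SafeMode.MANUAL;
    }

    /** Initializes the queue only when leaving startup safe mode without one. */
    public void leaveSafeMode() {
        if (mode == SafeMode.STARTUP && !queuesInitialized) {
            initializeReplQueues();
        }
        mode = SafeMode.OFF;
    }

    /** Standby-to-active: clear and initialize unless still in startup safe mode. */
    public void transitionToActive() {
        if (mode != SafeMode.STARTUP) {
            initializeReplQueues();
        }
    }

    public int initCount() { return initCount; }

    public int queuedBlocks() { return replQueue.size(); }

    public static void main(String[] args) {
        SafeModeSketch nn = new SafeModeSketch();
        nn.leaveSafeMode();                 // startup exit initializes the queue
        nn.enterManualSafeMode();
        nn.reportUnderReplicated("blk_1");  // queue is still updated
        System.out.println("sent in safe mode: " + nn.dispatchWork().size());
        nn.leaveSafeMode();                 // manual exit: no reinitialization
        System.out.println("queue initializations: " + nn.initCount());
        System.out.println("sent after leaving: " + nn.dispatchWork().size());
    }
}
```

In this model the queued block survives manual safe mode and is sent only after leaving it, which is why skipping the reinitialization loses nothing.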

                
> Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-4832
>                 URL: https://issues.apache.org/jira/browse/HDFS-4832
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 0.23.7, 2.1.0-beta
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>            Priority: Critical
>         Attachments: HDFS-4832.patch, HDFS-4832.patch, HDFS-4832.patch, HDFS-4832.patch
>
>
> Courtesy Karri VRK Reddy!
> {quote}
> 1. Namenode lost datanodes, causing missing blocks.
> 2. Namenode was put in safe mode.
> 3. Datanodes were restarted on the dead nodes.
> 4. Waited a long time for the NN UI to reflect the recovered blocks.
> 5. Forced the NN out of safe mode and suddenly there were no more missing blocks.
> {quote}
> I was able to replicate this on 0.23 and trunk. I set dfs.namenode.heartbeat.recheck-interval
> to 1 and killed the DN to simulate a "lost" datanode. The opposite case also has problems (i.e.
> a datanode failing while the NN is in safemode doesn't lead to a missing blocks message).
> Without the NN updating this list of missing blocks, the grid admins will not know when
> to take the cluster out of safemode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
