hadoop-hdfs-issues mailing list archives

From "Nicolas Fraison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13295) Namenode doesn't leave safemode if dfs.namenode.safemode.replication.min set < dfs.namenode.replication.min
Date Mon, 19 Mar 2018 08:20:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404483#comment-16404483
] 

Nicolas Fraison commented on HDFS-13295:
----------------------------------------

Hi [~bharatviswa]
Yes, I have faced this issue on my cluster.
At the beginning, the number of blocks still to be reported was decreasing as usual. After reaching
1k to 10k blocks, it started to increase again almost every minute (that's why I suspected
modifications coming from the edit logs).
To understand what was happening, I attached a debugger to the namenode and saw that blocks reported by
BlockManager.completeBlock were always sent with a replication value of 2, which was greater than the
safemode minimum replication I had set.
I think that if there are not enough changes on the cluster, or if the namenode starts quickly,
you will probably not hit this issue.
After applying the patch I no longer see the issue.

Here is the thread stack of the EditLogTailer (from the CDH 5.11 build we use), where I can see those
calls to incrementSafeBlockCount with the replication number set to 2:
{code}
"Edit log tailer@6272" prio=5 tid=0x55b nid=NA runnable
  java.lang.Thread.State: RUNNABLE
	  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.completeBlock(BlockManager.java:811)
	  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.completeBlock(BlockManager.java:823)
	  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:836)
	  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:954)
	  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:433)
	  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:231)
	  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:140)
	  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:856)
	  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:837)
	  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
	  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
	  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
	  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
	  at java.security.AccessController.doPrivileged(AccessController.java:-1)
	  at javax.security.auth.Subject.doAs(Subject.java:360)
	  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1900)
	  at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:442)
	  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
{code}

Value of ucBlock (curBlock in trunk) in completeBlock:
{code}
blk_1155343062_81635483{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-c175308c-644d-48dc-bbf1-bf3e95a656de:NORMAL:10.60.32.151:1004|FINALIZED],
ReplicaUnderConstruction[[DISK]DS-59732d1e-6181-4595-b0aa-376c44e1d1fd:NORMAL:10.60.32.236:1004|FINALIZED],
ReplicaUnderConstruction[[DISK]DS-a0924255-bb90-45fb-9d67-2498c6e6559d:NORMAL:10.60.32.155:1004|FINALIZED]]}
{code}

> Namenode doesn't leave safemode if dfs.namenode.safemode.replication.min set < dfs.namenode.replication.min
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13295
>                 URL: https://issues.apache.org/jira/browse/HDFS-13295
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>         Environment: CDH 5.11 with HDFS-8716 backported.
> dfs.namenode.replication.min=2
> dfs.namenode.safemode.replication.min=1
>  
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Major
>         Attachments: HDFS-13295.patch
>
>
> When we set dfs.namenode.safemode.replication.min < dfs.namenode.replication.min (possible since the
> HDFS-8716 patch), the number of replicas for which the safe block count is incremented
> must be equal to dfs.namenode.safemode.replication.min in `FSNamesystem.incrementSafeBlockCount`.
> When reading modifications from edits, the replica number for new blocks is set to min(numNodes,
> dfs.namenode.replication.min) in BlockManager.completeBlock, which is greater than dfs.namenode.safemode.replication.min.
> Because of that, the safe block count never reaches the number of available blocks and the namenode doesn't
> leave safemode automatically.
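
The accounting mismatch described in the issue can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the HDFS code and not the attached patch: the class `SafeModeModel`, its methods, and the "capped" variant are hypothetical names introduced for illustration, and the real safemode check also involves a threshold percentage that is left out here. It assumes, as the description states, that the safe block count is incremented only when the reported replica count equals dfs.namenode.safemode.replication.min, and that the edit-log path reports min(numNodes, dfs.namenode.replication.min).

```java
/** Hypothetical, simplified model of the safe-block accounting described in HDFS-13295. */
final class SafeModeModel {
    private final int safeReplication; // stands in for dfs.namenode.safemode.replication.min
    private final int minReplication;  // stands in for dfs.namenode.replication.min
    private long blockTotal;
    private long blockSafe;

    SafeModeModel(int safeReplication, int minReplication) {
        this.safeReplication = safeReplication;
        this.minReplication = minReplication;
    }

    /** Counts a block as safe only when the reported value equals the safemode minimum. */
    void incrementSafeBlockCount(int replication) {
        if (replication == safeReplication) {
            blockSafe++;
        }
    }

    /** Edit-log path as described in the issue: reports min(numNodes, minReplication). */
    void completeBlockFromEdits(int numNodes) {
        blockTotal++;
        incrementSafeBlockCount(Math.min(numNodes, minReplication));
    }

    /** Illustrative alternative (an assumption, not the patch): cap at the safemode minimum. */
    void completeBlockCapped(int numNodes) {
        blockTotal++;
        incrementSafeBlockCount(Math.min(numNodes, safeReplication));
    }

    boolean allBlocksSafe() {
        return blockSafe >= blockTotal;
    }

    /** Replays `blocks` completed blocks with `numNodes` replicas each; true if all count as safe. */
    static boolean replay(int safeRep, int minRep, int blocks, int numNodes, boolean capped) {
        SafeModeModel model = new SafeModeModel(safeRep, minRep);
        for (int i = 0; i < blocks; i++) {
            if (capped) {
                model.completeBlockCapped(numNodes);
            } else {
                model.completeBlockFromEdits(numNodes);
            }
        }
        return model.allBlocksSafe();
    }

    public static void main(String[] args) {
        // Settings from the report: safemode minimum 1, replication minimum 2, 3 replicas per block.
        System.out.println("as described: " + replay(1, 2, 1000, 3, false));
        System.out.println("capped:       " + replay(1, 2, 1000, 3, true));
    }
}
```

With the reported settings, every block replayed from edits is reported as 2, which never equals the safemode minimum of 1, so the total grows while the safe count does not. When both settings are equal the two paths agree, which matches the observation that the problem only appears when dfs.namenode.safemode.replication.min is set below dfs.namenode.replication.min.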



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
