hadoop-hdfs-issues mailing list archives

From "Vinayakumar B (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9917) IBR accumulate more objects when SNN was down for sometime.
Date Mon, 28 Mar 2016 08:13:25 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213947#comment-15213947 ]

Vinayakumar B commented on HDFS-9917:
-------------------------------------

The current changes for clearing IBRs on re-Register() look good.

For the second part, i.e. avoiding accumulation of IBRs when the standby is down for a long time,
can we consider the approach below? (I already mentioned this in my comment above.)

1. IBRs for the Standby NN can have a threshold (say 100K or 1 million IBRs).
2. To avoid losing any important IBRs, the pending IBRs are cleared only when the threshold is
reached AND the time since 'lastIBR' is more than 'heartbeatExpiryInterval', i.e. the DataNode is
already considered dead on the NameNode side. In that case re-Register() will certainly be called
on reconnection to the running NameNode (if any).
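
A minimal sketch of the rule in points 1 and 2, to make the AND condition concrete. This is not code from the attached patch: the class `PendingIbrQueue`, its fields, and the use of plain strings for IBR entries are all hypothetical, and it assumes 'lastIBR' means the time of the last IBR successfully sent to that NameNode.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; names are illustrative, not taken from the HDFS-9917 patch.
class PendingIbrQueue {
    private final Deque<String> pending = new ArrayDeque<>();
    private final int threshold;            // e.g. 100K or 1 million IBRs
    private final long heartbeatExpiryMs;   // DataNode-side copy of the NN expiry interval
    private long lastIbrSentMs;             // assumed meaning of 'lastIBR'

    PendingIbrQueue(int threshold, long heartbeatExpiryMs, long nowMs) {
        this.threshold = threshold;
        this.heartbeatExpiryMs = heartbeatExpiryMs;
        this.lastIbrSentMs = nowMs;
    }

    // Clear only when BOTH conditions hold: the queue is over the threshold
    // and the NameNode would already consider this DataNode dead.
    boolean shouldClear(long nowMs) {
        return pending.size() >= threshold
            && (nowMs - lastIbrSentMs) > heartbeatExpiryMs;
    }

    // Queue one IBR entry, then drop everything if the rule above fires.
    void add(String ibr, long nowMs) {
        pending.add(ibr);
        if (shouldClear(nowMs)) {
            pending.clear();
        }
    }

    // Called after the pending IBRs were delivered successfully.
    void markSent(long nowMs) {
        pending.clear();
        lastIbrSentMs = nowMs;
    }

    int size() {
        return pending.size();
    }
}
```

With this shape, a short outage never drops IBRs even if the threshold is exceeded, because the time condition is not yet met. Dropping is safe only once the DataNode would have to re-Register() anyway.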

The only question is that *heartBeatExpiryInterval* in the NameNode depends on the conf "dfs.namenode.heartbeat.recheck-interval",
which is a NameNode-side configuration. By default this is 5 min. If it is changed on the
NameNode side, the same change must also be present in the DataNode config. Is it okay to
use this, or should we introduce a common conf for NN and DN?
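
For reference, a small sketch of how the expiry interval is derived, as far as I understand the NameNode side: twice the recheck interval plus ten heartbeat intervals, where the heartbeat interval comes from "dfs.heartbeat.interval" (in seconds, default 3). The class and method names below are hypothetical; only the formula and the defaults are meant to mirror the NameNode behaviour.

```java
// Hypothetical helper; mirrors the NameNode-side formula as I understand it.
class HeartbeatExpiry {
    // recheckIntervalMs: dfs.namenode.heartbeat.recheck-interval (milliseconds)
    // heartbeatIntervalSec: dfs.heartbeat.interval (seconds)
    static long expiryIntervalMs(long recheckIntervalMs, long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
    }

    public static void main(String[] args) {
        // Defaults: 5 min recheck interval, 3 s heartbeat interval.
        long expiry = expiryIntervalMs(5 * 60 * 1000L, 3L);
        System.out.println(expiry + " ms");
    }
}
```

With the defaults this gives 630000 ms (10.5 minutes), so a DataNode computing the value locally needs both settings to match the NameNode's.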

[~szetszwo], what is your opinion on this?

> IBR accumulate more objects when SNN was down for sometime.
> -----------------------------------------------------------
>
>                 Key: HDFS-9917
>                 URL: https://issues.apache.org/jira/browse/HDFS-9917
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>         Attachments: HDFS-9917.patch
>
>
> SNN was down for some time for some reason. After restarting the SNN, it became unresponsive because:
> - 29 DNs were each sending IBRs of about 5 million entries (most of them delete IBRs), whereas each datanode had only ~2.5 million blocks.
> - GC can't reclaim these objects since they are all held in the RPC queue.
> To recover (to clear these objects), all the DNs were restarted one by one. This issue happened in 2.4.1, where splitting of block reports was not available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
