hadoop-hdfs-issues mailing list archives

From "Chackaravarthy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10365) FullBlockReports retransmission delays NN startup time in large cluster.
Date Wed, 04 May 2016 20:26:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271381#comment-15271381 ]

Chackaravarthy commented on HDFS-10365:
---------------------------------------

Hi [~kihwal], thanks for the pointers. In our cluster, the NN is configured with G1 GC.
In our case too, the NN comes out of safe mode in 15 to 20 minutes. But it is still flooded
with FBRs from the DNs, because every first FBR times out: the NN gets an error only while
sending the response, yet it still updates its state and comes out of safe mode.
{quote}
Have datanodes break up full block reports by storage. This makes each FBR RPC smaller, so
the impact of timeout-retransmit can be lower.
{quote}
Are you suggesting tuning {{dfs.blockreport.split.threshold}} so that each DN sends its FBR per
storage? Currently the average total block count per DN is around 200k. So if I reduce
{{dfs.blockreport.split.threshold}} from 1 million (the default) to 100k or 150k, this would
make each FBR RPC smaller. Is this what you meant?
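
To make the question concrete, here is a rough sketch of how I understand the split decision. It assumes the DN compares its total block count across all storages with the threshold, sends a single RPC when the total is below it, and sends one RPC per storage otherwise. The function name, storage IDs, and the 12-storage layout are hypothetical and only for illustration; this is not the actual DataNode code.

```python
from typing import Dict, List

# Default value of dfs.blockreport.split.threshold as cited in this comment.
DEFAULT_SPLIT_THRESHOLD = 1_000_000


def plan_block_report_rpcs(
    blocks_per_storage: Dict[str, int],
    split_threshold: int = DEFAULT_SPLIT_THRESHOLD,
) -> List[Dict[str, int]]:
    """Return one dict (storage ID -> block count) per FBR RPC the DN would send.

    Assumption: the split is decided on the DN's total block count.
    """
    total_blocks = sum(blocks_per_storage.values())
    if total_blocks < split_threshold:
        # Below the threshold: every storage is reported in a single RPC.
        return [dict(blocks_per_storage)]
    # At or above the threshold: one smaller RPC per storage.
    return [{storage_id: count} for storage_id, count in blocks_per_storage.items()]


# Hypothetical DN with 12 storages and roughly 200k blocks in total.
storages = {f"DS-{i}": 16_667 for i in range(12)}

for threshold in (1_000_000, 150_000):
    rpcs = plan_block_report_rpcs(storages, threshold)
    largest = max(sum(rpc.values()) for rpc in rpcs)
    print(f"threshold={threshold}: {len(rpcs)} RPC(s), largest carries {largest} blocks")
```

Under these assumptions, a DN with about 200k blocks stays on a single large RPC with the default threshold, and switches to per-storage RPCs once the threshold is lowered to 150k or 100k.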

> FullBlockReports retransmission delays NN startup time in large cluster.
> ------------------------------------------------------------------------
>
>                 Key: HDFS-10365
>                 URL: https://issues.apache.org/jira/browse/HDFS-10365
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.6.0
>         Environment: version - hadoop-2.6.0 (hdp-2.2)
> DN - 1200 nodes
>            Reporter: Chackaravarthy
>            Priority: Critical
>
> Whenever the NN is restarted, it takes a very long time to return to a stable state, i.e. the
> last contact time stays above 1 or 2 minutes continuously for around 3 to 4 hours. This is
> mainly because most of the DNs hit the timeout (60s) in the blockReport (FBR) RPC call and
> then keep resending the FBR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


