hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10365) FullBlockReports retransmission delays NN startup time in large cluster.
Date Wed, 04 May 2016 21:03:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271433#comment-15271433 ]

Kihwal Lee commented on HDFS-10365:
-----------------------------------

bq. Are you suggesting to tune dfs.blockreport.split.threshold ...
Yes. Each FBR RPC will be smaller, so the impact of a timeout and retransmit will be lower. Also,
the NN will process each individual report more quickly.
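
To illustrate the effect of the threshold, here is a minimal sketch that models the documented
behaviour of dfs.blockreport.split.threshold: below the threshold a DN sends all storages in one
FBR RPC, otherwise it sends one RPC per storage directory. This is not the actual DataNode code;
the class name, method names and block counts are made up for the example.

```java
import java.util.Arrays;

// Sketch only: models the documented behaviour of dfs.blockreport.split.threshold.
// It is not the DataNode implementation; all names here are hypothetical.
class BlockReportSplitSketch {

    // Number of FBR RPCs a DN would send for one full block report.
    static int rpcCount(long[] blocksPerStorage, long splitThreshold) {
        long total = Arrays.stream(blocksPerStorage).sum();
        // Below the threshold: every storage is reported in a single RPC.
        if (total < splitThreshold) {
            return 1;
        }
        // At or above the threshold: one RPC per storage directory.
        return blocksPerStorage.length;
    }

    // Largest number of blocks carried by any single FBR RPC.
    static long largestRpc(long[] blocksPerStorage, long splitThreshold) {
        long total = Arrays.stream(blocksPerStorage).sum();
        if (total < splitThreshold) {
            return total;
        }
        return Arrays.stream(blocksPerStorage).max().orElse(0L);
    }

    public static void main(String[] args) {
        // Hypothetical DN with four storage directories.
        long[] storages = {90_000L, 80_000L, 70_000L, 60_000L};

        long defaultThreshold = 1_000_000L; // default value of the property
        long tunedThreshold = 100_000L;     // assumed lower setting, for illustration

        System.out.println("default: " + rpcCount(storages, defaultThreshold)
                + " RPC(s), largest " + largestRpc(storages, defaultThreshold) + " blocks");
        System.out.println("tuned:   " + rpcCount(storages, tunedThreshold)
                + " RPC(s), largest " + largestRpc(storages, tunedThreshold) + " blocks");
    }
}
```

With the lower threshold, a timed-out RPC only forces that one storage's report to be resent
rather than the report for the whole DN.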

The NN creates lots of temporary objects, and their lifetime can be longer than expected, especially
when requests need to sit in the call queue for a long time. A larger CMS young gen has been
shown to absorb them well, and the young gen collection time does not become excessive because most
objects get freed rather than copied. Of course, it is a bit different during start-up, when
the data structures are being built up. We thought a high rate of allocation and freeing
would go well with what G1 is designed for, but so far we haven't found a magic recipe that
works better than CMS for large namenodes. I hear it does a better job for HBase region
servers, though.

> FullBlockReports retransmission delays NN startup time in large cluster.
> ------------------------------------------------------------------------
>
>                 Key: HDFS-10365
>                 URL: https://issues.apache.org/jira/browse/HDFS-10365
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.6.0
>         Environment: version - hadoop-2.6.0 (hdp-2.2)
> DN - 1200 nodes
>            Reporter: Chackaravarthy
>            Priority: Critical
>
> Whenever the NN is restarted, it takes a very long time to return to a stable state, i.e.
> the last contact time remains above 1 to 2 minutes continuously for around 3 to 4 hours. This
> is mainly because most of the DNs time out (60s) in the blockReport (FBR) RPC call and
> then keep sending the FBR again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
