hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10365) FullBlockReports retransmission delays NN startup time in large cluster.
Date Wed, 04 May 2016 20:05:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271340#comment-15271340 ]

Kihwal Lee commented on HDFS-10365:
-----------------------------------

I checked one of our mid-size clusters with 2,400 nodes and 120M blocks. During the last start-up,
the start-up safe mode lasted about 14 minutes. But it runs 2.7.2 + patches, so it might
behave a bit differently from yours. There are a few things that affect the safe mode time greatly:
- A bigger initial delay, so that reports are more widely spaced.
- A sufficiently big CMS "young gen" size to absorb the influx of large requests. So far CMS seems
to work better than G1 for big namenodes.
- Having datanodes break up full block reports by storage. This makes each FBR RPC smaller,
so the impact of timeout-retransmit can be lower.
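
To put rough numbers on the first and third points, here is a back-of-the-envelope
sketch. It assumes each datanode picks a uniformly random first-report delay up to the
configured initial delay, and that blocks are spread evenly across nodes and storages.
The delay values and the 12 storages per node are hypothetical; only the node and block
counts come from the cluster described above.

```python
# Rough arithmetic only. Every input except the 2,400 nodes and 120M blocks
# is an illustrative assumption, not a measured value.

def fbr_arrival_rate(num_datanodes, initial_delay_s):
    """Average full block reports (FBRs) arriving per second, assuming each
    datanode picks a uniformly random delay in [0, initial_delay_s)."""
    return num_datanodes / initial_delay_s

def blocks_per_rpc(total_blocks, num_datanodes, storages_per_node, split_by_storage):
    """Average number of blocks carried by one FBR RPC, assuming an even
    spread of blocks across nodes and storages."""
    per_node = total_blocks / num_datanodes
    return per_node / storages_per_node if split_by_storage else per_node

if __name__ == "__main__":
    for delay in (120, 600):  # hypothetical initial delays, in seconds
        print(delay, "s initial delay:", fbr_arrival_rate(2400, delay), "FBR/s")
    print("whole-node report:", blocks_per_rpc(120_000_000, 2400, 12, False), "blocks")
    print("per-storage report:", blocks_per_rpc(120_000_000, 2400, 12, True), "blocks")
```

Under these assumptions, stretching the delay from 2 to 10 minutes cuts the average
arrival rate by a factor of five, and splitting by storage shrinks each RPC by the
number of storages per node.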

I think the number of handlers is too high. It makes the call queue bigger, so more things
will queue up and time out.
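
A rough sketch of why a larger handler count can hurt. It assumes the call queue
capacity is the handler count times a per-handler queue size (100 is used here as the
assumed default), and that queued block reports are effectively processed one at a
time. The handler counts and the 50 ms per-report service time are hypothetical; the
60 s timeout is the one mentioned in the issue description.

```python
# Illustrative arithmetic only. Handler counts, per-handler queue size and
# the per-report service time are assumptions, not values from this cluster.

RPC_TIMEOUT_S = 60  # blockReport RPC timeout quoted in the issue description

def call_queue_capacity(handler_count, queue_size_per_handler=100):
    """Maximum number of calls that can sit in the queue (assumed to be
    handler count times per-handler queue size)."""
    return handler_count * queue_size_per_handler

def worst_case_wait_s(handler_count, service_time_ms, queue_size_per_handler=100):
    """Wait for the last call in a full queue, assuming calls are served
    strictly one at a time (e.g. serialized on a global lock)."""
    capacity = call_queue_capacity(handler_count, queue_size_per_handler)
    return capacity * service_time_ms / 1000

if __name__ == "__main__":
    for handlers in (10, 200):  # hypothetical handler counts
        wait = worst_case_wait_s(handlers, service_time_ms=50)
        print(handlers, "handlers:", wait, "s worst-case wait,",
              "exceeds timeout" if wait > RPC_TIMEOUT_S else "within timeout")
```

With these numbers, the small queue drains before the timeout, while the large one
lets calls sit well past 60 s, so the datanode has already given up and retransmitted
by the time its report is processed.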

> FullBlockReports retransmission delays NN startup time in large cluster.
> ------------------------------------------------------------------------
>
>                 Key: HDFS-10365
>                 URL: https://issues.apache.org/jira/browse/HDFS-10365
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.6.0
>         Environment: version - hadoop-2.6.0 (hdp-2.2)
> DN - 1200 nodes
>            Reporter: Chackaravarthy
>            Priority: Critical
>
> Whenever the NN is restarted, it takes a very long time to come back to a stable state,
> i.e. the last contact time stays above 1 or 2 minutes continuously for around 3 to 4 hours.
> This is mainly because most of the DNs hit the timeout (60s) in the blockReport (FBR) RPC
> call and then keep sending the FBR again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


