hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10365) FullBlockReports retransmission delays NN startup time in large cluster.
Date Wed, 04 May 2016 20:05:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271340#comment-15271340 ]

Kihwal Lee commented on HDFS-10365:

I checked one of our mid-size clusters with 2,400 nodes and 120M blocks. During the last start-up,
the start-up safe mode lasted about 14 minutes. But it runs 2.7.2 + patches, so it might
behave a bit differently from yours. There are a few things that affect the safe mode time greatly:
- A bigger initial delay, so that reports are more widely spaced.
- A sufficiently big CMS "young gen" size to absorb the influx of large requests. So far CMS seems
to work better than G1 for big namenodes.
- Having datanodes break up full block reports by storage. This makes each FBR RPC smaller,
so the impact of timeout-retransmit can be lower.
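
To illustrate the first point, here is a rough Python sketch of how the initial delay spreads the first full block reports over time. It assumes each datanode picks a uniform random start offset within the configured delay window; the function names and the 600-second window are hypothetical, and only the 2,400-node count comes from the cluster described above. This is a toy model, not the datanode's actual scheduling code.

```python
import random
from collections import Counter


def mean_reports_per_second(num_datanodes, initial_delay_s):
    """Average first-report arrival rate if reports spread evenly over the window."""
    # With no delay, every report lands in the same second.
    return num_datanodes / max(initial_delay_s, 1)


def peak_reports_per_second(num_datanodes, initial_delay_s, seed=0):
    """Busiest one-second bucket when each node picks a random offset (assumed model)."""
    if initial_delay_s <= 0:
        return num_datanodes
    rng = random.Random(seed)
    buckets = Counter(
        int(rng.uniform(0, initial_delay_s)) for _ in range(num_datanodes)
    )
    return max(buckets.values())


if __name__ == "__main__":
    nodes = 2400  # node count from the comment above
    for delay in (0, 60, 600):  # hypothetical delay windows, in seconds
        print(delay, mean_reports_per_second(nodes, delay),
              peak_reports_per_second(nodes, delay))
```

The wider the window, the fewer reports the namenode has to absorb in any given second, which is the effect the first bullet is after.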

I think the number of handlers is too high. It makes the call queue bigger, so more things
will queue up and time out.
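
A back-of-the-envelope sketch of why a larger handler count can hurt, in Python. It assumes the call queue capacity is the handler count multiplied by a per-handler queue size (100 is used here as the assumed default), and that full block reports are effectively processed one at a time under the namesystem lock. The handler counts and the 50 ms per-report service time are hypothetical; only the 60s timeout comes from the issue description.

```python
RPC_TIMEOUT_S = 60  # datanode-side timeout quoted in the issue description


def call_queue_capacity(handler_count, queue_size_per_handler=100):
    """Assumed model: queue capacity grows linearly with the handler count."""
    return handler_count * queue_size_per_handler


def tail_wait_s(handler_count, service_time_s, queue_size_per_handler=100):
    """Rough wait for a call at the tail of a full queue.

    Assumes queued reports are served serially (one at a time under a lock),
    so extra handlers add queue depth without adding throughput.
    """
    return call_queue_capacity(handler_count, queue_size_per_handler) * service_time_s


if __name__ == "__main__":
    service_time = 0.05  # hypothetical seconds per queued call
    for handlers in (10, 64, 200):  # hypothetical handler counts
        wait = tail_wait_s(handlers, service_time)
        status = "times out" if wait > RPC_TIMEOUT_S else "ok"
        print(handlers, call_queue_capacity(handlers), wait, status)
```

Under these assumptions, a call that waits longer than the timeout is retransmitted while the stale copy is still queued, so a deeper queue mostly adds wasted work.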

> FullBlockReports retransmission delays NN startup time in large cluster.
> ------------------------------------------------------------------------
>                 Key: HDFS-10365
>                 URL: https://issues.apache.org/jira/browse/HDFS-10365
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.6.0
>         Environment: version - hadoop-2.6.0 (hdp-2.2)
> DN - 1200 nodes
>            Reporter: Chackaravarthy
>            Priority: Critical
> Whenever the NN is restarted, it takes a very long time for the NN to return to a stable state,
> i.e., the last contact time stays above 1 to 2 minutes continuously for around 3 to 4 hours. This
> is mainly because most of the DNs time out (60s) in the blockReport (FBR) RPC call and
> then keep resending the FBR.
