hadoop-hdfs-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC
Date Fri, 25 Sep 2015 23:36:06 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908887#comment-14908887 ]

Hudson commented on HDFS-9107:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #8521 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8521/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev 878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Prevent NN's unrecoverable death spiral after full GC
> -----------------------------------------------------
>
>                 Key: HDFS-9107
>                 URL: https://issues.apache.org/jira/browse/HDFS-9107
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>             Fix For: 2.8.0
>
>         Attachments: HDFS-9107.patch, HDFS-9107.patch
>
>
> A full GC pause in the NN that exceeds the dead node interval can lead to an infinite
> cycle of full GCs.  The most common situation that precipitates an unrecoverable state is
> a network issue that temporarily cuts off multiple racks.
> The NN wakes up and falsely starts marking nodes dead. This bloats the replication queues,
> which increases memory pressure. The replications create a flurry of incremental block reports
> and a glut of over-replicated blocks.
> The "dead" nodes heartbeat within seconds. The NN forces a re-registration, which requires
> a full block report, adding more memory pressure. The NN now has to invalidate all the
> over-replicated blocks. The extra blocks are added to invalidation queues, tracked in an
> excess blocks map, and so on, adding much more memory pressure.
> All the memory pressure can push the NN into another full GC, which repeats the entire
> cycle.
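
The failure mode described above starts when the heartbeat check runs right after a long pause and trusts timestamps that went stale only because the NN itself was stopped. One way to break the cycle, shown here as a minimal sketch and not as the code from the attached patch, is to have the check notice that it woke up far later than scheduled and skip that round, so that datanodes get a chance to heartbeat before anything is declared dead. The class name, method names, and the "twice the recheck interval" threshold are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the HeartbeatManager code changed by HDFS-9107. */
final class PauseAwareDeadNodeCheck {
    private final long recheckIntervalMs;   // how often the check is expected to run
    private final long deadNodeIntervalMs;  // silence after which a node counts as dead
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();
    private long lastCheckMs;

    PauseAwareDeadNodeCheck(long recheckIntervalMs, long deadNodeIntervalMs, long nowMs) {
        this.recheckIntervalMs = recheckIntervalMs;
        this.deadNodeIntervalMs = deadNodeIntervalMs;
        this.lastCheckMs = nowMs;
    }

    /** Records a heartbeat from the given node at the given time. */
    void heartbeat(String node, long nowMs) {
        lastHeartbeatMs.put(node, nowMs);
    }

    /** Returns the nodes to declare dead, or an empty list if the checker itself stalled. */
    List<String> check(long nowMs) {
        long sinceLastCheck = nowMs - lastCheckMs;
        lastCheckMs = nowMs;

        // Assumed threshold: if this check ran much later than scheduled, the
        // process was probably paused (for example by a full GC). Heartbeats
        // could not be processed during the pause, so skip this round.
        if (sinceLastCheck > 2 * recheckIntervalMs) {
            return new ArrayList<>();
        }

        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > deadNodeIntervalMs) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```

With this guard, a check that runs 55 seconds late on a 5 second schedule marks nothing dead, while a node that stays silent across normally spaced checks is still detected once the dead node interval has elapsed.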



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
