Date: Thu, 6 Aug 2015 21:28:06 +0000 (UTC)
From: "Rushabh S Shah (JIRA)"
To: hdfs-dev@hadoop.apache.org
Reply-To: hdfs-dev@hadoop.apache.org
Subject: [jira] [Created] (HDFS-8869) Don't mark storages as failed before first block report

Rushabh S Shah created HDFS-8869:
------------------------------------

             Summary: Don't mark storages as failed before first block report
                 Key: HDFS-8869
                 URL: https://issues.apache.org/jira/browse/HDFS-8869
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.7.0
            Reporter: Rushabh S Shah
            Assignee: Daryn Sharp


Creating this ticket on behalf of [~daryn].

Heartbeat processing performs the failed storage check. The DN reports its storages in the heartbeat, and any previously known storages missing from that report (e.g. after a unique storage ID upgrade) are marked failed. The heartbeat monitor then removes all blocks associated with the failed storage, and a replication storm ensues for all blocks on the node.
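
To make the heartbeat sequence above concrete, here is a minimal Java sketch of the guard suggested by the summary. It is not the HDFS code and not necessarily how the issue was or will be fixed: the class FailedStorageSketch, its methods, and the guardEnabled flag are all hypothetical names, and it assumes "first block report" means the first block report received since the DN registered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of NameNode-side state for one DataNode.
// None of these names are taken from the HDFS code base.
final class FailedStorageSketch {

    /** One storage (volume) the NameNode knows about for this DataNode. */
    static final class Storage {
        final String id;
        boolean failed;

        Storage(String id) {
            this.id = id;
        }
    }

    private final Map<String, Storage> storages = new LinkedHashMap<>();
    // Assumption: "first block report" means the first one since registration.
    private boolean firstBlockReportReceived;

    /** Records a storage known from earlier, e.g. under an old storage ID. */
    void addKnownStorage(String id) {
        storages.put(id, new Storage(id));
    }

    /** Called once a block report from the DataNode has been processed. */
    void onBlockReport() {
        firstBlockReportReceived = true;
    }

    /**
     * Processes the storage IDs carried by one heartbeat and returns the IDs
     * newly marked failed. With guardEnabled, known storages missing from the
     * heartbeat are left alone until the first block report has arrived.
     */
    List<String> onHeartbeat(Set<String> reportedIds, boolean guardEnabled) {
        // Storages seen for the first time are simply added as healthy.
        for (String id : reportedIds) {
            storages.computeIfAbsent(id, Storage::new);
        }
        List<String> newlyFailed = new ArrayList<>();
        if (guardEnabled && !firstBlockReportReceived) {
            return newlyFailed; // too early to judge missing storages
        }
        for (Storage s : storages.values()) {
            if (!s.failed && !reportedIds.contains(s.id)) {
                s.failed = true;
                newlyFailed.add(s.id);
            }
        }
        return newlyFailed;
    }
}
```

With guardEnabled set to false, a heartbeat that reports only a new storage ID immediately returns the old ID as failed, which corresponds to the behaviour described above. With the guard on, the same heartbeat returns an empty list, and the old ID is marked failed only by a heartbeat that arrives after onBlockReport() has been called. In this model the blocks would by then be known under the new storage, so dropping the old one should not by itself start re-replication.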

Eventually the DN sends block reports for the new storages, up to 15 minutes later on large clusters. The NN then has many excess blocks to invalidate. If the cluster has failed over in the past 24 hours (e.g. during a rolling upgrade), the standby that became active will queue the block invalidations, which triggers the severe performance degradation described in HDFS-8674. That degradation has been greatly lessened but is still an issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)