Return-Path: Delivered-To: apmail-hadoop-core-dev-archive@www.apache.org Received: (qmail 42855 invoked from network); 24 Feb 2009 19:51:29 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 24 Feb 2009 19:51:29 -0000 Received: (qmail 15208 invoked by uid 500); 24 Feb 2009 19:51:24 -0000 Delivered-To: apmail-hadoop-core-dev-archive@hadoop.apache.org Received: (qmail 15168 invoked by uid 500); 24 Feb 2009 19:51:24 -0000 Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: core-dev@hadoop.apache.org Delivered-To: mailing list core-dev@hadoop.apache.org Received: (qmail 14916 invoked by uid 99); 24 Feb 2009 19:51:23 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 24 Feb 2009 11:51:23 -0800 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 24 Feb 2009 19:51:22 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 0ACAB234C4B0 for ; Tue, 24 Feb 2009 11:51:02 -0800 (PST) Message-ID: <1695905738.1235505062043.JavaMail.jira@brutus> Date: Tue, 24 Feb 2009 11:51:02 -0800 (PST) From: "Raghu Angadi (JIRA)" To: core-dev@hadoop.apache.org Subject: [jira] Commented: (HADOOP-4584) Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode In-Reply-To: <178281774.1225759904336.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676384#action_12676384 ] Raghu Angadi commented on HADOOP-4584: -------------------------------------- I don't think read/write helps much (since we can't make writes wait for that long) or necessary for this. It is not very complicated to scan without locking.. DN just needs to keep track of changes (additions and deletions) that happen from the time scan starts to the time it ends and appy those changes at the end. That way scan can proceed without locking the storage. Another way to achieve the same to keep a modification timestamp with each block and at the end of the scan, if a in memory block has a timestamp older than start of the scan but not found in scan, then it should be deleted.. etc. None the less it not as easy as not scanning :). I think Suresh had more elaborate description of the above with examples. > Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode > ---------------------------------------------------------------------------------------- > > Key: HADOOP-4584 > URL: https://issues.apache.org/jira/browse/HADOOP-4584 > Project: Hadoop Core > Issue Type: Bug > Components: dfs > Reporter: Hairong Kuang > Assignee: Suresh Srinivas > Fix For: 0.20.0 > > Attachments: 4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch > > > sometimes due to disk or some other problems, datanode takes minutes or tens of minutes to generate a block report. It causes the datanode not able to send heartbeat to NameNode every 3 seconds. In the worst case, it makes NameNode to detect a lost heartbeat and wrongly decide that the datanode is dead. > It would be nice to have two threads instead. One thread is for scanning data directories and generating block report, and executes the requests sent by NameNode; Another thread is for sending heartbeats, block reports, and picking up the requests from NameNode. By having these two threads, the sending of heartbeats will not get delayed by any slow block report or slow execution of NameNode requests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.