Date: Fri, 19 Apr 2013 22:35:15 +0000 (UTC)
From: "Varun Sharma (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-3703) Decrease the datanode failure detection time

    [ https://issues.apache.org/jira/browse/HDFS-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636959#comment-13636959 ]

Varun Sharma commented on HDFS-3703:
------------------------------------

I am actually seeing an interesting race condition for under-recovery blocks. A block is being written to and the datanode serving it is lost (no route to it). The stale timeout is 20 seconds.

a) The lost DN holds a lease which expires in 60 seconds.
b) The block is moved to the UNDER_RECOVERY state, since it was still being written to via the append API.
c) The recovery chooses a primary datanode, which happens to be the stale datanode, and issues a recover-block command for this partially written block.
d) This does not succeed, so other DN(s) are tried. During recovery, the primary DN's job is to reconcile the replicas on all 3 datanodes. Since the stale check does not seem to kick in here, this primary DN also tries to reconcile the lost DN and times out after 15 minutes (20 seconds x 45 retries). The same thing happens over and over again, and the recovery fails. The block seems to be lost, IMHO.

In the meantime, a client tries to recover the lease on this file+block. It gets the lease after the 60-second expiration and tries to read the block, but is eventually redirected to the bad datanode, even though this is > 20 seconds after the failure.

Would it be nice to fix this issue so that the primary DN is never the stale DN, and also so that reconciliation does not involve the bad DN, since the recovery never truly happens?
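To make the suggestion concrete, here is a rough sketch (hypothetical class and method names, not actual HDFS code) of how primary selection for block recovery could skip replicas whose datanode is already past the stale interval:

    import java.util.Arrays;
    import java.util.List;

    // Rough sketch, not HDFS source: choose a recovery primary while skipping
    // replicas whose datanode has not heartbeated within the stale interval.
    class RecoveryPrimaryChooser {

        static final long STALE_INTERVAL_MS = 20000L; // the 20 s stale timeout used above

        /** Minimal stand-in for the namenode's view of a replica's datanode. */
        static class Replica {
            final String datanodeId;
            final long lastHeartbeatMs; // last heartbeat the namenode saw from this DN
            Replica(String datanodeId, long lastHeartbeatMs) {
                this.datanodeId = datanodeId;
                this.lastHeartbeatMs = lastHeartbeatMs;
            }
            boolean isStale(long nowMs) {
                return nowMs - lastHeartbeatMs > STALE_INTERVAL_MS;
            }
        }

        /** Pick the most recently heartbeating, non-stale replica as the recovery primary. */
        static Replica choosePrimary(List<Replica> replicas, long nowMs) {
            Replica best = null;
            for (Replica r : replicas) {
                if (r.isStale(nowMs)) {
                    continue; // never hand recovery to the stale/unreachable DN
                }
                if (best == null || r.lastHeartbeatMs > best.lastHeartbeatMs) {
                    best = r;
                }
            }
            return best; // null means every replica is stale; the caller should retry later
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            List<Replica> replicas = Arrays.asList(
                new Replica("dn1", now - 5000),   // healthy
                new Replica("dn2", now - 120000), // lost DN, well past the 20 s stale timeout
                new Replica("dn3", now - 2000));  // healthy
            System.out.println(choosePrimary(replicas, now).datanodeId); // prints dn3
        }
    }

The same staleness check would also have to be applied when deciding which replicas the primary tries to reconcile, otherwise the recovery still blocks on the unreachable DN.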
Thanks, Varun

> Decrease the datanode failure detection time
> --------------------------------------------
>
>                 Key: HDFS-3703
>                 URL: https://issues.apache.org/jira/browse/HDFS-3703
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 1.0.3, 2.0.0-alpha, 3.0.0
>            Reporter: Nicolas Liochon
>            Assignee: Jing Zhao
>             Fix For: 1.1.0, 2.0.3-alpha
>
>         Attachments: 3703-hadoop-1.0.txt, HDFS-3703-branch-1.1-read-only.patch, HDFS-3703-branch-1.1-read-only.patch, HDFS-3703-branch2.patch, HDFS-3703.patch, HDFS-3703-trunk-read-only.patch, HDFS-3703-trunk-read-only.patch, HDFS-3703-trunk-read-only.patch, HDFS-3703-trunk-read-only.patch, HDFS-3703-trunk-read-only.patch, HDFS-3703-trunk-read-only.patch, HDFS-3703-trunk-read-only.patch, HDFS-3703-trunk-with-write.patch
>
>
> By default, if a box dies, its datanode will be marked as dead by the namenode only after 10:30 minutes. In the meantime, this datanode will still be proposed by the namenode for writing blocks or reading replicas. The same happens if the datanode crashes: there is no shutdown hook to tell the namenode we're not there anymore.
> It is especially an issue with HBase. The HBase regionserver timeout for production is often 30s. So with these configs, when a box dies HBase starts to recover after 30s while, for 10 minutes, the namenode still considers the blocks on the dead box as available. Beyond the write errors, this triggers a lot of missed reads:
> - during the recovery, HBase needs to read the blocks used on the dead box (the ones in the 'HBase Write-Ahead-Log')
> - after the recovery, reading these data blocks (the 'HBase region') will fail 33% of the time with the default number of replicas, slowing the data access, especially when the errors are socket timeouts (i.e. around 60s most of the time).
> Globally, it would be ideal if the HDFS detection settings could be kept under the HBase settings.
> As a side note, HBase relies on ZooKeeper to detect regionserver issues.
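For reference, a minimal sketch (plain arithmetic, not HDFS source) of where the 10:30 dead-node figure in the description comes from, assuming the default heartbeat settings (dfs.heartbeat.interval = 3 s, dfs.namenode.heartbeat.recheck-interval = 5 min). The stale-datanode interval this issue introduces (dfs.namenode.stale.datanode.interval, 30 s by default) is what lets the namenode steer reads away from such a node long before it is declared dead.

    // Minimal sketch, not HDFS source: where the 10:30 dead-node figure comes from,
    // assuming the default heartbeat settings named above.
    public class DeadNodeIntervalExample {
        public static void main(String[] args) {
            long heartbeatRecheckIntervalMs = 5 * 60 * 1000; // 5 minutes
            long heartbeatIntervalMs = 3 * 1000;             // 3 seconds
            // The namenode declares a datanode dead only after roughly
            // 2 * recheck-interval + 10 * heartbeat-interval without a heartbeat.
            long heartbeatExpireMs = 2 * heartbeatRecheckIntervalMs + 10 * heartbeatIntervalMs;
            System.out.println(heartbeatExpireMs / 60000.0 + " minutes"); // 10.5 minutes
        }
    }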