From: "Jim Kellerman (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Thu, 8 Nov 2007 09:47:50 -0800 (PST)
Message-ID: <162001.1194544070752.JavaMail.jira@brutus>
Subject: [jira] Created: (HADOOP-2173) [hbase] When the master times out a region servers lease, the region server may not restart

[hbase] When the master times out a region servers lease, the region server may not restart
-------------------------------------------------------------------------------------------

Key: HADOOP-2173
URL: https://issues.apache.org/jira/browse/HADOOP-2173
Project: Hadoop
Issue Type: Bug
Components: contrib/hbase
Reporter: Jim Kellerman

Hadoop-Nightly 297 failed because:

* The region server's lease expired. (Why? Was the heartbeat thread starved?)
* The region server gets a call startup message.
* The master splits the region server's log and deletes it.

I think that when the region server called log.closeAndDelete(), it got an exception because the file no longer existed. At that point it said "error restarting server" and quit. From there on, the master just loops because there is no region server to talk to.

We should probably just log an error for log.closeAndDelete() and proceed with the region server restart.

Also, for that test, we should probably increase the lease timeout and make the lease timeout check run correspondingly less often.
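
A minimal sketch of the "log an error and proceed" idea, not the actual HBase code. RegionLog, RestartSketch, and cleanUpAndRestart are hypothetical names made up for illustration; the only thing taken from the report is that log.closeAndDelete() can throw when the master has already deleted the file.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for the region server's log; not the real HBase class. */
interface RegionLog {
    void closeAndDelete() throws IOException;
}

/** Sketch of a restart path that tolerates a failed log cleanup. */
final class RestartSketch {
    private static final Logger LOG = Logger.getLogger(RestartSketch.class.getName());

    private RestartSketch() {}

    /**
     * Tries to close and delete the old log, then always runs the restart step.
     * Returns true if the cleanup succeeded, false if it failed and was only logged.
     */
    static boolean cleanUpAndRestart(RegionLog log, Runnable restartStep) {
        boolean cleanedUp = true;
        try {
            log.closeAndDelete();
        } catch (IOException e) {
            // Assumption: the master may already have split and deleted this log,
            // so a failure here is logged and treated as non-fatal.
            LOG.log(Level.SEVERE, "closeAndDelete failed; continuing with restart", e);
            cleanedUp = false;
        }
        // The restart step runs whether or not the cleanup succeeded.
        restartStep.run();
        return cleanedUp;
    }
}
```

With this shape, a missing log file produces one error in the region server's log instead of ending the restart, so the master still has a region server to talk to afterwards. Whether other IOException causes should remain fatal is left open here.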