Message-ID: <20208062.1198891183207.JavaMail.jira@brutus>
Date: Fri, 28 Dec 2007 17:19:43 -0800 (PST)
From: "Chris Kline (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Created: (HADOOP-2500) [HBase] Unreadable region kills region servers

[HBase] Unreadable region kills region servers
----------------------------------------------

Key: HADOOP-2500
URL: https://issues.apache.org/jira/browse/HADOOP-2500
Project: Hadoop
Issue Type: Bug
Components: contrib/hbase
Environment: CentOS 5
Reporter: Chris Kline

Background: The name node
(also a DataNode and RegionServer) in our cluster ran out of disk space. I freed some space and restarted HDFS, and fsck reported corruption in an HBase file. I cleared up that corruption and restarted HBase, but I was still unable to read anything from HBase even though HDFS was now healthy. The following was gathered from the log files.

When HMaster starts up, it finds a region that is no good (Key: 17_125736271):

2007-12-24 09:07:14,342 DEBUG org.apache.hadoop.hbase.HMaster: Current assignment of spider_pages,17_125736271,1198286140018 is no good

HMaster then assigns this region to RegionServer X.60 (10.100.11.60):

2007-12-24 09:07:17,126 INFO org.apache.hadoop.hbase.HMaster: assigning region spider_pages,17_125736271,1198286140018 to server 10.100.11.60:60020
2007-12-24 09:07:20,152 DEBUG org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from 10.100.11.60:60020

The RegionServer has trouble reading that region (from the RegionServer log on X.60). Note that the worker thread exits:

2007-12-24 09:07:22,611 DEBUG org.apache.hadoop.hbase.HStore: starting spider_pages,17_125736271,1198286140018/meta (2062710340/meta with reconstruction log: (/data/hbase1/hregion_2062710340/oldlogfile.log
2007-12-24 09:07:22,620 DEBUG org.apache.hadoop.hbase.HStore: maximum sequence id for hstore spider_pages,17_125736271,1198286140018/meta (2062710340/meta) is 4549496
2007-12-24 09:07:22,622 ERROR org.apache.hadoop.hbase.HRegionServer: error opening region spider_pages,17_125736271,1198286140018
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1383)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1360)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
    at org.apache.hadoop.hbase.HStore.doReconstructionLog(HStore.java:697)
    at org.apache.hadoop.hbase.HStore.<init>(HStore.java:632)
    at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:288)
    at org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1211)
    at org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
    at java.lang.Thread.run(Thread.java:619)
2007-12-24 09:07:22,623 FATAL org.apache.hadoop.hbase.HRegionServer: Unhandled exception
java.lang.NullPointerException
    at org.apache.hadoop.hbase.HRegionServer.reportClose(HRegionServer.java:1095)
    at org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1217)
    at org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
    at java.lang.Thread.run(Thread.java:619)
2007-12-24 09:07:22,623 INFO org.apache.hadoop.hbase.HRegionServer: worker thread exiting

The HMaster then tries to assign the same region to X.60 again and fails. The HMaster next tries to assign the region to X.31, with the same result (the X.31 worker thread exits).

The file it is complaining about, /data/hbase1/hregion_2062710340/oldlogfile.log, is a zero-length file in HDFS. After I deleted that file and restarted HBase, HBase appears to be back to normal.

One thing I can't figure out is that the HMaster log shows several entries after the worker thread on X.60 has exited, suggesting that the RegionServer is still talking with the HMaster:

2007-12-24 09:08:23,349 DEBUG org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from 10.100.11.60:60020
2007-12-24 09:10:29,543 DEBUG org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from 10.100.11.60:60020

There is no corresponding entry in the RegionServer's log.
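
For anyone trying to reproduce the first exception in the trace above, here is a minimal sketch of why a zero-length log file produces an EOFException at the header read, and of one possible guard. It is an illustration only: it uses plain java.io on a local temporary file rather than the Hadoop FileSystem and SequenceFile classes, the class and method names (ReconstructionLogCheck, readHeader, hasReadableHeader) are hypothetical, and it is not taken from the HBase source or from any fix for this issue. The only format detail assumed is that a SequenceFile starts with the bytes 'S', 'E', 'Q' followed by a version byte.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReconstructionLogCheck {
    // Assumed header length: 'S', 'E', 'Q' plus one version byte.
    static final int HEADER_LEN = 4;

    /**
     * Reads the first HEADER_LEN bytes the same way the trace does,
     * via DataInputStream.readFully. On a file shorter than HEADER_LEN
     * (including a zero-length one) readFully throws EOFException.
     */
    public static byte[] readHeader(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            byte[] header = new byte[HEADER_LEN];
            in.readFully(header);
            return header;
        }
    }

    /**
     * Hypothetical guard: returns false for a missing or too-short file
     * instead of letting the header read throw, and otherwise checks
     * for the 'SEQ' magic bytes.
     */
    public static boolean hasReadableHeader(Path file) throws IOException {
        if (!Files.exists(file) || Files.size(file) < HEADER_LEN) {
            return false;
        }
        byte[] h = readHeader(file);
        return h[0] == 'S' && h[1] == 'E' && h[2] == 'Q';
    }

    public static void main(String[] args) throws IOException {
        // createTempFile produces an empty file, standing in for the
        // zero-length oldlogfile.log.
        Path empty = Files.createTempFile("oldlogfile", ".log");
        try {
            try {
                readHeader(empty);
                System.out.println("header read succeeded unexpectedly");
            } catch (EOFException e) {
                System.out.println("EOFException reading zero-length file");
            }
            System.out.println("guard reports readable: " + hasReadableHeader(empty));
        } finally {
            Files.deleteIfExists(empty);
        }
    }
}
```

A check of this kind before opening the reconstruction log would let a region server skip or report an empty log rather than fail while opening the region. Whether skipping is safe for HBase's recovery semantics is a separate question that this sketch does not address.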