Message-ID: <9429205.558081279920530759.JavaMail.jira@thor>
Date: Fri, 23 Jul 2010 17:28:50 -0400 (EDT)
From: "HBase Review Board (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] Commented: (HBASE-2866) Region permanently offlined
In-Reply-To: <14173816.525241279830110891.JavaMail.jira@thor>
Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm

    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891806#action_12891806 ]

HBase Review Board commented on HBASE-2866:
-------------------------------------------

Message from: "Karthik Ranganathan"

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/
-----------------------------------------------------------

(Updated 2010-07-23 14:26:01.718168)


Review request for hbase, stack and Kannan Muthukkaruppan.


Changes
-------

Adding hbase group


Summary
-------

Region permanently offlined - if the ZNode is already in the target state, do not update it again.


This addresses bug HBASE-2866.
    http://issues.apache.org/jira/browse/HBASE-2866


Diffs
-----

  trunk/src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java 967128
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java 967128

Diff: http://review.hbase.org/r/380/diff


Testing
-------

Ran unit tests, went through fine (except TestRowAtomicity, which is known to be failing).


Thanks,

Karthik


> Region permanently offlined
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787\
> 158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: .com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem
> region is its splitB). For the next one, note that there are no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
> ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the version mismatch, and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server but is listed in META to be on server null
> {code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
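
Illustrative sketch (not part of the original message, and not the patch under review): the review summary says a ZNode that is already in the target state should not be updated again. One plausible reading of the logs above is that a redundant master-side write to the UNASSIGNED node bumped its version, so the region server's conditional write with expected version 0 failed with BadVersion. The toy model below replays that interleaving in memory. Every class and method name in it (UnassignedNodeSketch, VersionedNode, masterUpdate, regionServerOpenSucceeds) is hypothetical; it does not use the ZooKeeper or HBase APIs, and only mimics the ZooKeeper rule that each successful write increments the node version and that an expected version of -1 matches any version.

```java
/** Toy in-memory stand-in for a versioned znode; NOT the HBase or ZooKeeper API. */
class UnassignedNodeSketch {

    /** Thrown when the caller's expected version is stale (mimics BadVersion). */
    static class BadVersionException extends Exception {
        BadVersionException(String msg) { super(msg); }
    }

    /** A single node holding a state string and a version counter. */
    static class VersionedNode {
        private String state;
        private int version = 0;

        VersionedNode(String initialState) { this.state = initialState; }

        synchronized String getState() { return state; }
        synchronized int getVersion() { return version; }

        /** Conditional write; -1 means "any version". Every successful write bumps the version. */
        synchronized void setData(String newState, int expectedVersion) throws BadVersionException {
            if (expectedVersion != -1 && expectedVersion != version) {
                throw new BadVersionException("expected " + expectedVersion + " but was " + version);
            }
            state = newState;
            version++;
        }
    }

    /** Master-side update. With skipIfSame, a node already in the target state is left untouched. */
    static void masterUpdate(VersionedNode node, String target, boolean skipIfSame)
            throws BadVersionException {
        if (skipIfSame && target.equals(node.getState())) {
            return; // no write, so the version does not change
        }
        node.setData(target, -1);
    }

    /** Replays the interleaving; returns true if the region server's conditional write succeeds. */
    static boolean regionServerOpenSucceeds(boolean skipIfSame) {
        VersionedNode node = new VersionedNode("M2ZK_REGION_OFFLINE"); // created at version 0
        int expected = node.getVersion(); // region server plans to write with expected version 0
        try {
            masterUpdate(node, "M2ZK_REGION_OFFLINE", skipIfSame); // redundant master update
            node.setData("RS2ZK_REGION_OPENING", expected);        // region server's write
            return true;
        } catch (BadVersionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("without guard: " + regionServerOpenSucceeds(false));
        System.out.println("with guard:    " + regionServerOpenSucceeds(true));
    }
}
```

In this model the region server's write fails when the master rewrites an unchanged state and succeeds when the redundant write is skipped. Whether the real race has exactly this shape is an assumption drawn from the log timestamps, not something the message confirms.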