Return-Path: X-Original-To: apmail-hbase-issues-archive@www.apache.org Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D9785D390 for ; Fri, 21 Sep 2012 18:55:09 +0000 (UTC) Received: (qmail 76037 invoked by uid 500); 21 Sep 2012 18:55:09 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 76002 invoked by uid 500); 21 Sep 2012 18:55:09 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 75990 invoked by uid 99); 21 Sep 2012 18:55:09 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 21 Sep 2012 18:55:09 +0000 Date: Sat, 22 Sep 2012 05:55:09 +1100 (NCT) From: "stack (JIRA)" To: issues@hbase.apache.org Message-ID: <398093136.108867.1348253709597.JavaMail.jiratomcat@arcas> In-Reply-To: <2046828395.76225.1341119324963.JavaMail.jiratomcat@issues-vm> Subject: [jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-6299: ------------------------- Attachment: 6299v4.txt Retry > RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. > ----------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: HBASE-6299 > URL: https://issues.apache.org/jira/browse/HBASE-6299 > Project: HBase > Issue Type: Bug > Components: master > Affects Versions: 0.90.6, 0.94.0 > Reporter: Maryann Xue > Assignee: Maryann Xue > Priority: Critical > Fix For: 0.92.3, 0.94.3, 0.96.0 > > Attachments: 6299v4.txt, 6299v4.txt, HBASE-6299.patch, HBASE-6299-v2.patch, HBASE-6299-v3.patch > > > 1. HMaster tries to assign a region to an RS. > 2. HMaster creates a RegionState for this region and puts it into regionsInTransition. > 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts to proceed, with success eventually. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out. > 4. HMaster attemps to assign for a second time, choosing another RS. > 5. But since the HMaster's OpenedRegionHandler has been triggered by the region open of the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster finds invalid and ignores the unassigned ZK node "RS_ZK_REGION_OPENING" updated by the second attempt. > 6. The unassigned ZK node stays and a later unassign fails coz RS_ZK_REGION_CLOSING cannot be created. > {code} > 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 > 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 > 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:60000, region=b713fd655fa02395496c5a6e39ddf568 > 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 > 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 > 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 > 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node > 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED > 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED > 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301) > 2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0 > java.net.SocketTimeoutException: Call to /172.16.0.6:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.16.0.2:53765 remote=/172.16.0.6:60020] > at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805) > at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778) > at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283) > at $Proxy8.openRegion(Unknown Source) > at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573) > at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127) > at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912) > at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892) > at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92) > at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162) > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > Caused by: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.16.0.2:53765 remote=/172.16.0.6:60020] > at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) > at java.io.FilterInputStream.read(FilterInputStream.java:116) > at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:301) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) > at java.io.BufferedInputStream.read(BufferedInputStream.java:237) > at java.io.DataInputStream.readInt(DataInputStream.java:370) > at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:541) > at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:479) > 2012-06-29 07:07:41,142 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. so generated a random one; hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=, dest=swbss-hadoop-164,60020,1340888346294; 15 (online=15, exclude=serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0, maxHeap=0)) available servers > 2012-06-29 07:07:41,142 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2377fee2ae80007 Creating (or updating) unassigned node for b713fd655fa02395496c5a6e39ddf568 with OFFLINE state > 2012-06-29 07:07:41,145 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=, dest=swbss-hadoop-164,60020,1340888346294 > 2012-06-29 07:07:41,145 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-164,60020,1340888346294 > 2012-06-29 07:07:41,149 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568 > 2012-06-29 07:07:41,150 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294 but region was in the state null and not in expected PENDING_OPEN or OPENING states > 2012-06-29 07:07:41,296 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568 > 2012-06-29 07:07:41,296 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294 but region was in the state null and not in expected PENDING_OPEN or OPENING states > 2012-06-29 07:07:41,302 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568 > 2012-06-29 07:07:41,302 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294 but region was in the state null and not in expected PENDING_OPEN or OPENING states > 2012-06-29 07:08:38,872 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-006,60020,1340890678078, dest=swbss-hadoop-008,60020,1340891085175 > 2012-06-29 07:08:38,872 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. (offlining) > 2012-06-29 07:08:47,875 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. > ... > 2012-06-29 08:04:37,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. state=PENDING_CLOSE, ts=1340926468331, server=null > 2012-06-29 08:04:37,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. > 2012-06-29 08:04:47,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. state=PENDING_CLOSE, ts=1340926468331, server=null > 2012-06-29 08:04:47,682 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira