Return-Path: X-Original-To: apmail-hbase-issues-archive@www.apache.org Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id EDCE8927D for ; Wed, 18 Apr 2012 18:45:04 +0000 (UTC) Received: (qmail 1967 invoked by uid 500); 18 Apr 2012 18:45:04 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 1934 invoked by uid 500); 18 Apr 2012 18:45:04 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 1924 invoked by uid 99); 18 Apr 2012 18:45:04 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 18 Apr 2012 18:45:04 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 18 Apr 2012 18:45:00 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id 3612E3A1172 for ; Wed, 18 Apr 2012 18:44:40 +0000 (UTC) Date: Wed, 18 Apr 2012 18:44:40 +0000 (UTC) From: "stack (Updated) (JIRA)" To: issues@hbase.apache.org Message-ID: <2136418684.1983.1334774680223.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <1206332976.35790.1334718314571.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Updated] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5816: ------------------------- Priority: Critical (was: Major) Summary: Balancer and ServerShutdownHandler concurrently reassigning the same region (was: Two concurrent assign would cause master to abort with msg "Unexpected state trying to OFFLINE; ") Changing title and upping the priority. At a minimum the balancer and shutdown handler should populate a queue that a single assignment process works on rather than as we have now, two threads doing independent assigns or balancer should not move regions that are currently in RIT or regions that were moved recently, etc. > Balancer and ServerShutdownHandler concurrently reassigning the same region > --------------------------------------------------------------------------- > > Key: HBASE-5816 > URL: https://issues.apache.org/jira/browse/HBASE-5816 > Project: HBase > Issue Type: Bug > Components: master > Affects Versions: 0.90.6 > Reporter: Maryann Xue > Priority: Critical > Attachments: HBASE-5816.patch > > > The first assign thread exits with success after updating the RegionState to PENDING_OPEN, while the second assign follows immediately into "assign" and fails the RegionState check in setOfflineInZooKeeper(). This causes the master to abort. > In the below case, the two concurrent assigns occurred when AM tried to assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler tried to assign this region (from the region plan) spontaneously. > 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253 > 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. (offlining) > 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. > 2012-04-17 05:44:57,666 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING) > 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=CLOSED, ts=1334612697672, server=hadoop05.sh.intel.com,60020,1334544902186 > 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x236b912e9b3000e Creating (or updating) unassigned node for fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state > 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253 > 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to xmlqa-clv16.sh.intel.com,60020,1334612497253 > 2012-04-17 05:54:19,159 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=PENDING_OPEN, ts=1334613179096, server=xmlqa-clv16.sh.intel.com,60020,1334612497253 > 2012-04-17 05:54:59,033 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0 > java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 remote=/10.239.47.87:60020] > at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805) > at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778) > at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283) > at $Proxy7.openRegion(Unknown Source) > at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573) > at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127) > at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912) > at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892) > at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92) > at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162) > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > Caused by: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 remote=/10.239.47.87:60020] > at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) > at java.io.FilterInputStream.read(FilterInputStream.java:116) > at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:301) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) > at java.io.BufferedInputStream.read(BufferedInputStream.java:237) > at java.io.DataInputStream.readInt(DataInputStream.java:370) > at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:541) > at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:479) > 2012-04-17 05:54:59,035 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. so generated a random one; hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=, dest=hadoop06.sh.intel.com,60020,1334544901894; 7 (online=7, exclude=serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)) available servers > 2012-04-17 05:54:59,035 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x236b912e9b3000e Creating (or updating) unassigned node for fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state > 2012-04-17 05:54:59,045 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=, dest=hadoop06.sh.intel.com,60020,1334544901894 > 2012-04-17 05:54:59,045 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to hadoop06.sh.intel.com,60020,1334544901894 > 2012-04-17 05:54:59,046 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=PENDING_OPEN, ts=1334613299045, server=hadoop06.sh.intel.com,60020,1334544901894 > java.lang.IllegalStateException > at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1167) > at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1107) > at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912) > at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892) > at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:259) > at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162) > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > 2012-04-17 05:54:59,047 INFO org.apache.hadoop.hbase.master.HMaster: Aborting -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira