Return-Path: X-Original-To: apmail-hbase-issues-archive@www.apache.org Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3E7809096 for ; Wed, 22 Feb 2012 17:48:10 +0000 (UTC) Received: (qmail 3580 invoked by uid 500); 22 Feb 2012 17:48:10 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 3544 invoked by uid 500); 22 Feb 2012 17:48:10 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 3536 invoked by uid 99); 22 Feb 2012 17:48:10 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 22 Feb 2012 17:48:10 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 22 Feb 2012 17:48:08 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id BB3A53357FF for ; Wed, 22 Feb 2012 17:47:48 +0000 (UTC) Date: Wed, 22 Feb 2012 17:47:48 +0000 (UTC) From: "stack (Updated) (JIRA)" To: issues@hbase.apache.org Message-ID: <1924332083.5061.1329932868768.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <464539320.49497.1329449640306.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5422: ------------------------- Resolution: Fixed Fix Version/s: 0.92.1 Assignee: chunhui shen Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed to 0.92 branch and to trunk. Thanks for the patch Chunhui. > StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins) > ------------------------------------------------------------------------------------------------------------------ > > Key: HBASE-5422 > URL: https://issues.apache.org/jira/browse/HBASE-5422 > Project: HBase > Issue Type: Bug > Components: master > Reporter: chunhui shen > Assignee: chunhui shen > Fix For: 0.92.1 > > Attachments: 5422-90.patch, 5422-90v2.patch, 5422-90v3.patch, 5422-v3.txt, hbase-5422.patch, hbase-5422v2.patch, hbase-5422v3.patch > > > In our produce environment > We find a lot of timeout on RIT when cluster up, there are about 7w regions in the cluster( 25 regionservers ). > First, we could see the following log:(See the region 33cf229845b1009aa8a3f7b0f85c9bd0) > master's log > 2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x348f4a94723da5 Async create of unassigned node for 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state > 2012-02-13 18:07:42,560 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=OFFLINE, ts=1329127661409, server=r03f11025.yh.aliyun.com,60020,1329127549907 > 2012-02-13 18:07:42,996 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=OFFLINE, ts=1329127661409 > 2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=PENDING_OPEN, ts=1329127662996 > 2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. > 2012-02-13 18:11:16,744 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=r03f11025.yh.aliyun.com,60020,1329127549907, region=33cf229845b1009aa8a3f7b0f85c9bd0 > 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node > 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x348f4a94723da5 Deleting existing unassigned node for 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state RS_ZK_REGION_OPENED > 2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x348f4a94723da5 Successfully deleted unassigned node for region 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED > 2012-02-13 18:38:07,573 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on r03f11025.yh.aliyun.com,60020,1329127549907 > 2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so generated a random one; hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, exclude=null) available servers > 2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to r01b05043.yh.aliyun.com,60020,1329127549041 > 2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=PENDING_OPEN, ts=1329132528086 > 2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. > Regionserver's log > 2012-02-13 18:07:43,537 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. > 2012-02-13 18:11:16,560 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. > Through the RS's log, we could find it is larger than 3mins from receive openRegion request to start processing openRegion, causing timeout on RIT in master for the region. > Let's see the code of StartupBulkAssigner, we could find regionPlans are not added when assigning regions, therefore, when one region opened, it will not updateTimers of other regions whose destination is the same. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira