Return-Path: X-Original-To: apmail-hbase-issues-archive@www.apache.org Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1FEA6177EE for ; Sat, 4 Oct 2014 00:12:35 +0000 (UTC) Received: (qmail 10646 invoked by uid 500); 4 Oct 2014 00:12:35 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 10606 invoked by uid 500); 4 Oct 2014 00:12:34 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 10594 invoked by uid 99); 4 Oct 2014 00:12:34 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 04 Oct 2014 00:12:34 +0000 Date: Sat, 4 Oct 2014 00:12:34 +0000 (UTC) From: "stack (JIRA)" To: issues@hbase.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158762#comment-14158762 ] stack commented on HBASE-12166: ------------------------------- +1 > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --------------------------------------------------------------- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test, wal > Reporter: stack > Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, hbase-12166_v2.patch, log.txt > > > See https://builds.apache.org/job/PreCommit-HBASE-Build/11204//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testMasterStartsUpWithLogReplayWork/ > The namespace region gets stuck. It is never 'recovered' even though we have finished log splitting. Here is the main exception: > {code} > 4941 2014-10-03 02:00:36,862 DEBUG [B.defaultRpcServer.handler=1,queue=0,port=37113] ipc.CallRunner(111): B.defaultRpcServer.handler=1,queue=0,port=37113: callId: 211 service: ClientService methodName: Get > size: 99 connection: 67.195.81.144:44526 > 4942 org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: hbase:namespace,,1412301462277.eba5d23de65f2718715eeb22edf7edc2. is recovering > 4943 at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:6058) > 4944 at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2086) > 4945 at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2072) > 4946 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:5014) > 4947 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4988) > 4948 at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1690) > 4949 at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30418) > 4950 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > 4951 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > 4952 at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > 4953 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > 4954 at java.lang.Thread.run(Thread.java:744) > {code} > See how we've finished log splitting long time previous: > {code} > 2014-10-03 01:57:48,129 INFO [M_LOG_REPLAY_OPS-asf900:37113-1] master.SplitLogManager(294): finished splitting (more than or equal to) 197337 bytes in 1 log files in [hdfs://localhost:49601/user/jenkins/hbase/WALs/asf900.gq1.ygridcore.net,40732,1412301461887-splitting] in 379ms > {code} > If I grep for the deleting of znodes on recovery, which is when we set the recovering flag to false, I see a bunch of regions but not my namespace one: > 2014-10-03 01:57:47,330 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/1588230740 znode deleted. Region: 1588230740 completes recovery. > 2014-10-03 01:57:48,119 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/adfdcf958dd958f0e2ce59072ce2209d znode deleted. Region: adfdcf958dd958f0e2ce59072ce2209d completes recovery. > 2014-10-03 01:57:48,121 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/41d438848305831b61d708a406d5ecde znode deleted. Region: 41d438848305831b61d708a406d5ecde completes recovery. > 2014-10-03 01:57:48,122 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/6a7cada80de2ae5d774fe8cd33bd4cda znode deleted. Region: 6a7cada80de2ae5d774fe8cd33bd4cda completes recovery. > 2014-10-03 01:57:48,124 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/65451bd5b38bd16a31e25b62b3305533 znode deleted. Region: 65451bd5b38bd16a31e25b62b3305533 completes recovery. > 2014-10-03 01:57:48,125 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/07afdc3748894cf2b56e0075272a95a0 znode deleted. Region: 07afdc3748894cf2b56e0075272a95a0 completes recovery. > 2014-10-03 01:57:48,126 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/a4337ad2874ee7e599ca2344fce21583 znode deleted. Region: a4337ad2874ee7e599ca2344fce21583 completes recovery. > 2014-10-03 01:57:48,128 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/9d91d6eafe260ce33e8d7d23ccd13192 znode deleted. Region: 9d91d6eafe260ce33e8d7d23ccd13192 completes recovery. > This would seem to indicate that we successfully wrote zk that we are recovering: > {code} > 2014-10-03 01:57:47,672 DEBUG [MASTER_SERVER_OPERATIONS-asf900:37113-0] coordination.ZKSplitLogManagerCoordination(647): Mark region eba5d23de65f2718715eeb22edf7edc2 recovering from failed region server asf900.gq1.ygridcore.net,42071,1412301461790 > {code} > As part of the open of namespace, we updated our last flushed id successfully: > {code} > 2ae5d774fe8cd33bd4cda/family > 2014-10-03 01:57:47,698 DEBUG [Thread-9216-EventThread] zookeeper.ZooKeeperWatcher(304): master:37113-0x148d3b7d84b020f, quorum=localhost:50702, baseZNode=/hbase Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/recovering-regions/eba5d23de65f2718715eeb22edf7edc2 > 2014-10-03 01:57:47,699 DEBUG [PostOpenDeployTasks:eba5d23de65f2718715eeb22edf7edc2] regionserver.HRegionServer(2947): Update last flushed sequence id of region eba5d23de65f2718715eeb22edf7edc2 for asf900.gq1.ygridcore.net,42071,1412301461790 > {code} > Not sure why this node is up in zk. > Let me dig again later. -- This message was sent by Atlassian JIRA (v6.3.4#6332)