Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 30 Apr 2015 08:23:06 +0000 (UTC)
From: "Matteo Bertozzi (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12781449.1426153151000.32656.1430382186657@Atlassian.JIRA>
In-Reply-To: <JIRA.12781449.1426153151000@Atlassian.JIRA>
References: <JIRA.12781449.1426153151000@Atlassian.JIRA>
 <JIRA.12781449.1426153151297@arcas>
Subject: [jira] [Commented] (HBASE-13217) Flush procedure fails in trunk due
 to ZK issue
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-13217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521118#comment-14521118 ] 

Matteo Bertozzi commented on HBASE-13217:
-----------------------------------------

{quote}Here is what the znode looks like when MASTER thinks the procedure is completed (obviously, myregionserver-5 is missing under 'reached' znode){quote}
My first guess is that when the procedure started, regionserver-5 was in transition and not in the list. There should be a list of "online regionservers" that the procedure uses to set up the latch that decides when everyone has completed. That procedure code doesn't tolerate a region in transition, so that may be the problem you are seeing. (But it may be something else; this is just my first guess, without looking at the code.)
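
To make the guess above concrete, here is a minimal sketch of the failure mode. It is not the HBase procedure code. The Coordinator class, its method names, and the rs-1..rs-5 server names are all hypothetical. It only shows that a latch sized from a snapshot of the online servers can reach zero while a server that was missing from the snapshot never shows up under 'reached'.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

public class LatchSnapshotSketch {

    /** Hypothetical coordinator: sizes its latch from the member list seen at start. */
    static final class Coordinator {
        private final Set<String> expected;
        private final Set<String> reached = new LinkedHashSet<>();
        private final CountDownLatch latch;

        Coordinator(List<String> onlineAtStart) {
            // Snapshot of "online regionservers" taken once, when the procedure starts.
            this.expected = new LinkedHashSet<>(onlineAtStart);
            this.latch = new CountDownLatch(expected.size());
        }

        /** Records that a member finished; servers outside the snapshot are ignored. */
        synchronized void memberReached(String server) {
            if (expected.contains(server) && reached.add(server)) {
                latch.countDown();
            }
        }

        /** True once every member of the snapshot has reported in. */
        boolean isCompleted() {
            return latch.getCount() == 0;
        }

        synchronized Set<String> reachedMembers() {
            return new LinkedHashSet<>(reached);
        }
    }

    public static void main(String[] args) {
        // Assumption: rs-5 was in transition at start, so it is absent from the snapshot.
        Coordinator c = new Coordinator(Arrays.asList("rs-1", "rs-2", "rs-3", "rs-4"));
        for (String rs : Arrays.asList("rs-1", "rs-2", "rs-3", "rs-4")) {
            c.memberReached(rs);
        }
        System.out.println("completed=" + c.isCompleted());
        System.out.println("reached=" + c.reachedMembers());
    }
}
```

In this sketch the coordinator reports completion without rs-5, which would match the 'reached' znode listing quoted above if the guess is right.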

> Flush procedure fails in trunk due to ZK issue
> ----------------------------------------------
>
>                 Key: HBASE-13217
>                 URL: https://issues.apache.org/jira/browse/HBASE-13217
>             Project: HBase
>          Issue Type: Bug
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: Stephen Yuan Jiang
>
> Whenever I try to flush explicitly in the trunk code, the flush procedure fails due to a ZK issue:
> {code}
> ERROR: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via stobdtserver3,16040,1426172670959:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/flush-table-proc/acquired/TestTable/stobdtserver3,16040,1426172670959
>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
>         at org.apache.hadoop.hbase.procedure.Procedure.isCompleted(Procedure.java:368)
>         at org.apache.hadoop.hbase.procedure.flush.MasterFlushTableProcedureManager.isProcedureDone(MasterFlushTableProcedureManager.java:196)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.isProcedureDone(MasterRpcServices.java:905)
>         at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:47019)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2073)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/flush-table-proc/acquired/TestTable/stobdtserver3,16040,1426172670959
>         at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:273)
>         at org.apache.hadoop.hbase.procedure.ProcedureMember.controllerConnectionFailure(ProcedureMember.java:225)
>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberAcquired(ZKProcedureMemberRpcs.java:254)
>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:166)
>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         ... 1 more
> {code}
> Once this occurs, the RS becomes unusable even after it is restarted.  I have verified that ZK remains intact and there is no problem with it.  A slightly older version of trunk (3 months) does not have this problem.

