Date: Tue, 16 Jul 2013 17:46:50 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-8919) TestReplicationQueueFailover (and Compressed) can fail because the recovered queue gets stuck on ClosedByInterruptException

    [ https://issues.apache.org/jira/browse/HBASE-8919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709982#comment-13709982 ]

stack commented on HBASE-8919:
------------------------------

New one: http://54.241.6.143/job/HBase-0.95-Hadoop-2/635/org.apache.hbase$hbase-server/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/

> TestReplicationQueueFailover (and Compressed) can fail because the recovered queue gets stuck on ClosedByInterruptException
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8919
>                 URL: https://issues.apache.org/jira/browse/HBASE-8919
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>         Attachments: HBASE-8919.patch
>
>
> Looking at this build: https://builds.apache.org/job/hbase-0.95-on-hadoop2/173/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/
> The only thing I can find that went wrong is that the recovered queue was not completely done because the source fails like this:
> {noformat}
> 2013-07-10 11:53:51,538 INFO [Thread-1259] regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to hemera.apache.org/140.211.11.27:38614 failed on local exception: java.nio.channels.ClosedByInterruptException
> {noformat}
> And just before that it got:
> {noformat}
> 2013-07-10 11:53:51,290 WARN [ReplicationExecutor-0.replicationSource,2-hemera.apache.org,43669,1373457208379] regionserver.ReplicationSource(661): Can't replicate because of an error on the remote cluster:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException): org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1594 actions: FailedServerException: 1594 times,
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:158)
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:146)
> 	at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:692)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:2106)
> 	at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:689)
> 	at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:697)
> 	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:682)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:239)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:161)
> 	at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:173)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.replicateWALEntry(HRegionServer.java:3735)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14402)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2122)
> 	at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1829)
> 	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1369)
> 	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
> 	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:15177)
> 	at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:94)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:642)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:376)
> {noformat}
> I wonder what's closing the socket with an interrupt; it seems it still needs to replicate more data. I'll start by adding the stack trace to the message logged when it fails to replicate on a "local exception". Also, I found a thread that wasn't shut down properly, which I'm going to fix to help with debugging.
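
For readers unfamiliar with the exception in the log above: in java.nio, a ClosedByInterruptException is raised when a thread is interrupted while it is blocked in an I/O operation on an interruptible channel, and the channel is closed as a side effect. The sketch below reproduces that mechanism in isolation. It is not HBase code: a java.nio.channels.Pipe stands in for the RPC socket, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; nothing here comes from the HBase code base.
final class InterruptClosesChannelDemo {

    /** Interrupts a thread blocked in a channel read and reports what that thread saw. */
    static String interruptBlockedRead() throws IOException, InterruptedException {
        Pipe pipe = Pipe.open();
        AtomicReference<String> outcome = new AtomicReference<>("no result");

        Thread reader = new Thread(() -> {
            try {
                // Nothing is ever written to the pipe, so this read blocks.
                pipe.source().read(ByteBuffer.allocate(16));
                outcome.set("read returned normally");
            } catch (ClosedByInterruptException e) {
                // The interrupt closed the channel underneath the blocked read.
                outcome.set("ClosedByInterruptException open=" + pipe.source().isOpen());
            } catch (IOException e) {
                outcome.set("unexpected: " + e);
            }
        }, "demo-reader");

        reader.start();
        Thread.sleep(200);   // let the reader reach the blocking read
        reader.interrupt();  // stands in for whatever interrupts the replication thread
        reader.join();
        pipe.sink().close();
        return outcome.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(interruptBlockedRead());
    }
}
```

Because the channel is closed by the interrupt, any later I/O on it fails as well. That fits the description above, where the source still had data to replicate, but the sketch does not show which thread in the test issued the interrupt.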