hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8919) TestReplicationQueueFailover (and Compressed) can fail because the recovered queue gets stuck on ClosedByInterruptException
Date Tue, 16 Jul 2013 00:02:49 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709216#comment-13709216 ]

Jean-Daniel Cryans commented on HBASE-8919:
-------------------------------------------

Finally got another failed test run that includes the stack trace. Here it is (from http://54.241.6.143/job/HBase-0.95/org.apache.hbase$hbase-server/610/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/):

{noformat}
2013-07-12 21:17:00,382 INFO  [Thread-962] regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to ip-10-196-81-100.us-west-1.compute.internal/10.196.81.100:39599 failed on local exception: java.nio.channels.ClosedByInterruptException
java.io.IOException: Call to ip-10-196-81-100.us-west-1.compute.internal/10.196.81.100:39599 failed on local exception: java.nio.channels.ClosedByInterruptException
	at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1401)
	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1373)
	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:15213)
	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1466)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$2.run(ReplicationSource.java:793)
Caused by: java.nio.channels.ClosedByInterruptException
	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:343)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
	at java.io.DataOutputStream.flush(DataOutputStream.java:106)
	at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:231)
	at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:220)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.writeRequest(RpcClient.java:1014)
	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1349)
{noformat}
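For reference, the `Caused by` frames above (`AbstractInterruptibleChannel.end` reached from `SocketChannelImpl.write`) match the documented behavior of interruptible NIO channels: if a thread's interrupt status is set when it performs blocking I/O on such a channel, the channel is closed and the thread receives a `ClosedByInterruptException`. The sketch below reproduces that in isolation. It is only an illustration, not HBase code: the class name `InterruptedWriteDemo` is made up, and a `java.nio.channels.Pipe` stands in for the RPC socket.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;

// Hypothetical demo class; a Pipe stands in for the RPC client's socket.
public class InterruptedWriteDemo {

    /**
     * Writes to a blocking, interruptible channel while the current thread's
     * interrupt flag is set. Returns the name of the exception class that was
     * thrown, or "none" if the write went through.
     */
    static String writeWhileInterrupted() throws IOException {
        Pipe pipe = Pipe.open();
        try {
            // Simulate "something interrupted this thread before the write".
            Thread.currentThread().interrupt();
            pipe.sink().write(ByteBuffer.wrap(new byte[] {1}));
            return "none";
        } catch (ClosedByInterruptException e) {
            // The channel has been closed as a side effect of the interrupt.
            return e.getClass().getName();
        } finally {
            // Clear the interrupt flag so it does not leak to the caller,
            // then release both ends of the pipe (close is a no-op if closed).
            Thread.interrupted();
            pipe.sink().close();
            pipe.source().close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeWhileInterrupted());
    }
}
```

The exception therefore says only that the writing thread was interrupted; it does not identify which thread issued the interrupt.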
                
> TestReplicationQueueFailover (and Compressed) can fail because the recovered queue gets stuck on ClosedByInterruptException
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8919
>                 URL: https://issues.apache.org/jira/browse/HBASE-8919
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>         Attachments: HBASE-8919.patch
>
>
> Looking at this build: https://builds.apache.org/job/hbase-0.95-on-hadoop2/173/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/
> The only thing I can find that went wrong is that the recovered queue was not completely done because the source fails like this:
> {noformat}
> 2013-07-10 11:53:51,538 INFO  [Thread-1259] regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to hemera.apache.org/140.211.11.27:38614 failed on local exception: java.nio.channels.ClosedByInterruptException
> {noformat}
> And just before that it got:
> {noformat}
> 2013-07-10 11:53:51,290 WARN  [ReplicationExecutor-0.replicationSource,2-hemera.apache.org,43669,1373457208379] regionserver.ReplicationSource(661): Can't replicate because of an error on the remote cluster:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException): org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1594 actions: FailedServerException: 1594 times,
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:158)
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:146)
> 	at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:692)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:2106)
> 	at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:689)
> 	at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:697)
> 	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:682)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:239)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:161)
> 	at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:173)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.replicateWALEntry(HRegionServer.java:3735)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14402)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2122)
> 	at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1829)
> 	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1369)
> 	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
> 	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:15177)
> 	at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:94)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:642)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:376)
> {noformat}
> I wonder what's closing the socket with an interrupt; it seems it still needs to replicate more data. I'll start by adding the stack trace to the message logged when it fails to replicate on a "local exception". Also, I found a thread that wasn't shut down properly, which I'm going to fix to help with debugging.
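
As a rough illustration of the "add the stack trace to the message" step described above, the sketch below passes the exception itself to the logger rather than only its message, so the full `Caused by` chain is available to the log output. It makes several assumptions: `java.util.logging` is used as a stand-in for whatever logging API the real code uses, and the class name `LogWithCause`, the logger name, and the `host:port` text are all hypothetical.

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical demo class; not the actual ReplicationSource code.
public class LogWithCause {

    // Keeps records in memory so the attached Throwable can be inspected.
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();
        @Override public void publish(LogRecord record) { records.add(record); }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /** Logs a wrapped I/O failure together with its Throwable and returns the record. */
    static LogRecord logFailure() {
        Logger log = Logger.getLogger("demo.replication"); // made-up logger name
        log.setUseParentHandlers(false); // keep the demo quiet on the console
        CapturingHandler handler = new CapturingHandler();
        log.addHandler(handler);
        try {
            // Mimic the shape of the failure: an IOException wrapping the NIO cause.
            IOException wrapped = new IOException(
                "Call to host:port failed on local exception",
                new ClosedByInterruptException());
            // Passing the Throwable keeps the stack trace and the cause chain,
            // instead of flattening everything into a one-line message.
            log.log(Level.INFO, "Slave cluster looks down: " + wrapped.getMessage(), wrapped);
            return handler.records.get(0);
        } finally {
            log.removeHandler(handler);
        }
    }

    public static void main(String[] args) {
        LogRecord record = logFailure();
        System.out.println(record.getMessage());
        System.out.println("cause: " + record.getThrown().getCause());
    }
}
```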

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
