hbase-user mailing list archives

From Abraham Tom <work2m...@gmail.com>
Subject Re: Hbase Replication no longer replicating, help diagnose
Date Tue, 19 Apr 2016 22:33:16 GMT
My timeout is set pretty high:

1200000

Maybe too high. We do get bursts of large changes when I update HBase via
Hive MapReduce jobs.

I restarted both clusters and they caught up, but after a couple of days
they just slow down and stop replicating.
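
For reference, a sketch of how that is set in hbase-site.xml on both clusters
(assuming the value goes into hbase.rpc.timeout, the property Ashish
mentioned below):

    <property>
      <name>hbase.rpc.timeout</name>
      <value>1200000</value>  <!-- 20 minutes, in milliseconds -->
    </property>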

On Fri, Apr 15, 2016 at 5:49 AM, ashish singhi <ashish.singhi@huawei.com>
wrote:

> Let me explain in theory how it works (considering default configuration
> values).
>
> Assume a peer RS is already handling 3 replication requests
> (hbase.regionserver.replication.handler.count) and they do not complete
> within the 1 minute hbase.rpc.timeout (for some unknown reason, maybe a
> slow RS or network speed...). The source RS then gets a
> CallTimeoutException and resends the request to the same peer RS, so the
> request is added to that peer RS's queue (max queue length = 30,
> hbase.regionserver.replication.handler.count * hbase.ipc.server.max.callqueue.length).
> The size of both running and waiting requests counts toward the
> callQueueSize, so (running + waiting requests) * 64MB
> (replication.source.size.capacity) will cross the 1GB call queue size
> limit (hbase.ipc.server.max.callqueue.size) and result in a
> CallQueueTooBigException.
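>
> As a rough illustration of that arithmetic (only a sketch, using the
> default values above):
>
>     16 batches * 64 MB (replication.source.size.capacity) = 1024 MB = 1 GB
>
> so once roughly 16 or more replication batches are running or waiting on
> the peer RS, the 1 GB hbase.ipc.server.max.callqueue.size limit is reached
> and further calls are rejected with CallQueueTooBigException.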
>
> Now, why are those running requests not getting completed? I assume this
> can be the reason: a peer RS receives a replication request and internally
> distributes that batch to other RSes in the peer cluster, and this can get
> stuck because those other peer RSes may themselves have received
> replication requests from other source cluster RSes... so it might result
> in a kind of deadlock, where one peer RS is waiting for another peer RS to
> finish its request, and that RS in turn may be processing some other
> request and waiting for its completion.
>
> So to avoid this problem, we need to find out why the peer RS is slow.
> Based on that and the network speed, adjust the hbase.rpc.timeout value
> and restart the source and peer clusters.
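>
> A sketch of those knobs as they would appear in hbase-site.xml on both
> clusters (the values shown are just the defaults mentioned above; the
> right hbase.rpc.timeout depends on your network speed and peer load):
>
>     <property>
>       <name>hbase.rpc.timeout</name>
>       <value>60000</value>                 <!-- default: 1 minute, in ms -->
>     </property>
>     <property>
>       <name>hbase.regionserver.replication.handler.count</name>
>       <value>3</value>                     <!-- replication handlers per RS -->
>     </property>
>     <property>
>       <name>hbase.ipc.server.max.callqueue.size</name>
>       <value>1073741824</value>            <!-- 1 GB call queue limit -->
>     </property>
>     <property>
>       <name>replication.source.size.capacity</name>
>       <value>67108864</value>              <!-- 64 MB per replication batch -->
>     </property>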
>
> Regards,
> Ashish
>
> -----Original Message-----
> From: Abraham Tom [mailto:work2much@gmail.com]
> Sent: 14 April 2016 18:52
> To: Hbase-User
> Subject: Hbase Replication no longer replicating, help diagnose
>
> My HBase replication has stopped.
>
> I am on HBase version 1.0.0-cdh5.4.8 (Cloudera build).
>
> I have 2 clusters in 2 different datacenters; one is the master, the other
> is the slave.
>
> I see the following errors in the log:
>
>
>
> 2016-04-13 22:32:50,217 WARN
>
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
> Can't replicate because of a local or network error:
> java.io.IOException: Call to
> hadoop2-private.sjc03.infra.com/10.160.22.99:60020 failed on local
> exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1014,
> waitTime=1200001, operationTimeout=1200000 expired.
>         at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1255)
>         at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1223)
>         at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>         at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>         at
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:21783)
>         at
> org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:65)
>         at
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:161)
>         at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:696)
>         at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:410)
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1014,
> waitTime=1200001, operationTimeout=1200000 expired.
>         at
> org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70)
>         at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1197)
>         ... 7 more
>
>
>
>
>
> Which in turn fills the queue, and I get:
>
> 2016-04-13 22:35:19,555 WARN
>
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
> Can't replicate because of an error on the remote cluster:
>
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.RpcServer$CallQueueTooBigException):
> Call queue is full on /0.0.0.0:60020, is
> hbase.ipc.server.max.callqueue.size too small?
>         at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
>         at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>         at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>         at
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:21783)
>         at
> org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:65)
>         at
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:161)
>         at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:696)
>         at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:410)
>
>
> My peers look good, and this was working until Mar 27.
>
> We did have an inadvertent outage, but I was able to restore all cluster
> services.
>
>
>
> status 'replication'
> version 1.0.0-cdh5.4.8
> 5 live servers
>     hadoop5-private.wdc01.infra.com:
>        SOURCE: PeerID=1, AgeOfLastShippedOp=1538240180,
> SizeOfLogQueue=2135, TimeStampsOfLastShippedOp=Sun Mar 27 04:00:42
> GMT+00:00 2016, Replication Lag=1539342209
>        SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Tue Mar
> 22 10:09:39 GMT+00:00 2016
>     hadoop2-private.wdc01.infra.com:
>        SOURCE: PeerID=1, AgeOfLastShippedOp=810222876,
> SizeOfLogQueue=1302, TimeStampsOfLastShippedOp=Mon Apr 04 14:31:37
> GMT+00:00 2016, Replication Lag=810287122
>        SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Mar
> 25 21:20:59 GMT+00:00 2016
>     hadoop4-private.wdc01.infra.com:
>        SOURCE: PeerID=1, AgeOfLastShippedOp=602417946, SizeOfLogQueue=190,
> TimeStampsOfLastShippedOp=Thu Apr 07 00:06:38
> GMT+00:00 2016, Replication Lag=602983605
>        SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Mon Apr
> 04 14:35:56 GMT+00:00 2016
>     hadoop1-private.wdc01.infra.com:
>        SOURCE: PeerID=1, AgeOfLastShippedOp=602574285, SizeOfLogQueue=183,
> TimeStampsOfLastShippedOp=Thu Apr 07 00:10:29
> GMT+00:00 2016, Replication Lag=602753383
>        SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Thu Apr
> 07 00:10:23 GMT+00:00 2016
>     hadoop3-private.wdc01.infra.com:
>        SOURCE: PeerID=1, AgeOfLastShippedOp=602002192,
> SizeOfLogQueue=1148, TimeStampsOfLastShippedOp=Thu Apr 07 00:06:52
> GMT+00:00 2016, Replication Lag=602971172
>        SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Thu Apr
> 07 00:06:50 GMT+00:00 2016
>
>
>
> I can curl the quorum I set, so I don't think it's the network.
>
>
>
> What can I do to troubleshoot?
>
>
>
> I tried to run the following:
>
> hbase org.apache.hadoop.hbase.replication.regionserver.ReplicationSyncUp
> 100000
>
> and got the following response:
>
> 16/04/13 23:37:17 INFO zookeeper.ClientCnxn: Socket connection
> established, initiating session, client: /10.125.122.237:50784,
> server: hadoop2-private.sjc03.infra.com/10.160.22.99:2181
> 16/04/13 23:37:17 INFO zookeeper.ClientCnxn: Session establishment
> complete on server hadoop2-private.sjc03.infra.com/10.160.22.99:2181,
> sessionid = 0x252f1a90269f5d6, negotiated timeout = 150000
> 16/04/13 23:37:17 INFO regionserver.ReplicationSource: Replicating
> de6643f5-2a36-413e-b55f-8840b26395b1 ->
> 06a68811-0e50-4802-a478-d199df96bf85
> 16/04/13 23:37:27 INFO regionserver.ReplicationSource: Closing source
> 1 because: Region server is closing
> 16/04/13 23:37:27 WARN regionserver.ReplicationSource: Interrupted while
> reading edits java.lang.InterruptedException
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
>         at
> java.util.concurrent.PriorityBlockingQueue.poll(PriorityBlockingQueue.java:553)
>         at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.getNextPath(ReplicationSource.java:489)
>         at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:308)
> 16/04/13 23:37:27 INFO zookeeper.ZooKeeper: Session: 0x252f1a90269f5d6
> closed
> 16/04/13 23:37:27 INFO zookeeper.ClientCnxn: EventThread shut down
> 16/04/13 23:37:27 INFO
> client.ConnectionManager$HConnectionImplementation: Closing zookeeper
> sessionid=0x152f1a8ff4ef600
> 16/04/13 23:37:27 INFO zookeeper.ZooKeeper: Session: 0x152f1a8ff4ef600
> closed
> 16/04/13 23:37:27 INFO zookeeper.ClientCnxn: EventThread shut down
> 16/04/13 23:37:31 INFO zookeeper.ZooKeeper: Session: 0x153ee0d274c3c6a
> closed
> 16/04/13 23:37:31 INFO zookeeper.ClientCnxn: EventThread shut down
>
>
> I am willing to lose the queue if there is a way to flush it and reset the
> sync process, because I can distcp the various data and manually load my
> tables to play catch-up.
>
>
> Or, are there other things I should try, to diagnose where the logjam is?
>



-- 
Abraham Tom
Email:   work2much@gmail.com
Phone:  415-515-3621
