hbase-user mailing list archives

From: Cyril Scetbon <cyril.scet...@free.fr>
Subject: Re: hosts unreachables
Date: Thu, 31 May 2012 19:10:04 GMT
I'm still seeing region server nodes crash :(

You can see the reasons on two of the region nodes:

*node1:*

2012-05-31 17:57:03,937 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region ise,ID-TST-100-1Rw6ue7L8iFMpWMqMmZYxDdoRLCIzI70e5kVMrgLrXW,1338486988233.aaaa5b0661020d1107eb31e982d71126.
2012-05-31 17:57:03,938 INFO org.apache.hadoop.hbase.regionserver.Store: Started compaction of 2 file(s) in cf=core, hasReferences=true, into hdfs://hb-zk1:54310/hbase/ise/aaaa5b0661020d1107eb31e982d71126/.tmp, seqid=995472, totalSize=256.5m
2012-05-31 17:58:24,860 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_989994740052943354_38499 java.net.SocketTimeoutException: 72000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.16:45314 remote=/10.0.0.16:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3049)
2012-05-31 17:59:11,655 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x337a359bb980015, likely server has closed socket, closing socket connection and attempting reconnect
2012-05-31 17:59:11,687 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_989994740052943354_38499 bad datanode[0] 10.0.0.16:50010
2012-05-31 17:59:11,689 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_989994740052943354_38499 in pipeline 10.0.0.16:50010, 10.0.0.13:50010, 10.0.0.11:50010, 10.0.0.15:50010: bad datanode 10.0.0.16:50010
2012-05-31 17:59:11,688 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hb-d6,60020,1338475673130, load=(requests=0, regions=13, usedHeap=434, maxHeap=4091): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hb-d6,60020,1338475673130 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hb-d6,60020,1338475673130 as dead server
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:749)
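
What I can see in node1's log is: the write to the local datanode timed out, then the ZooKeeper connection dropped, and by the time the server reported in again the master had already declared it dead, hence the YouAreDeadException. For reference, this is the session timeout knob I believe is involved (a sketch of hbase-site.xml, assuming the 0.90/CDH3 default of 180000 ms):

  <!-- hbase-site.xml (sketch): the ZooKeeper session timeout the region
       server asks for; assuming the 0.90/CDH3 default of 180000 ms.
       The master only declares the server dead once this lease expires. -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
  </property>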

*node2:*
2012-05-31 18:26:44,339 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk1/10.0.0.50:2181
2012-05-31 18:26:44,340 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk1/10.0.0.50:2181, initiating session
2012-05-31 18:26:44,343 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, trying to reconnect.
2012-05-31 18:26:44,343 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper.
2012-05-31 18:26:44,450 INFO org.apache.zookeeper.ZooKeeper: Session: 0x237a35960010023 closed
2012-05-31 18:26:44,450 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to close
2012-05-31 18:26:44,451 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hb-zk2:2181,hb-zk1:2181,hb-zk3:2181 sessionTimeout=180000 watcher=hconnection
2012-05-31 18:26:44,459 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk3/10.0.1.50:2181
2012-05-31 18:26:44,490 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk3/10.0.1.50:2181, initiating session
2012-05-31 18:26:44,552 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted:
java.io.InterruptedIOException: Aborting compaction of store core in region ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24. because it was interrupted.
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:1008)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:789)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:810)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:750)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
2012-05-31 18:26:44,553 INFO org.apache.hadoop.hbase.regionserver.HRegion: aborted compaction on region ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24. after 2mins, 12sec
2012-05-31 18:26:44,553 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
2012-05-31 18:26:44,555 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24.
2012-05-31 18:26:44,650 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hb-zk3/10.0.1.50:2181, sessionid = 0x337a359bb980024, negotiated timeout = 40000
2012-05-31 18:26:44,659 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment.
2012-05-31 18:26:44,659 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2012-05-31 18:26:45,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk2/10.0.0.51:2181
2012-05-31 18:26:45,009 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk2/10.0.0.51:2181, initiating session
2012-05-31 18:26:45,012 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x337a359bb980001 has expired, closing socket connection
2012-05-31 18:26:45,012 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hb-d1,60020,1338475437868, load=(requests=145, regions=17, usedHeap=499, maxHeap=4091): regionserver:60020-0x337a359bb980001 regionserver:60020-0x337a359bb980001 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:361)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:279)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
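
One detail I notice above: the client asks for sessionTimeout=180000 but the negotiated timeout is only 40000. I assume the ZooKeeper servers are capping it, since maxSessionTimeout defaults to 20 * tickTime (20 * 2000 ms = 40000 ms). A sketch of the zoo.cfg change I think would be needed on the three zookeeper hosts (values are just an example):

  # zoo.cfg (sketch) -- assuming the default tickTime
  tickTime=2000
  # raise the cap so a client can actually negotiate a 180000 ms session
  maxSessionTimeout=180000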

As I understand it, node 2 failed because of a long compaction, but I don't know how to fix that. As for node 1, I don't really understand the cause, nor how to fix it :(

I'm using HBase 0.90 from the CDH3 (Cloudera) packages. I also don't understand why a node that is declared dead after an error can't come back after that ...
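
To check whether a long GC pause is what gets a node declared dead in the first place, I suppose I could enable GC logging on the region servers, e.g. in hbase-env.sh (a sketch; the log path is just an example):

  # hbase-env.sh (sketch): log GC activity so long pauses show up
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-regionserver.log"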

thanks

On 5/29/12 5:17 PM, Cyril Scetbon wrote:
> Hi,
>
> I've installed HBase on the following configuration:
>
> 12 x (hbase rest + hbase regionserver + hadoop datanode)
> 2 x (zookeeper + hbase master)
> 1 x (zookeeper + hbase master + hadoop namenode)
>
> The OS used is Ubuntu Lucid (10.04).
>
> The issue is that when I try to load data using the REST API, some hosts
> become unreachable even though I can still ping them. I can no longer
> connect to them, and even the monitoring tools stop working for a while.
> For example, I run sar on each host, and you can see that between 7:10
> and 7:35 PM the host did not write any data:
>
>                 CPU     %user     %nice   %system   %iowait    %steal     %idle
> 06:45:01 PM     all      0.18      0.00      0.37      3.61      0.25     95.58
> 06:45:01 PM       0      0.24      0.00      0.54      6.62      0.35     92.25
> 06:45:01 PM       1      0.12      0.00      0.20      0.61      0.15     98.92
> 06:50:02 PM     all      5.69      0.00      1.79      4.23      1.94     86.36
> 06:50:02 PM       0      5.68      0.00      3.00      7.91      2.21     81.21
> 06:50:02 PM       1      5.70      0.00      0.59      0.55      1.66     91.51
> 06:55:01 PM     all      0.68      0.00      0.14      1.62      0.23     97.33
> 06:55:01 PM       0      0.87      0.00      0.20      3.19      0.31     95.44
> 06:55:01 PM       1      0.49      0.00      0.08      0.05      0.15     99.22
> 06:58:36 PM     all      0.03      0.00      0.02      0.45      0.07     99.43
> 06:58:36 PM       0      0.01      0.00      0.02      0.40      0.13     99.43
> 06:58:36 PM       1      0.04      0.00      0.01      0.51      0.00     99.43
> 07:05:01 PM     all      0.03      0.00      0.00      0.10      0.07     99.80
> 07:05:01 PM       0      0.02      0.00      0.00      0.10      0.10     99.78
> 07:05:01 PM       1      0.04      0.00      0.01      0.09      0.03     99.83 <--- last measure before the host became unreachable
> 07:40:07 PM     all     14.72      0.00     17.93      0.02     13.31     54.02 <--- first measure after the host became reachable again
> 07:40:07 PM       0     29.43      0.00     35.87      0.00     26.57      8.13
> 07:40:07 PM       1      0.00      0.00      0.00      0.04      0.04     99.91
> 07:45:01 PM     all      0.55      0.00      0.25      0.04      0.27     98.89
> 07:45:01 PM       0      0.54      0.00      0.14      0.05      0.21     99.07
> 07:45:01 PM       1      0.55      0.00      0.36      0.04      0.33     98.72
> 07:50:01 PM     all      0.11      0.00      0.05      0.18      0.06     99.60
> 07:50:01 PM       0      0.12      0.00      0.06      0.13      0.09     99.60
> 07:50:01 PM       1      0.11      0.00      0.04      0.23      0.04     99.59
> 07:55:01 PM     all      0.00      0.00      0.01      0.05      0.07     99.88
> 07:55:01 PM       0      0.00      0.00      0.01      0.01      0.13     99.84
> 07:55:01 PM       1      0.00      0.00      0.00      0.08      0.00     99.91
> 08:05:01 PM     all      0.01      0.00      0.00      0.00      0.05     99.94
> 08:05:01 PM       0      0.00      0.00      0.00      0.00      0.08     99.91
> 08:05:01 PM       1      0.03      0.00      0.00      0.00      0.01     99.96
>
> I suppose it's caused by high load, but I don't have any proof :( Is
> there a known bug about this? I had a similar issue with Cassandra that
> forced me to upgrade to a Linux kernel > 3.0.
>
> thanks.
>


-- 
Cyril SCETBON

