hbase-user mailing list archives

From Ferdy <ferdy.gal...@kalooga.com>
Subject Regionserver problems because of datanode timeouts
Date Wed, 20 Jan 2010 16:22:59 GMT
Hello,

I am having issues with regionservers shutting down because the underlying 
DFS is having trouble. I cannot get the DFS to function properly. I have 
attached a log from one of our regionservers below. At first sight it looks 
like a timeout occurs while the regionserver is trying to connect to its 
own datanode:
java.net.SocketTimeoutException: 69000 millis timeout while waiting for 
channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203 
remote=/172.23.122.23:50010]

The datanode logs show similar timeout errors (also when trying to 
connect to the local host).
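
As far as I can tell the 69000 ms value is not something we configured 
ourselves. If I read the Hadoop 0.20 DFSClient code correctly, it is just 
the default socket read timeout plus a fixed extension per datanode in the 
write pipeline (the breakdown below is my own assumption, not a setting on 
our side):

    timeout = dfs.socket.timeout + 3000 ms * datanodes in the pipeline
            = 60000 ms + 3 * 3000 ms
            = 69000 ms

With the default 60 second dfs.socket.timeout and replication factor 3 that 
matches the 69000 ms in the exception exactly, so the client apparently 
waited that long for an ack from its local datanode and then gave up.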

Our configuration:
* Hadoop 0.20.1 and HBase 0.20.2
* 20 nodes, each running a datanode, tasktracker and regionserver
* ulimit = 32000 on each node
* dfs.datanode.max.xcievers=5000 (should be big enough, I presume; the 
relevant hdfs-site.xml entries are sketched below)
* dfs.namenode.handler.count=30 (3 times the default)
* dfs.datanode.handler.count=9 (3 times the default)
* Our DFS contains around 500,000 blocks (64 MB block size). Each datanode 
has a 1600 MB heap. The datanodes are started with the 
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode options.
* The DFS is fairly heavily used (there are always a few concurrent 
MapReduce jobs running, with average IO).
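
For completeness, this is roughly how those DFS settings appear in our 
hdfs-site.xml (same values as listed above; this is only a sketch, not a 
verbatim copy of the file):

    <!-- sketch of the relevant hdfs-site.xml entries, values as above -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>5000</value>
    </property>
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>30</value>
    </property>
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>9</value>
    </property>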

Does anyone have a clue what might be going on?

Regards,
Ferdy.

THE REGIONSERVER LOG:
2010-01-20 13:52:15,982 DEBUG 
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction 
started.  Attempting to free 22928768 bytes
2010-01-20 13:52:15,988 DEBUG 
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction 
completed. Freed 22985616 bytes.  Priority Sizes: Single=84.519615MB 
(88625240), Multi=99.30783MB (104131808),Memory=0.0MB (0)
2010-01-20 13:53:03,613 DEBUG 
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: 
Total=169.96802MB (178224384), Free=48.36949MB (50719088), 
Max=218.33751MB (228943472), Counts: Blocks=2683, Access=184190, 
Hit=10090, Miss=174100, Evictions=7, Evicted=2443, Ratios: Hit 
Ratio=5.478039011359215%, Miss Ratio=94.521963596344%, Evicted/Run=349.0
2010-01-20 13:54:03,605 DEBUG 
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: 
Total=174.83826MB (183331208), Free=43.499245MB (45612264), 
Max=218.33751MB (228943472), Counts: Blocks=2762, Access=184477, 
Hit=10298, Miss=174179, Evictions=7, Evicted=2443, Ratios: Hit 
Ratio=5.582267791032791%, Miss Ratio=94.41773295402527%, Evicted/Run=349.0
2010-01-20 13:55:45,744 WARN org.apache.hadoop.hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000 
millis timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203 
remote=/172.23.122.23:50010]
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.DataInputStream.readFully(Unknown Source)
        at java.io.DataInputStream.readLong(Unknown Source)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2399)

2010-01-20 13:55:45,744 DEBUG 
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: 
Total=178.23663MB (186894656), Free=40.100876MB (42048816), 
Max=218.33751MB (228943472), Counts: Blocks=2816, Access=184676, 
Hit=10443, Miss=174233, Evictions=7, Evicted=2443, Ratios: Hit 
Ratio=5.654768273234367%, Miss Ratio=94.34522986412048%, Evicted/Run=349.0
2010-01-20 13:55:45,745 WARN org.apache.zookeeper.ClientCnxn: Exception 
closing session 0x1264b1fe3240006 to sun.nio.ch.SelectionKeyImpl@2f242b11
java.io.IOException: TIMED OUT
        at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We 
slept 66554ms, ten times longer than scheduled: 3000
2010-01-20 13:55:45,911 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, 
state: Disconnected, type: None, path: null
2010-01-20 13:55:46,178 WARN 
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown 
Source)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at 
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
        at 
org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
        at 
org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
        at java.lang.Thread.run(Unknown Source)
2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error 
Recovery for block blk_-209323108721490976_6065406 bad datanode[0] 
172.23.122.23:50010
2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error 
Recovery for block blk_-209323108721490976_6065406 in pipeline 
172.23.122.23:50010, 172.23.122.16:50010,172.23.122.22:50010: bad 
datanode 172.23.122.23:50010
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: 
IPC Server handler 3 on 60020 took 71023ms appending an edit to hlog; 
editcount=41251
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: 
regionserver/172.23.122.23:60020.logFlusher took 67200ms optional 
sync'ing hlog; editcount=41271
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: 
IPC Server handler 1 on 60020 took 69781ms appending an edit to hlog; 
editcount=41271
2010-01-20 13:55:47,081 INFO org.apache.zookeeper.ClientCnxn: Attempting 
connection to server c1.machines.nl/172.23.122.1:2181
2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Priming 
connection to java.nio.channels.SocketChannel[connected 
local=/172.23.122.23:53151 remote=c1/172.23.122.1:2181]
2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Server 
connection successful
2010-01-20 13:55:47,083 WARN org.apache.zookeeper.ClientCnxn: Exception 
closing session 0x1264b1fe3240006 to sun.nio.ch.SelectionKeyImpl@5f84f3d2
java.io.IOException: Session Expired
        at 
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
        at 
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
        at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-01-20 13:55:47,086 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, 
state: Expired, type: None, path: null
2010-01-20 13:55:47,086 ERROR 
org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session 
expired

