hbase-user mailing list archives

From Ferdy <ferdy.gal...@kalooga.com>
Subject Re: Regionserver problems because of datanode timeouts
Date Wed, 03 Feb 2010 09:30:21 GMT
Hi,

Increasing the memory did not work. Is it possible to increase the
timeout period? All our jobs are offline processing jobs, so I'd rather
accept lower responsiveness than have ZooKeeper decide to terminate a
regionserver.
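
For reference, this is the kind of change I have in mind in
hbase-site.xml (assuming zookeeper.session.timeout is the property that
controls this; the value is only an illustration, not a recommendation):

<property>
  <name>zookeeper.session.timeout</name>
  <!-- assumed property name; session timeout in milliseconds,
       e.g. 120000 = 2 minutes (illustrative value) -->
  <value>120000</value>
</property>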

I also noticed that regionserver crashes happen even when there is no
load at all on HBase; in other words, they seem to occur randomly.

Ferdy

Jean-Daniel Cryans wrote:
> I see more than one problem here.
>
>   
>> DFSOutputStream ResponseProcessor exception  for block
>> blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000
>>     
>
> As you said, your HDFS is getting hit a lot, and this is a sign of it.
> Some people set dfs.datanode.socket.write.timeout and
> dfs.datanode.socket.read.timeout to very high values in that case
> (more than 10 minutes). If you want to serve live data from HBase,
> that is really not ideal; consider separating processing and serving
> into 2 clusters, OR lower the number of tasks on the nodes.
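>
> Roughly, that would look like this in hdfs-site.xml on the datanodes
> (property names as written above; the 10-minute values are only an
> illustration of "very high", not a recommendation):
>
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <!-- write timeout in milliseconds; 600000 = 10 minutes (example) -->
>   <value>600000</value>
> </property>
> <property>
>   <name>dfs.datanode.socket.read.timeout</name>
>   <!-- read timeout in milliseconds (example value) -->
>   <value>600000</value>
> </property>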
>
>   
>> java.io.IOException: TIMED OUT
>>       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> 2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>> 66554ms, ten times longer than scheduled: 3000
>>     
>
> This means your region server was garbage collecting for more than a
> minute, which is longer than the default time after which a region
> server is considered dead. This led to:
>
>   
>> java.io.IOException: Session Expired
>>       at
>> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>>     
>
> This is how we notice that a server has gone missing: each region server
> has an ephemeral node in ZooKeeper, and that node is lost after 40 secs
> in 0.20.2 (and previous) and after 60 secs in 0.20.3.
>
> Consider giving at least 4GB of heap to the region servers and make
> sure you never swap.
>
> J-D
>
> On Wed, Jan 20, 2010 at 8:22 AM, Ferdy <ferdy.galema@kalooga.com> wrote:
>   
>> Hello,
>>
>> I am having issues with regionservers shutting down because the underlying
>> DFS is having trouble. I cannot get the DFS to function properly. I attached
>> a log of our regionserver below. At first sight it looks like a timeout
>> occurs while the regionserver is trying to connect to its own datanode:
>> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
>> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
>> local=/172.23.122.23:39203 remote=/172.23.122.23:50010]
>>
>> The datanode logs show similar timeout errors (also when trying to connect
>> to the local host).
>>
>> Our configuration:
>> * Hadoop 0.20.1 and Hbase 0.20.2
>> * 20 nodes: each datanode/tasktracker and regionserver
>> * ulimit = 32000 on each node
>> * dfs.datanode.max.xcievers=5000 (should be big enough, I presume; see the
>> hdfs-site.xml sketch below)
>> * dfs.namenode.handler.count=30 (3 times the default)
>> * dfs.datanode.handler.count=9 (3 times the default)
>> * Our DFS contains around 500,000 blocks (64MB blocksize). Each datanode has
>> a 1600M heapsize. The datanodes are started with the -XX:+UseConcMarkSweepGC
>> -XX:+CMSIncrementalMode options.
>> * The DFS is fairly heavily used (we are always running a few concurrent
>> MapRed jobs, with average IO).
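>>
>> For completeness, the HDFS settings above sit in our hdfs-site.xml roughly
>> like this (trimmed to the relevant properties; values as listed):
>>
>> <property>
>>   <name>dfs.datanode.max.xcievers</name>
>>   <!-- max concurrent xceiver threads per datanode (value as listed) -->
>>   <value>5000</value>
>> </property>
>> <property>
>>   <name>dfs.namenode.handler.count</name>
>>   <!-- namenode RPC handler threads, 3x the default of 10 -->
>>   <value>30</value>
>> </property>
>> <property>
>>   <name>dfs.datanode.handler.count</name>
>>   <!-- datanode RPC handler threads, 3x the default of 3 -->
>>   <value>9</value>
>> </property>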
>>
>> Anyone have a clue what might be going on?
>>
>> Regards,
>> Ferdy.
>>
>> THE REGIONSERVER LOG:
>> 2010-01-20 13:52:15,982 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
>> started.  Attempting to free 22928768 bytes
>> 2010-01-20 13:52:15,988 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
>> completed. Freed 22985616 bytes.  Priority Sizes: Single=84.519615MB
>> (88625240), Multi=99.30783MB (104131808),Memory=0.0MB (0)
>> 2010-01-20 13:53:03,613 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>> Total=169.96802MB (178224384), Free=48.36949MB (50719088), Max=218.33751MB
>> (228943472), Counts: Blocks=2683, Access=184190, Hit=10090, Miss=174100,
>> Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.478039011359215%, Miss
>> Ratio=94.521963596344%, Evicted/Run=349.0
>> 2010-01-20 13:54:03,605 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>> Total=174.83826MB (183331208), Free=43.499245MB (45612264), Max=218.33751MB
>> (228943472), Counts: Blocks=2762, Access=184477, Hit=10298, Miss=174179,
>> Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.582267791032791%, Miss
>> Ratio=94.41773295402527%, Evicted/Run=349.0
>> 2010-01-20 13:55:45,744 WARN org.apache.hadoop.hdfs.DFSClient:
>> DFSOutputStream ResponseProcessor exception  for block
>> blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000
>> millis timeout while waiting for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203
>> remote=/172.23.122.23:50010]
>>       at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>       at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>       at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>       at java.io.DataInputStream.readFully(Unknown Source)
>>       at java.io.DataInputStream.readLong(Unknown Source)
>>       at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2399)
>>
>> 2010-01-20 13:55:45,744 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>> Total=178.23663MB (186894656), Free=40.100876MB (42048816), Max=218.33751MB
>> (228943472), Counts: Blocks=2816, Access=184676, Hit=10443, Miss=174233,
>> Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.654768273234367%, Miss
>> Ratio=94.34522986412048%, Evicted/Run=349.0
>> 2010-01-20 13:55:45,745 WARN org.apache.zookeeper.ClientCnxn: Exception
>> closing session 0x1264b1fe3240006 to sun.nio.ch.SelectionKeyImpl@2f242b11
>> java.io.IOException: TIMED OUT
>>       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> 2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>> 66554ms, ten times longer than scheduled: 3000
>> 2010-01-20 13:55:45,911 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> state: Disconnected, type: None, path: null
>> 2010-01-20 13:55:46,178 WARN
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
>> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>       at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
>> Source)
>>       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
>> Source)
>>       at java.lang.reflect.Constructor.newInstance(Unknown Source)
>>       at
>> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>>       at
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>>       at
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>>       at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>>       at java.lang.Thread.run(Unknown Source)
>> 2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_-209323108721490976_6065406 bad datanode[0]
>> 172.23.122.23:50010
>> 2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_-209323108721490976_6065406 in pipeline
>> 172.23.122.23:50010, 172.23.122.16:50010,172.23.122.22:50010: bad datanode
>> 172.23.122.23:50010
>> 2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
>> Server handler 3 on 60020 took 71023ms appending an edit to hlog;
>> editcount=41251
>> 2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog:
>> regionserver/172.23.122.23:60020.logFlusher took 67200ms optional sync'ing
>> hlog; editcount=41271
>> 2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
>> Server handler 1 on 60020 took 69781ms appending an edit to hlog;
>> editcount=41271
>> 2010-01-20 13:55:47,081 INFO org.apache.zookeeper.ClientCnxn: Attempting
>> connection to server c1.machines.nl/172.23.122.1:2181
>> 2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Priming
>> connection to java.nio.channels.SocketChannel[connected
>> local=/172.23.122.23:53151 remote=c1/172.23.122.1:2181]
>> 2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Server
>> connection successful
>> 2010-01-20 13:55:47,083 WARN org.apache.zookeeper.ClientCnxn: Exception
>> closing session 0x1264b1fe3240006 to sun.nio.ch.SelectionKeyImpl@5f84f3d2
>> java.io.IOException: Session Expired
>>       at
>> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>>       at
>> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>>       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> 2010-01-20 13:55:47,086 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> state: Expired, type: None, path: null
>> 2010-01-20 13:55:47,086 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
>> expired
>>
>>
>>     
>
>   
