hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: SocketTimeoutException caused by GC?
Date Thu, 27 Jan 2011 15:53:28 GMT
Regarding the bad datanode error: I found 164 occurrences in the region
server logs of our 7-node dev cluster running hbase 0.90.
In our 14-node staging cluster running hbase 0.20.6, I found none.

Both use cdh3b2 hadoop.
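
For reference, a count like that can be pulled with a quick grep over the
region server logs; something along these lines (the log path and file
pattern are illustrative, adjust for your layout):

  grep -c "bad datanode" /var/log/hbase/*regionserver*.log* | awk -F: '{s+=$2} END {print s}'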

On Thu, Jan 27, 2011 at 6:48 AM, Wayne <wav100@gmail.com> wrote:

> We have 0.90 up and running well, but again, after 24 hours of loading, a
> node went down. Underneath it all I assume this is a GC issue, but the GC
> logging rolls every < 60 minutes, so I can never see logs from 5 hours ago
> (we are working on getting Scribe up to solve that). Most of our issues
> involve a node being marked as dead after becoming unresponsive. It often
> starts with a socket timeout. We can turn up the ZooKeeper timeout, but
> that does not address the underlying issue.
>
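One thing that helps with the short GC log window is writing the GC log to a
per-start file from hbase-env.sh so older runs are not overwritten, and
turning on -XX:+PrintGCApplicationStoppedTime so stop-the-world pauses show
up explicitly. A rough sketch (paths and file names are illustrative):

  # hbase-env.sh -- illustrative GC logging setup; adjust paths to taste
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
    -Xloggc:/var/log/hbase/gc-hbase-$(hostname)-$(date +%Y%m%d%H%M%S).log"

As for the ZooKeeper side, zookeeper.session.timeout in hbase-site.xml is
the knob, but as you say raising it only hides the pause.
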
> Here is the first sign of trouble. Is the 1 minute 34 second gap below most
> likely a stop-the-world GC?
>
> 2011-01-27 07:00:43,716 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
> Roll
> /hbase/.logs/x.x.x.6,60020,1295969329357/x.x.x.6%3A60020.1296111623011,
> entries=242, filesize=69508440. New hlog
> /hbase/.logs/x.x.x.6,60020,1295969329357/x.x.x.6%3A60020.1296111643436
> 2011-01-27 07:02:17,663 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception for block
> blk_-5705652521953118952_104835java.net.SocketTimeoutException: 69000
> millis
> timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/x.x.x.6:48141
> remote=/x.x.x.6:50010]
>
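With -XX:+PrintGCApplicationStoppedTime enabled (as in the sketch above), a
gap like that can be confirmed or ruled out directly from the GC log; a
rough sketch, assuming the illustrative log name used earlier and an
arbitrary 10-second threshold:

  # list JVM stopped-time entries longer than ~10 seconds
  grep "Total time for which application threads were stopped" /var/log/hbase/gc-hbase-*.log \
    | awk '{ if ($(NF-1) > 10) print }'
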
> It is followed by ZooKeeper complaining about the lack of a response.
>
> 2011-01-27 07:02:17,665 INFO org.apache.zookeeper.ClientCnxn: Client
> session
> timed out, have not heard from server in 94590ms for sessionid
> 0x2dbdc88700000e, closing socket connection and attempting reconnect
>
> There is also a message about the data node.
>
> 2011-01-27 07:02:17,665 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_4267667433820446273_104837 bad datanode[0]
> x.x.x.6:50010
>
> And eventually the region server is brought down.
>
> 2011-01-27 07:02:17,783 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server...
>
> The data node also shows some errors.
>
> 2011-01-27 07:02:17,667 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(x.x.x.6:50010,
> storageID=DS-1438948528-x.x.x.6-50010-1295969305669, infoPort=50075,
> ipcPort=50020):DataXceiver java.net.SocketException: Connection reset
>
>
> Any help, advice, ideas, or guesses would be greatly appreciated. Can
> anyone
> sustain 30-40k writes/node/sec for days/weeks on end without using the bulk
> loader? Am I rolling a rock uphill against the reality of the JVM?
>
> Thanks.
>
