hbase-user mailing list archives

From Andrew Purtell <andrew.purt...@gmail.com>
Subject Re: bad datanode causes all regions not accessible
Date Mon, 04 Jun 2012 15:36:18 GMT
Use of Hadoop 0.20.2 is not advisable. HBase 0.20.6 is quite old. 

The latest production-ready release of Hadoop is 1.0.3. 

The latest production-ready release of HBase is 0.94.0, which is four major
revisions ahead of 0.20. 

Consider upgrading. 
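
If it would help while you decide, a quick way to check whether any user
region responds at all during such an incident is a one-row Get from a
standalone client. Below is a minimal sketch against the 0.94-era HTable
client API (the table name "usertable" and row key "row-0001" are
placeholders, substitute values from your own cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Availability probe: issue a single Get against one row and time it.
    public class RegionProbe {
      public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml (ZooKeeper quorum etc.) from the classpath.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");   // placeholder table name
        try {
          long start = System.currentTimeMillis();
          Result result = table.get(new Get(Bytes.toBytes("row-0001")));  // placeholder row key
          System.out.println("get returned " + (result.isEmpty() ? "no data" : "data")
              + " after " + (System.currentTimeMillis() - start) + " ms");
        } finally {
          table.close();
        }
      }
    }

If such a probe stalls for a long time instead of failing quickly, that
suggests the region server is blocked waiting on HDFS rather than having
actually lost the region.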

Best regards,

    - Andy

On Jun 4, 2012, at 6:40 AM, Ravi Prakash <ravihadoop@gmail.com> wrote:

> Hi Maoke,
> 
> Thanks for the report. It's probably more relevant to the hbase user list.
> 
> 
> On Fri, Jun 1, 2012 at 6:42 AM, Maoke <fibrib@gmail.com> wrote:
> 
>> hi,
>> 
>> we encountered the following phenomenon:
>> 
>> 1. exception on one of the 8 datanodes:
>> 
>> 2012-05-20 17:13:49,854 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_9016752944216030896_4468475 src: /192.168.128.114:41922 dest: /192.168.128.114:50010
>> 2012-05-20 17:15:19,910 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_9016752944216030896_4468475 2 Exception java.net.SocketException: Connection reset
>>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>>  at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>  at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
>>  at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:132)
>>  at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:875)
>>  at java.lang.Thread.run(Thread.java:619)
>> 
>> after that, this datanode seemed to hang without any further logging until
>> we rebooted it.
>> 
>> 2. meanwhile, the namenode said:
>> 
>> 2012-05-20 17:19:11,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.128.114:50010
>> 2012-05-20 17:19:11,646 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /dc1/switch1/rack1/node6/192.168.128.114:50010
>> 
>> 3. the regionserver co-located with this datanode recorded:
>> 
>> 2012-05-20 17:14:37,295 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.128.114:41922 remote=/192.168.128.114:50010]
>>  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>  at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>  at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>  at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2314)
>> 
>> 2012-05-20 17:14:37,295 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9016752944216030896_4468475 bad datanode[0] 192.168.128.114:50010
>> 2012-05-20 17:14:37,295 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9016752944216030896_4468475 in pipeline 192.168.128.114:50010, 192.168.128.104:50010, 192.168.128.105:50010: bad datanode 192.168.128.114:50010
>> 
>> 4. until we rebooted wk008 (the node running the hung datanode), none of
>> the user regions were accessible through scan/get/put; we didn't scan
>> -ROOT- or .META. in the shell, but the master log doesn't show anything
>> abnormal during that period.
>> 
>> our environment is running hadoop 0.20.2 + hbase 0.20.6 + zookeeper 3.2.2.
>> 
>> has anyone encountered a similar problem? in particular, why does the
>> client's connection reset make the datanode die? and why does a single
>> dead datanode make all the user regions inaccessible?
>> 
>> thanks a lot,
>> maoke
>> 
