hbase-user mailing list archives

From prem yadav <ipremya...@gmail.com>
Subject region servers failing due to bad datanode
Date Mon, 20 Aug 2012 13:23:55 GMT

We have been facing some datanode-related issues lately that cause the
region servers to fail.
Our cluster setup is as follows:

Hadoop 1.0.1
HBase 0.94.1

All the machines run datanodes, tasktrackers, and region servers, plus
map-reduce jobs (rarely). They are all EC2 m1.large instances with 7.5 GB
of memory each; the region servers are assigned 4 GB.

It looks like, for some reason, the datanode fails to respond to the region
server's request for a block and a timeout exception occurs. This causes the
region server to fail.
In some cases we have also seen the datanode commit the block under a
different block name, which is evident from the logs.

In that case the region server keeps querying for the old block name and
gets an error along the lines of:

" org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
blk_8680479961374491733_745849 failed  because recovery from primary
datanode <ip-address>:50010 failed 6 times"

The logs we get on the region server are:

2012-08-20 00:03:28,821 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-7841650651979512601_775949 in pipeline <ip>:50010,
<ip>:50010, <ip>:50010: bad datanode <datanode_ip>:50010

2012-08-20 00:03:28,758 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-7841650651979512601_775949 bad datanode[0]

or something like the following:

 org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block
blk_-7841650651979512601_775949java.net.SocketTimeoutException: 69000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/<local_ip>:37227
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readLong(DataInputStream.java:416)
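If we read the DFSClient code correctly, the 69000 ms figure seems to be the read timeout dfs.socket.timeout (60 s by default) plus a 3 s extension per datanode in the 3-node pipeline, so we assume we are running with the stock defaults. For reference, a sketch of the client/datanode socket timeout properties involved (values shown are our understanding of the defaults, not recommendations):

```xml
<!-- hdfs-site.xml: socket timeout properties (defaults, as we understand them) -->
<property>
  <name>dfs.socket.timeout</name>
  <!-- read timeout in ms; the DFSClient appears to add ~3000 ms per pipeline node -->
  <value>60000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- write timeout in ms -->
  <value>480000</value>
</property>
```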

The namenode logs:

2012-08-20 00:03:29,446 INFO
newgenerationstamp=775977, newlength=32204106, newtargets=[<ip-address of
datanodes>], closeFile=false, deleteBlock=false)

2012-08-19:2012-08-19 23:59:18,995 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.allocateBlock:

Datanode logs:

2012-08-19:2012-08-19 23:59:18,999 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-7841650651979512601_775949 src: /<ip>:42937 dest: /<ip>:50010
2012-08-20 00:03:28,831 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
for block blk_-7841650651979512601_775949 java.io.EOFException: while
trying to read 65557 bytes
2012-08-20 00:03:28,831 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
blk_-7841650651979512601_775949 0 : Thread is interrupted.
2012-08-20 00:03:28,831 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
block blk_-7841650651979512601_775949 terminating
2012-08-20 00:03:28,831 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-7841650651979512601_775949 received exception java.io.EOFException:
while trying to read 65557 bytes
2012-08-20 00:03:29,264 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_-7841650651979512601_775949, targets=[<ip>:50010,
2012-08-20 00:03:29,440 INFO
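On the datanode side, the HBase documentation recommends raising the datanode transceiver limit for HBase workloads; we are not sure whether it is related to these failures, but for completeness this is what we have set (note the property name uses Hadoop's historical misspelling "xcievers"):

```xml
<!-- hdfs-site.xml on the datanodes: transceiver limit suggested by the HBase docs -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```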

We have seen multiple posts about this problem but could not find a
solution. We thought the region servers should be able to handle these
failures, but it looks like they aren't.
How do we resolve this? Is there some tuning we need to do for the
