hadoop-hdfs-dev mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-7441) More accurate slow node detection in HDFS write pipeline
Date Tue, 25 Nov 2014 06:40:12 GMT
Ming Ma created HDFS-7441:
-----------------------------

             Summary: More accurate slow node detection in HDFS write pipeline
                 Key: HDFS-7441
                 URL: https://issues.apache.org/jira/browse/HDFS-7441
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma


A DN can become slow due to OS or HW issues, and the HDFS write pipeline sometimes fails to
identify the slow DN correctly.

In the following example, the MR task runs on 1.2.3.4, which is also the slow DN that should
have been removed. Instead, HDFS took the healthy DN 5.6.7.8 out of the pipeline. With the
rebuilt pipeline, HDFS went on to take out the newly added healthy DN 9.10.11.12, and so on.
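
As one possible direction (this is not existing HDFS code; the class name, smoothing factor,
and threshold below are assumptions for illustration only): the client could keep per-datanode
ack latency statistics and flag the node whose smoothed latency stays persistently high, rather
than removing whichever downstream node reports an error first.

{code:java}
// Hypothetical sketch of per-datanode ack latency tracking; all names and
// constants are assumed, not taken from the HDFS code base.
import java.util.HashMap;
import java.util.Map;

class AckLatencyTracker {
  // Exponentially weighted moving average of ack latency per datanode.
  private final Map<String, Double> ewmaMillis = new HashMap<>();
  private static final double ALPHA = 0.2;                 // smoothing factor (assumed)
  private static final double SLOW_THRESHOLD_MS = 30000.0; // slow-node threshold (assumed)

  /** Record the observed ack latency for one datanode in the pipeline. */
  void record(String datanode, long latencyMillis) {
    ewmaMillis.merge(datanode, (double) latencyMillis,
        (oldVal, newVal) -> ALPHA * newVal + (1 - ALPHA) * oldVal);
  }

  /** Return the datanode whose smoothed latency exceeds the threshold, if any. */
  String slowestAboveThreshold() {
    String slowest = null;
    double max = SLOW_THRESHOLD_MS;
    for (Map.Entry<String, Double> e : ewmaMillis.entrySet()) {
      if (e.getValue() > max) {
        max = e.getValue();
        slowest = e.getKey();
      }
    }
    return slowest;
  }
}
{code}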

DFSClient log on 1.2.3.4
{noformat}
2014-11-19 20:50:22,601 WARN [ResponseProcessor for block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_1157561391_1102030131492
java.io.IOException: Bad response ERROR for block blk_1157561391_1102030131492 from datanode 5.6.7.8:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:823)
2014-11-19 20:50:22,977 WARN [DataStreamer for file ...  block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: Error Recovery for blk_1157561391_1102030131492 in pipeline 1.2.3.4:50010, 5.6.7.8:50010: bad datanode 5.6.7.8:50010
{noformat}
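
From the DFSClient log above, the node that gets removed is the one that reported an ERROR
status in the downstream ack (5.6.7.8), even though, per the DN logs below, 5.6.7.8 errored
only because it timed out reading from the slow host 1.2.3.4. A minimal, simplified
illustration of that selection logic (assumed names; not the actual DFSOutputStream code):

{code:java}
// Simplified illustration: the first pipeline node whose ack status is ERROR
// is treated as the bad datanode, so a healthy downstream node that times out
// waiting on a slow upstream writer can be blamed instead of the slow node.
enum Status { SUCCESS, ERROR }

class PipelineAckExample {
  static int firstErrorIndex(Status[] replies) {
    for (int i = 0; i < replies.length; i++) {
      if (replies[i] == Status.ERROR) {
        return i; // this node will be removed from the pipeline
      }
    }
    return -1; // no error reported
  }

  public static void main(String[] args) {
    // Pipeline: index 0 = 1.2.3.4 (slow host), index 1 = 5.6.7.8 (healthy).
    // 5.6.7.8 reports ERROR because it timed out reading from 1.2.3.4,
    // yet it is the node that gets removed.
    Status[] replies = { Status.SUCCESS, Status.ERROR };
    System.out.println("bad datanode index = " + firstErrorIndex(replies));
  }
}
{code}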

DN Log on 1.2.3.4
{noformat}
2014-11-19 20:49:56,539 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock blk_1157561391_1102030131492 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
...
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
{noformat}


DN Log on 5.6.7.8
{noformat}
2014-11-19 20:49:56,275 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for blk_1157561391_1102030131492
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/5.6.7.8:50010 remote=/1.2.3.4:48858]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:745)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
