hadoop-common-user mailing list archives

From Leon Mergen <l.p.mer...@solatis.com>
Subject SocketTimeoutException: timeout while waiting for channel to be ready for read
Date Wed, 11 Nov 2009 11:11:27 GMT
Hello,

The following problem has occurred three times for us in the past two weeks, twice of those
in the last 24 hours.

Our current setup is a single server that runs everything: namenode, datanode and client.
The client uses JNI as an interface to HDFS.
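
For context, the client's write path boils down to the Java equivalent of something like
the following (a simplified sketch using the plain FileSystem API, not our actual JNI code;
the path and payload are made up for illustration):

-----

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Assumes fs.default.name points at the local namenode (everything runs on one box)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical target path; the real client keeps streams open for long periods
        Path path = new Path("/data/events.log");
        FSDataOutputStream out = fs.create(path, true);
        try {
            // Roughly the call our wrapper reports as failing ("Call to
            // org.apache.hadoop.fs.FSDataOutputStream::write failed!") once the
            // pipeline to the single datanode has been declared bad.
            out.write("example record\n".getBytes("UTF-8"));
        } finally {
            out.close();
        }
    }
}

-----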

It seems as if, after a while of operation, the client becomes unable to write any more data
to the datanode and throws an exception. The symptoms look similar to http://issues.apache.org/jira/browse/HDFS-693.

Here are the relevant logs from the client:


-----

09/11/11 10:58:21 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
blk_-5973015382562856275_32855java.net.SocketTimeoutException: 63000 millis timeout while
waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:52549
remote=/127.0.0.1:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2367)

09/11/11 10:58:22 WARN hdfs.DFSClient: Error Recovery for block blk_-5973015382562856275_32855
bad datanode[0] 127.0.0.1:50010

Exception in thread "Thread-9" java.io.IOException: All datanodes 127.0.0.1:50010 are bad.
Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2493)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2046)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2209)
Call to org.apache.hadoop.fs.FSDataOutputStream::write failed!

-----

I'm not entirely sure what to make of this: it seems like some sort of hiccup, since the
datanode is available again shortly afterwards. As a matter of fact, I am unable to reproduce
this error in a controlled way; I've only noticed the application crash while not watching.

Here are some relevant logs from the DataNode:

-----

2009-11-11 10:56:30,429 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_6654737502435554218_32905 src: /127.0.0.1:33103 dest: /127.0.0.1:50010
2009-11-11 10:56:30,854 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /127.0.0.1:33103, dest: /127.0.0.1:50010, bytes: 488, op: HDFS_WRITE, cliID: DFSClient_attempt_200910192055_3176_r_000000_0,
srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_6654737502435554218_32905
2009-11-11 10:56:30,855 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
0 for block blk_6654737502435554218_32905 terminating
2009-11-11 10:56:36,184 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /127.0.0.1:50010, dest: /127.0.0.1:33161, bytes: 222249, op: HDFS_READ, cliID: DFSClient_attempt_200910192055_3178_m_000001_0,
srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_-5388231964351796801_32881
2009-11-11 10:56:41,440 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_698694794042576935_32906 src: /127.0.0.1:33208 dest: /127.0.0.1:50010
2009-11-11 10:56:41,463 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /127.0.0.1:33192, dest: /127.0.0.1:50010, bytes: 246338, op: HDFS_WRITE, cliID: DFSClient_-648412294,
srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_-5491664625164378664_32900
2009-11-11 10:56:41,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
0 for block blk_-5491664625164378664_32900 terminating
2009-11-11 10:56:41,737 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /127.0.0.1:33208, dest: /127.0.0.1:50010, bytes: 1757, op: HDFS_WRITE, cliID: DFSClient_attempt_200910192055_3177_r_000000_0,
srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_698694794042576935_32906
2009-11-11 10:56:41,746 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
0 for block blk_698694794042576935_32906 terminating
2009-11-11 10:58:27,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
receiveBlock for block blk_-5973015382562856275_32855 java.io.EOFException: while trying to
read 65557 bytes
2009-11-11 10:58:27,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
0 for block blk_-5973015382562856275_32855 Interrupted.
2009-11-11 10:58:27,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
0 for block blk_-5973015382562856275_32855 terminating
2009-11-11 10:58:27,518 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-6041681763208648701_32907 src: /127.0.0.1:33655 dest: /127.0.0.1:50010
2009-11-11 10:58:27,518 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_9134662523757703658_32907 src: /127.0.0.1:33688 dest: /127.0.0.1:50010
2009-11-11 10:58:27,519 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /127.0.0.1:50010, dest: /127.0.0.1:33533, bytes: 0, op: HDFS_READ, cliID: DFSClient_-1667266750,
srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_6654737502435554218_32905
2009-11-11 10:58:27,519 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-5973015382562856275_32855
received exception java.io.EOFException: while trying to read 65557 bytes
2009-11-11 10:58:27,531 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010,
storageID=DS-107694665-62.129.149.242-50010-1254143880412, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
        at java.lang.Thread.run(Thread.java:619)
2009-11-11 10:58:30,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /127.0.0.1:33655, dest: /127.0.0.1:50010, bytes: 9833, op: HDFS_WRITE, cliID: DFSClient_-648412294,
srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_-6041681763208648701_32907

-----

Again, not much useful information: the exception shown in these logs has a later timestamp
than the actual error that occurred in the client.

Now, I'm wondering about a few things:

 - What could be causing these problems? A network error seems out of the question, since
all communication is done via localhost 127.0.0.1, but I'm having trouble figuring out which
steps to take next to investigate the problem.

 - Would this problem (or this class of problems) be related to the fact that I'm only running
a single datanode? In other words, would the client still throw an exception if it had multiple
datanodes to choose from and only a single one failed to respond?

 - Given that the DataNode seems to return to normal operation some time after the error
occurred, would increasing the timeout from 63 seconds to, say, 300 seconds be a wise idea?
Are these kinds of "hiccups" to be expected during normal operation? (See the sketch after
these questions for the settings I have in mind.)
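
In case it helps, here is roughly what I have in mind for raising the timeouts on the client
side. This is only a sketch: I am assuming that dfs.socket.timeout and dfs.datanode.socket.write.timeout
are the relevant properties (with the 63000 ms figure being the 60 second default plus a small
per-datanode extension), so please correct me if I have the wrong knobs.

-----

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Assumption: the base read timeout (in ms) behind the
        // "63000 millis timeout" message; the default appears to be 60000.
        conf.setInt("dfs.socket.timeout", 300000);

        // Assumption: the corresponding write-side timeout (in ms).
        conf.setInt("dfs.datanode.socket.write.timeout", 300000);

        FileSystem fs = FileSystem.get(conf);
        // ... create/write streams with this FileSystem as before ...
    }
}

-----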


Thanks in advance!

Regards,

- Leon Mergen
