hadoop-hdfs-user mailing list archives

From "Joseph Naegele" <jnaeg...@grierforensics.com>
Subject SocketTimeoutException in DataXceiver
Date Tue, 20 Dec 2016 16:11:55 GMT
Hi folks,

I'm experiencing the exact symptoms of HDFS-770 (https://issues.apache.org/jira/browse/HDFS-770)
using Spark and a basic HDFS deployment. Everything is running locally on a single machine.
I'm using Hadoop 2.7.3. My HDFS deployment consists of a single 8 TB disk with replication
disabled; otherwise everything is vanilla. My Spark job uses a Hive ORC writer to write a
dataset to disk. The dataset itself is < 100 GB uncompressed, ~17 GB compressed.
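
(For reference, "replication disabled" just means the usual single-node setting in
hdfs-site.xml, roughly:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>

and nothing else is changed from the defaults.)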

It does not appear to be a Spark issue. The datanode's logs show that it receives the first
~500 packets for a block, then nothing for a minute, at which point the default 60000 ms
channel read timeout triggers the exception:

2016-12-19 18:36:50,632 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock
BP-1695049761-192.168.2.211-1479228275669:blk_1073957413_216632 received exception java.net.SocketTimeoutException:
60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/127.0.0.1:50010 remote=/127.0.0.1:55866]
2016-12-19 18:36:50,632 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: lamport.grierforensics.com:50010:DataXceiver
error processing WRITE_BLOCK operation  src: /127.0.0.1:55866 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready
for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:55866]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        ...

On the Spark side, all is well until the datanode's socket exception surfaces as a
DFSOutputStream ResponseProcessor exception in the client, after which Spark aborts because
all datanodes are reported as bad:

2016-12-19 18:36:59.014 WARN DFSClient: DFSOutputStream ResponseProcessor exception  for block
BP-1695049761-192.168.2.211-1479228275669:blk_1073957413_216632
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)

...
Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1206)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)

I haven't tried adjusting the timeout yet, for the same reason given by the reporter of
HDFS-770: everything is running locally, with no other tasks on the system, so why would I
need a socket read timeout greater than 60 seconds? I haven't observed any CPU, memory or
disk bottlenecks.
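
(If I do end up raising it, my understanding is that the relevant knobs are the socket read
and write timeouts in hdfs-site.xml, something like:

    <property>
      <name>dfs.client.socket-timeout</name>
      <value>120000</value>
    </property>
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>240000</value>
    </property>

but that feels like masking the underlying stall rather than fixing it.)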

Lowering the number of cores used by Spark alleviates the problem but doesn't eliminate it,
which led me to believe the issue may be disk contention (i.e. too many concurrent writers?).
Then again, I haven't observed any disk I/O bottlenecks at all.
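
(To be concrete, "lowering the number of cores" just means capping Spark's parallelism, e.g.
something along the lines of

    spark-submit --master local[4] ...

or spark.cores.max / spark.executor.cores in a standalone setup, so that fewer concurrent
writers are hitting the single disk.)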

Does anyone else still experience HDFS-770 (https://issues.apache.org/jira/browse/HDFS-770)
and is there a general approach/solution?

Thanks

---
Joe Naegele
Grier Forensics


