hadoop-common-user mailing list archives

From Meng Mao <meng...@gmail.com>
Subject EOFException and BadLink, but file descriptors number is ok?
Date Wed, 03 Feb 2010 00:29:42 GMT
I've been trying to run a job over a fairly small input file (300 MB) on Cloudera
Hadoop 0.20.1. The job probably writes to somewhere over 1,000 part-files at once,
across the whole grid, which has 33 nodes. I get the following exceptions in the run
logs:

10/01/30 17:24:25 INFO mapred.JobClient:  map 100% reduce 12%
10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
attempt_201001261532_1137_r_000013_0, Status : FAILED
java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
    at org.apache.hadoop.io.Text.readString(Text.java:400)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)

....lots of EOFExceptions....

10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
attempt_201001261532_1137_r_000019_0, Status : FAILED
java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)

10/01/30 17:24:36 INFO mapred.JobClient:  map 100% reduce 11%
10/01/30 17:24:42 INFO mapred.JobClient:  map 100% reduce 12%
10/01/30 17:24:49 INFO mapred.JobClient:  map 100% reduce 13%
10/01/30 17:24:55 INFO mapred.JobClient:  map 100% reduce 14%
10/01/30 17:25:00 INFO mapred.JobClient:  map 100% reduce 15%

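For illustration of the kind of fan-out described above: one common way a single job
ends up with that many HDFS output streams open at once is when each reducer writes
through something like the old-API MultipleOutputs. The sketch below is purely
hypothetical (the class and output names are made up, and the actual job may do this
differently); it only shows the general shape that produces one open part-file stream
per key "category" per reducer.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Hypothetical reducer: every distinct key "category" gets its own collector,
// so a single reduce task can hold many DFSOutputStreams open at the same time.
public class FanOutReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  public void configure(JobConf conf) {
    // Job setup elsewhere would have registered the multi-named output, e.g.:
    //   MultipleOutputs.addMultiNamedOutput(conf, "cat",
    //       TextOutputFormat.class, Text.class, Text.class);
    mos = new MultipleOutputs(conf);
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Multi-named output names must be alphanumeric, so sanitize the key.
    String category = key.toString().replaceAll("[^A-Za-z0-9]", "");
    while (values.hasNext()) {
      // One open part-file per category per reducer.
      mos.getCollector("cat", category, reporter).collect(key, values.next());
    }
  }

  public void close() throws IOException {
    mos.close();  // flush and close every open part-file stream
  }
}
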
From searching around, it seems like the most common cause of BadLink and
EOFExceptions is nodes that don't have a high enough file descriptor limit set. But
across all the grid machines, the kernel's fs.file-max has been set to 1573039.
Furthermore, we set ulimit -n to 65536 via hadoop-env.sh.
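
A minimal stand-alone check along these lines (a sketch only, not part of the actual
setup) can be run on a worker node to confirm those limits: it reads the kernel's
fs.file-max and counts the descriptors currently open by the process.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical helper, not from the actual setup: run on a worker node to
// confirm the limits described above.
public class FdLimitCheck {
  public static void main(String[] args) throws IOException {
    // System-wide descriptor limit; should read back 1573039 if the setting stuck.
    BufferedReader fileMax = new BufferedReader(new FileReader("/proc/sys/fs/file-max"));
    try {
      System.out.println("fs.file-max     = " + fileMax.readLine());
    } finally {
      fileMax.close();
    }

    // Descriptors currently open by this JVM; compare against the per-process
    // ulimit -n of 65536 that hadoop-env.sh is expected to give the task JVMs.
    String[] fds = new File("/proc/self/fd").list();
    System.out.println("open fds (self) = " + (fds == null ? "unknown" : fds.length));
  }
}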

Where else should I be looking for what's causing this?
