hadoop-common-user mailing list archives

From Meng Mao <meng...@gmail.com>
Subject Re: EOFException and BadLink, but file descriptors number is ok?
Date Thu, 04 Feb 2010 19:52:20 GMT
I wrote a Hadoop job that checks ulimits across the nodes (a rough sketch
of the check follows the output), and every node is reporting:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 139264
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
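
Here's roughly what the check looks like, for reference: a minimal sketch
against the old 0.20 mapred API (the class name is mine). Each map task
shells out to ulimit and tags the result with its hostname:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.InetAddress;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class UlimitCheckMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // ulimit is a shell builtin, so it has to run through a shell
        Process p = Runtime.getRuntime().exec(
            new String[] {"bash", "-c", "ulimit -a"});
        BufferedReader r = new BufferedReader(
            new InputStreamReader(p.getInputStream()));
        StringBuilder limits = new StringBuilder();
        String line;
        while ((line = r.readLine()) != null) {
          limits.append(line).append('\n');
        }
        // key by hostname so the output can be grouped per node
        out.collect(new Text(InetAddress.getLocalHost().getHostName()),
                    new Text(limits.toString()));
      }
    }

Run it over any small input; with the hostname as the key, a single
reducer groups the listings per node.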


Is anything in that output telling about file descriptor limits? From what
I understand, an open files limit of 65536 should be plenty. I estimate
only a couple thousand part-files on HDFS being written to at once across
the grid, and around 200 open on the local filesystem per node.
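
To cross-check those estimates, a quick way to count a process's open
descriptors on Linux is to list its /proc/<pid>/fd directory. A throwaway
helper (the name and argument handling are mine):

    import java.io.File;

    public class FdCount {
      public static void main(String[] args) {
        // every entry under /proc/<pid>/fd is one open descriptor;
        // "self" counts this JVM, pass a DataNode pid to check a daemon
        String pid = args.length > 0 ? args[0] : "self";
        File[] fds = new File("/proc/" + pid + "/fd").listFiles();
        System.out.println("open fds: " + (fds == null ? "n/a" : fds.length));
      }
    }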

On Wed, Feb 3, 2010 at 4:04 PM, Meng Mao <mengmao@gmail.com> wrote:

> Also, which ulimit is the one that matters: the limit for the user who is
> running the job, or the one for the hadoop user that owns the Hadoop
> processes?
>
>
> On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao <mengmao@gmail.com> wrote:
>
>> I've been trying to run a fairly small input file (300MB) on Cloudera
>> Hadoop 0.20.1. The job probably writes to on the order of 1000 part-files
>> at once, across the whole grid of 33 nodes. I get the following exception
>> in the run logs:
>>
>> 10/01/30 17:24:25 INFO mapred.JobClient:  map 100% reduce 12%
>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
>> attempt_201001261532_1137_r_000013_0, Status : FAILED
>> java.io.EOFException
>>     at java.io.DataInputStream.readByte(DataInputStream.java:250)
>>     at
>> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>>     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>>     at org.apache.hadoop.io.Text.readString(Text.java:400)
>>     at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
>>     at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
>>     at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
>>     at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
>>
>> ....lots of EOFExceptions....
>>
>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
>> attempt_201001261532_1137_r_000019_0, Status : FAILED
>> java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
>>     at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
>>     at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
>>      at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
>>     at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
>>
>> 10/01/30 17:24:36 INFO mapred.JobClient:  map 100% reduce 11%
>> 10/01/30 17:24:42 INFO mapred.JobClient:  map 100% reduce 12%
>> 10/01/30 17:24:49 INFO mapred.JobClient:  map 100% reduce 13%
>> 10/01/30 17:24:55 INFO mapred.JobClient:  map 100% reduce 14%
>> 10/01/30 17:25:00 INFO mapred.JobClient:  map 100% reduce 15%
>>
>> From searching around, the most common cause of BadLink and EOFExceptions
>> seems to be nodes running out of file descriptors. But across all the grid
>> machines, the kernel's fs.file-max is set to 1573039. Furthermore, we set
>> ulimit -n to 65536 in hadoop-env.sh (the relevant line is quoted below).
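>>
>> For reference, that's just the shell builtin near the top of the script
>> (assuming a stock layout), so it only raises the limit for processes
>> launched through hadoop-env.sh:
>>
>>     ulimit -n 65536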
>>
>> Where else should I be looking for what's causing this?
>>
>
>
