From: Meng Mao
Date: Fri, 5 Feb 2010 02:42:27 -0500
Subject: Re: EOFException and BadLink, but file descriptors number is ok?
To: common-user@hadoop.apache.org

I'm not sure what else I could be checking to see where the problem lies.
Should I be looking in the datanode logs? I looked briefly in there and
didn't see anything from around the time the exceptions started getting
reported. lsof during the job execution? Number of open threads? I'm at a
loss here.
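For the lsof idea, this is the kind of check I have in mind (an untested
sketch -- the slaves file location, passwordless ssh, and being able to
read the datanode's open-file list as this user are all assumptions on my
part):

    # count file descriptors held by the DataNode on each slave while the job runs
    for host in $(cat /usr/lib/hadoop/conf/slaves); do
      echo -n "$host: "
      ssh "$host" 'pid=$(pgrep -f datanode.DataNode); lsof -p "$pid" 2>/dev/null | wc -l'
    done

If the counts stay far below the 65536 open-files limit while the
exceptions are being thrown, that would suggest the problem isn't file
descriptors at all. I've put a second sketch, for the question about which
user's ulimit actually matters, below the quoted thread.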
On Thu, Feb 4, 2010 at 2:52 PM, Meng Mao wrote:

> I wrote a hadoop job that checks for ulimits across the nodes, and every
> node is reporting:
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 139264
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65536
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 139264
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Is anything in there telling about file number limits? From what I
> understand, a high open files limit like 65536 should be enough. I estimate
> only a couple thousand part-files on HDFS being written to at once, and
> around 200 on the filesystem per node.
>
> On Wed, Feb 3, 2010 at 4:04 PM, Meng Mao wrote:
>
>> also, which is the ulimit that's important, the one for the user who is
>> running the job, or the hadoop user that owns the Hadoop processes?
>>
>> On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao wrote:
>>
>>> I've been trying to run a fairly small input file (300MB) on Cloudera
>>> Hadoop 0.20.1. The job I'm using probably writes to on the order of over
>>> 1000 part-files at once, across the whole grid. The grid has 33 nodes in it.
>>> I get the following exception in the run logs:
>>>
>>> 10/01/30 17:24:25 INFO mapred.JobClient:  map 100% reduce 12%
>>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
>>> attempt_201001261532_1137_r_000013_0, Status : FAILED
>>> java.io.EOFException
>>>     at java.io.DataInputStream.readByte(DataInputStream.java:250)
>>>     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>>>     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>>>     at org.apache.hadoop.io.Text.readString(Text.java:400)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
>>>
>>> ....lots of EOFExceptions....
>>>
>>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
>>> attempt_201001261532_1137_r_000019_0, Status : FAILED
>>> java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
>>>
>>> 10/01/30 17:24:36 INFO mapred.JobClient:  map 100% reduce 11%
>>> 10/01/30 17:24:42 INFO mapred.JobClient:  map 100% reduce 12%
>>> 10/01/30 17:24:49 INFO mapred.JobClient:  map 100% reduce 13%
>>> 10/01/30 17:24:55 INFO mapred.JobClient:  map 100% reduce 14%
>>> 10/01/30 17:25:00 INFO mapred.JobClient:  map 100% reduce 15%
>>>
>>> From searching around, it seems like the most common cause of BadLink and
>>> EOFExceptions is when the nodes don't have enough file descriptors set.
>>> But across all the grid machines, the file-max has been set to 1573039.
>>> Furthermore, we set ulimit -n to 65536 using hadoop-env.sh.
>>>
>>> Where else should I be looking for what's causing this?
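To follow up below the quote on the other half of this -- whether the
daemons actually inherited the 65536 limit from hadoop-env.sh, and what
fs.file-max really is on each box -- this is the check I plan to run next
(again an untested sketch; the slaves path and passwordless ssh are
assumptions, and /proc/<pid>/limits only exists on 2.6.24+ kernels):

    # show the kernel file-max and the "Max open files" limit the running
    # DataNode and TaskTracker processes actually picked up, per slave
    for host in $(cat /usr/lib/hadoop/conf/slaves); do
      ssh "$host" '
        echo "== $(hostname) =="
        echo "fs.file-max: $(cat /proc/sys/fs/file-max)"
        for p in $(pgrep -f "datanode.DataNode|mapred.TaskTracker"); do
          grep "Max open files" /proc/$p/limits
        done
      '
    done

If the daemons report a much lower limit than 65536, then the ulimit line
in hadoop-env.sh probably isn't taking effect for the user that actually
starts the daemons.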