From: Meng Mao
Date: Fri, 5 Feb 2010 02:42:27 -0500
Subject: Re: EOFException and BadLink, but file descriptors number is ok?
To: common-user@hadoop.apache.org

I'm not sure what else I could be checking to see where the problem lies.
Should I be looking in the datanode logs? I looked briefly in there and
didn't see anything from around the time the exceptions started getting
reported. lsof during the job execution? Number of open threads? I'm at a
loss here.
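For the lsof idea, this is the kind of check I have in mind (an untested
sketch -- the slaves file location, passwordless ssh, and being able to
read the datanode's open-file list as this user are all assumptions on my
part):

    # count file descriptors held by the DataNode on each slave while the job runs
    for host in $(cat /usr/lib/hadoop/conf/slaves); do
      echo -n "$host: "
      ssh "$host" 'pid=$(pgrep -f datanode.DataNode); lsof -p "$pid" 2>/dev/null | wc -l'
    done

If the counts stay far below the 65536 open-files limit while the
exceptions are being thrown, that would suggest the problem isn't file
descriptors at all. I've put a second sketch, for the question about which
user's ulimit actually matters, below the quoted thread.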
On Thu, Feb 4, 2010 at 2:52 PM, Meng Mao wrote:

> I wrote a hadoop job that checks for ulimits across the nodes, and every
> node is reporting:
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 139264
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65536
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 139264
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Is anything in there telling about file number limits? From what I
> understand, a high open files limit like 65536 should be enough. I estimate
> only a couple thousand part-files on HDFS being written to at once, and
> around 200 on the filesystem per node.
>
> On Wed, Feb 3, 2010 at 4:04 PM, Meng Mao wrote:
>
>> also, which is the ulimit that's important, the one for the user who is
>> running the job, or the hadoop user that owns the Hadoop processes?
>>
>> On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao wrote:
>>
>>> I've been trying to run a fairly small input file (300MB) on Cloudera
>>> Hadoop 0.20.1. The job I'm using probably writes to on the order of over
>>> 1000 part-files at once, across the whole grid. The grid has 33 nodes in it.
>>> I get the following exception in the run logs:
>>>
>>> 10/01/30 17:24:25 INFO mapred.JobClient:  map 100% reduce 12%
>>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
>>> attempt_201001261532_1137_r_000013_0, Status : FAILED
>>> java.io.EOFException
>>>     at java.io.DataInputStream.readByte(DataInputStream.java:250)
>>>     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>>>     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>>>     at org.apache.hadoop.io.Text.readString(Text.java:400)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
>>>
>>> ....lots of EOFExceptions....
>>>
>>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
>>> attempt_201001261532_1137_r_000019_0, Status : FAILED
>>> java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
>>>
>>> 10/01/30 17:24:36 INFO mapred.JobClient:  map 100% reduce 11%
>>> 10/01/30 17:24:42 INFO mapred.JobClient:  map 100% reduce 12%
>>> 10/01/30 17:24:49 INFO mapred.JobClient:  map 100% reduce 13%
>>> 10/01/30 17:24:55 INFO mapred.JobClient:  map 100% reduce 14%
>>> 10/01/30 17:25:00 INFO mapred.JobClient:  map 100% reduce 15%
>>>
>>> From searching around, it seems like the most common cause of BadLink and
>>> EOFExceptions is when the nodes don't have enough file descriptors set.
>>> But across all the grid machines, the file-max has been set to 1573039.
>>> Furthermore, we set ulimit -n to 65536 using hadoop-env.sh.
>>>
>>> Where else should I be looking for what's causing this?
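To follow up below the quote on the other half of this -- whether the
daemons actually inherited the 65536 limit from hadoop-env.sh, and what
fs.file-max really is on each box -- this is the check I plan to run next
(again an untested sketch; the slaves path and passwordless ssh are
assumptions, and /proc/<pid>/limits only exists on 2.6.24+ kernels):

    # show the kernel file-max and the "Max open files" limit the running
    # DataNode and TaskTracker processes actually picked up, per slave
    for host in $(cat /usr/lib/hadoop/conf/slaves); do
      ssh "$host" '
        echo "== $(hostname) =="
        echo "fs.file-max: $(cat /proc/sys/fs/file-max)"
        for p in $(pgrep -f "datanode.DataNode|mapred.TaskTracker"); do
          grep "Max open files" /proc/$p/limits
        done
      '
    done

If the daemons report a much lower limit than 65536, then the ulimit line
in hadoop-env.sh probably isn't taking effect for the user that actually
starts the daemons.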