From: Idris Ali
To: common-user@hadoop.apache.org
Date: Fri, 27 Jan 2012 12:07:31 +0530
Subject: Re: Too many open files Error

Hi Mark,

As Harsh pointed out, it is not a good idea to raise the xceiver count to
an arbitrarily high value; I suggested increasing it only to unblock
execution of your program temporarily.
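For example, a more conservative setting along the lines Harsh suggests
below (the 4096 here is only an illustration; tune it to your cluster's
load) would go into hdfs-site.xml as:

    <property>
      <!-- Upper bound on concurrent block transfer (xceiver) threads per
           DataNode. The 1.0.0 default is 256; 2048-4096 is a reasonable
           range under heavy HDFS load or when running HBase. -->
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

Setting it to 1M effectively removes the safety limit altogether, which is
what Harsh is warning about.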
Thanks,
-Idris

On Fri, Jan 27, 2012 at 10:39 AM, Harsh J wrote:

> You are technically allowing the DN to run 1 million block transfer
> (in/out) threads by doing that. It does not take up resources by default,
> sure, but now it can be abused with requests that make your DN run out
> of memory and crash, because it is no longer bound to proper limits.
>
> On Fri, Jan 27, 2012 at 5:49 AM, Mark question wrote:
>
> > Harsh, could you explain briefly why the 1M setting for xceivers is
> > bad? The job is working now ... About ulimit -u: it shows 200703, so
> > is that why the connection is reset by peer? How come it's working
> > with the xceiver modification?
> >
> > Thanks,
> > Mark
> >
> > On Thu, Jan 26, 2012 at 12:21 PM, Harsh J wrote:
> >
> >> Agree with Raj V here - your problem should not be the # of transfer
> >> threads nor the number of open files, given that stacktrace.
> >>
> >> And the values you've set for the transfer threads are far beyond the
> >> recommendations of 4k/8k - I would not recommend doing that. The
> >> default in 1.0.0 is 256, but set it to 2048/4096, which are good
> >> values to have when noticing increased HDFS load, or when running
> >> services like HBase.
> >>
> >> You should instead focus on why it's this particular job (or even a
> >> particular task, which is important to notice) that fails, and not
> >> other jobs (or other task attempts).
> >>
> >> On Fri, Jan 27, 2012 at 1:10 AM, Raj V wrote:
> >> > Mark
> >> >
> >> > You have this "Connection reset by peer". Why do you think this
> >> > problem is related to too many open files?
> >> >
> >> > Raj
> >> >
> >> >> ________________________________
> >> >> From: Mark question
> >> >> To: common-user@hadoop.apache.org
> >> >> Sent: Thursday, January 26, 2012 11:10 AM
> >> >> Subject: Re: Too many open files Error
> >> >>
> >> >> Hi again,
> >> >> I've tried:
> >> >>
> >> >>   <property>
> >> >>     <name>dfs.datanode.max.xcievers</name>
> >> >>     <value>1048576</value>
> >> >>   </property>
> >> >>
> >> >> but I'm still getting the same error ... how high can I go??
> >> >>
> >> >> Thanks,
> >> >> Mark
> >> >>
> >> >> On Thu, Jan 26, 2012 at 9:29 AM, Mark question wrote:
> >> >>
> >> >>> Thanks for the reply ... I have nothing about
> >> >>> dfs.datanode.max.xceivers in my hdfs-site.xml, so hopefully this
> >> >>> will solve the problem. About ulimit -n: I'm running on an NFS
> >> >>> cluster, so usually I just start Hadoop with a single
> >> >>> bin/start-all.sh ... Do you think I can add it via
> >> >>> bin/Datanode -ulimit n ?
> >> >>>
> >> >>> Mark
> >> >>>
> >> >>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn
> >> >>> <mapred.learn@gmail.com> wrote:
> >> >>>
> >> >>>> You need to set ulimit -n on the datanodes and restart them.
> >> >>>>
> >> >>>> Sent from my iPhone
> >> >>>>
> >> >>>> On Jan 26, 2012, at 6:06 AM, Idris Ali wrote:
> >> >>>>
> >> >>>> > Hi Mark,
> >> >>>> >
> >> >>>> > On a lighter note, what is the count of xceivers, i.e. the
> >> >>>> > dfs.datanode.max.xceivers property in hdfs-site.xml?
> >> >>>> >
> >> >>>> > Thanks,
> >> >>>> > -idris
> >> >>>> >
> >> >>>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel
> >> >>>> > <michael_segel@hotmail.com> wrote:
> >> >>>> >
> >> >>>> >> Sorry, going from memory...
> >> >>>> >> As user Hadoop or mapred or hdfs, what do you see when you do
> >> >>>> >> a ulimit -a? That should give you the number of open files
> >> >>>> >> allowed for a single user...
> >> >>>> >>
> >> >>>> >> Sent from a remote device. Please excuse any typos...
> >> >>>> >>
> >> >>>> >> Mike Segel
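On the ulimit -n point above: the open-file limit is per user/process,
not the system-wide fs.file-max, so it has to be raised for whichever
account runs the DataNode, and the daemon restarted. A rough sketch on a
typical Linux box (the user name and values below are just examples):

    # As the account that runs the DataNode, check the current limits:
    ulimit -a                 # "open files (-n)" is the relevant line
    ulimit -u                 # max user processes

    # See how many descriptors the DataNode actually holds right now:
    DN_PID=$(jps | awk '/DataNode/ {print $1}')
    ls /proc/$DN_PID/fd | wc -l

    # To raise the limit persistently, add lines like these to
    # /etc/security/limits.conf, log in again, then restart the DataNode
    # (e.g. bin/hadoop-daemon.sh stop datanode, then start datanode):
    #
    #   hadoop  soft  nofile  32768
    #   hadoop  hard  nofile  32768

If the limit already looks generous, then as Raj and Harsh point out, the
"Connection reset by peer" in the trace below is more likely the real
issue than the number of open files.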
> >> >>>> >>
> >> >>>> >> On Jan 26, 2012, at 5:13 AM, Mark question wrote:
> >> >>>> >>
> >> >>>> >>> Hi guys,
> >> >>>> >>>
> >> >>>> >>> I get this error from a job trying to process 3Million records:
> >> >>>> >>>
> >> >>>> >>> java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010
> >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
> >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
> >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
> >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
> >> >>>> >>>
> >> >>>> >>> When I checked the logfile of datanode-20, I see:
> >> >>>> >>>
> >> >>>> >>> 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >>>> >>> DatanodeRegistration(192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369,
> >> >>>> >>> infoPort=50075, ipcPort=50020):DataXceiver
> >> >>>> >>> java.io.IOException: Connection reset by peer
> >> >>>> >>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
> >> >>>> >>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >> >>>> >>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> >> >>>> >>>   at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> >> >>>> >>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> >> >>>> >>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> >> >>>> >>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> >> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >> >>>> >>>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >> >>>> >>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >> >>>> >>>   at java.io.DataInputStream.read(DataInputStream.java:132)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> >> >>>> >>>   at java.lang.Thread.run(Thread.java:662)
> >> >>>> >>>
> >> >>>> >>> This is because I'm running 10 maps per TaskTracker on a
> >> >>>> >>> 20-node cluster; each map opens about 300 files, so that
> >> >>>> >>> should give 6000 open files at the same time ... why is this
> >> >>>> >>> a problem?
> >> >>>> >>> The maximum # of files per process on one machine is:
> >> >>>> >>>
> >> >>>> >>>   cat /proc/sys/fs/file-max  --->  2403545
> >> >>>> >>>
> >> >>>> >>> Any suggestions?
> >> >>>> >>>
> >> >>>> >>> Thanks,
> >> >>>> >>> Mark
> >>
> >> --
> >> Harsh J
> >> Customer Ops. Engineer, Cloudera
>
> --
> Harsh J
> Customer Ops. Engineer, Cloudera