From: Allen Wittenauer
Subject: Re: Stability issue - dead DN's
Date: Wed, 11 May 2011 09:45:44 -0700
Reply-To: general@hadoop.apache.org

On May 11, 2011, at 5:57 AM, Eric Fiala wrote:
>
> If we do the math that means [ map.tasks.max * mapred.child.java.opts ] +
> [ reduce.tasks.max * mapred.child.java.opts ] => or [ 4 * 2.5G ] + [ 4 *
> 2.5G ] is greater than the amount of physical RAM in the machine.
> This doesn't account for the base tasktracker and datanode process + OS
> overhead and whatever else may be hoarding resources on the systems.

+1 to what Eric said.

You've exhausted memory and now the whole system is falling apart.

> I would play with this ratio, either less maps / reduces max - or lower your
> child.java.opts so that when you are fully subscribed you are not using
> more resource than the machine can offer.

Yup.

> Also, setting mapred.reduce.slowstart.completed.maps to 1.00 or some other
> value close to 1 would be one way to guarantee only 4 either maps or reduces
> to be running at once and address (albeit in a duct tape like way) the
> oversubscription problem you are seeing (this represents the fractions of
> maps that should complete before initiating the reduce phase).

slowstart isn't really going to help you much here. All it takes is another job with the same settings running at the same time and processes will start dying again. That said, the default for slowstart is incredibly stupid for the vast majority. Something closer to .70 or .80 is more realistic.

>> * a 2x1GE bonded network interface for interconnects
>> * a 2x1GE bonded network interface for external access

Multiple NICs on a box can sometimes cause big performance problems with Hadoop. So watch your traffic carefully.
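
To put rough numbers on the memory math Eric quoted above, here is a minimal Python sketch. The function names are made up for illustration and are not Hadoop APIs. The 4 map slots, 4 reduce slots, and 2.5G child heap come from the thread; the 16 GB of physical RAM and the 2 GB allowance for the tasktracker, datanode, and OS are assumed values, since the thread does not state them.

```python
# Hypothetical helpers for a back-of-the-envelope check; not part of Hadoop.

def worst_case_task_heap_gb(map_slots, reduce_slots, child_heap_gb):
    """Heap demand if every map and reduce slot is busy at the same time."""
    return (map_slots + reduce_slots) * child_heap_gb


def is_oversubscribed(map_slots, reduce_slots, child_heap_gb,
                      physical_ram_gb, daemon_and_os_gb=0.0):
    """True when full slot usage plus daemon/OS overhead exceeds physical RAM."""
    demand = worst_case_task_heap_gb(map_slots, reduce_slots, child_heap_gb)
    return demand + daemon_and_os_gb > physical_ram_gb


if __name__ == "__main__":
    # Slot counts and heap size are from the thread: [ 4 * 2.5G ] + [ 4 * 2.5G ].
    demand = worst_case_task_heap_gb(4, 4, 2.5)
    print(demand)  # 20.0

    # ASSUMPTION: 16 GB of RAM and 2 GB of overhead are placeholder figures.
    print(is_oversubscribed(4, 4, 2.5, physical_ram_gb=16, daemon_and_os_gb=2))
```

Plugging in your own slot counts and heap size shows how far either one has to come down before a fully subscribed node fits in RAM. Note that child heap is only a lower bound on what each task JVM actually uses, so leave headroom.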