Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
From: Hemanth Yamijala
To: mapreduce-user@hadoop.apache.org
Date: Fri, 25 Jun 2010 10:10:55 +0530

Hi,

> I've been getting the following error when trying to run a very simple
> MapReduce job. Map finishes without problem, but the error occurs as soon
> as it enters the Reduce phase.
>
> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> attempt_201006241812_0001_r_000000_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>
> I am running a 5 node cluster and I believe I have all my settings correct:
>
> * ulimit -n 32768
> * DNS/RDNS configured properly
> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> * mapred-site.xml : http://pastebin.com/JraVQZcW
>
> The program is very simple - just counts a unique string in a log file.
> See here: http://pastebin.com/5uRG3SFL
>
> When I run it, the job fails and I get the following output:
> http://pastebin.com/AhW6StEb
>
> However, it runs fine when I do *not* use substring() on the value (see
> the map function in the code above).
>
> This runs fine and completes successfully:
>             String str = val.toString();
>
> This causes the error and fails:
>             String str = val.toString().substring(0,10);
>
> Please let me know if you need any further information.
> It would be greatly appreciated if anyone could shed some light on this problem.

It stands out that changing the code to use a substring makes a
difference. Assuming that is consistent and not a red herring, can you
look at the counters for the two jobs in the JobTracker web UI - things
like map output records, map output bytes etc. - and see if there is a
noticeable difference? Also, are the two programs being run against
exactly the same input data?

Also, since the cluster is small, you could look at the tasktracker logs
on the machines where the maps ran to see whether there are any failures
around the time the reduce attempts start failing.

Thanks
Hemanth
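
For concreteness, here is a minimal sketch of the kind of mapper described
above - one that counts lines beginning with a fixed string - written against
the standard new-API Mapper. The class name, the target string and the length
guard are illustrative assumptions; this is not the code behind the pastebin
links.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch only - not the code from the pastebin links.
public class PrefixCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Hypothetical string being counted in the log file.
  private static final String TARGET = "2010-06-24";
  private static final IntWritable ONE = new IntWritable(1);
  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable key, Text val, Context context)
      throws IOException, InterruptedException {
    String str = val.toString();
    // Guard the substring call: lines shorter than 10 characters would
    // otherwise throw StringIndexOutOfBoundsException in the map.
    if (str.length() >= 10) {
      str = str.substring(0, 10);
    }
    if (str.equals(TARGET)) {
      outKey.set(str);
      context.write(outKey, ONE);
    }
  }
}

Comparing the "Map output records" and "Map output bytes" counters of the two
jobs, as suggested above, would show whether the substring version is feeding
noticeably different data into the shuffle.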