From: Tarandeep Singh
To: core-user@hadoop.apache.org
Date: Mon, 30 Jun 2008 14:07:57 -0700
Subject: Re: Too many fetch failures AND Shuffle error

I am getting this error as well. As Sayali mentioned in his mail, I updated
the /etc/hosts file with the slave machines' IP addresses, but I am still
getting this error.

Amar, which URL were you talking about in your mail -
"There will be a URL associated with a map that the reducer tries to fetch
(check the reducer logs for this url)"

Please tell me where I should look for it... I will try to access it
manually to see if this error is due to a firewall.

Thanks,
Taran
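The URL in question is served by the tasktracker's embedded web server on the
node that ran the map. A minimal sketch of the manual check, assuming the
0.16-era default HTTP port 50060 and the mapOutput servlet path; the host
name and task id below are placeholders - the exact URL should be copied from
the failing reducer's log instead.

import java.net.HttpURLConnection;
import java.net.URL;

// Probe a tasktracker's map-output URL from the reducer's machine.
// A timeout or "connection refused" here points at a firewall or a wrong
// /etc/hosts entry rather than at the job itself.
public class FetchCheck {
    public static void main(String[] args) {
        // Placeholder host and task id - copy the exact URL from the reducer log.
        String url = "http://slave-host:50060/mapOutput"
                   + "?map=task_200806201106_0001_m_000003_0&reduce=0";
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            System.out.println(url + " -> HTTP " + conn.getResponseCode());
        } catch (Exception e) {
            System.out.println(url + " -> FAILED: " + e);
        }
    }
}

If the request succeeds from the map's own node but times out from the
reducer's node, that matches the restricted-access explanation below.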
On Thu, Jun 19, 2008 at 11:43 PM, Amar Kamat wrote:
> Yeah. With 2 nodes the reducers will go up to 16% because the reducers are
> able to fetch maps from the same machine (locally) but fail to copy them
> from the remote machine. A common reason in such cases is *restricted
> machine access* (a firewall, etc.). The web server on a machine/node hosts
> the map outputs, which the reducers on the other machine are not able to
> access. There will be a URL associated with a map that the reducer tries to
> fetch (check the reducer logs for this URL). Just try accessing it manually
> from the reducer's machine/node. Most likely this experiment will also
> fail. Let us know if this is not the case.
> Amar
>
> Sayali Kulkarni wrote:
>
>>> Can you post the reducer logs. How many nodes are there in the cluster?
>>
>> There are 6 nodes in the cluster - 1 master and 5 slaves.
>> I tried to reduce the number of nodes, and found that the problem is
>> solved only if there is a single node in the cluster. So I can deduce
>> that the problem lies somewhere in the configuration.
>>
>> Configuration file:
>>
>> <configuration>
>>
>>   <property>
>>     <name>hadoop.tmp.dir</name>
>>     <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
>>     <description>A base for other temporary directories.</description>
>>   </property>
>>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://10.105.41.25:54310</value>
>>     <description>The name of the default file system. A URI whose
>>     scheme and authority determine the FileSystem implementation. The
>>     uri's scheme determines the config property (fs.SCHEME.impl) naming
>>     the FileSystem implementation class. The uri's authority is used to
>>     determine the host, port, etc. for a filesystem.</description>
>>   </property>
>>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>10.105.41.25:54311</value>
>>     <description>The host and port that the MapReduce job tracker runs
>>     at. If "local", then jobs are run in-process as a single map
>>     and reduce task.</description>
>>   </property>
>>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>2</value>
>>     <description>Default block replication. The actual number of
>>     replications can be specified when the file is created. The default
>>     is used if replication is not specified at create time.</description>
>>   </property>
>>
>>   <property>
>>     <name>mapred.child.java.opts</name>
>>     <value>-Xmx1048M</value>
>>   </property>
>>
>>   <property>
>>     <name>mapred.local.dir</name>
>>     <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
>>   </property>
>>
>>   <property>
>>     <name>mapred.map.tasks</name>
>>     <value>53</value>
>>     <description>The default number of map tasks per job. Typically set
>>     to a prime several times greater than the number of available hosts.
>>     Ignored when mapred.job.tracker is "local".</description>
>>   </property>
>>
>>   <property>
>>     <name>mapred.reduce.tasks</name>
>>     <value>7</value>
>>     <description>The default number of reduce tasks per job. Typically set
>>     to a prime close to the number of available hosts. Ignored when
>>     mapred.job.tracker is "local".</description>
>>   </property>
>>
>> </configuration>
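Note that the shuffle does not go through the jobtracker address in the
configuration above: each reducer pulls map output directly from the other
slaves' tasktracker web servers, so every node must be able to resolve and
connect to every other node. A minimal sketch of such a pairwise check, to be
run on each node; the slave names are hypothetical and the tasktracker HTTP
port 50060 is an assumed default.

import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

// From one node, check that every other slave's host name resolves
// (via /etc/hosts or DNS) and that its tasktracker web port accepts
// connections. Repeat the same check from each node in the cluster.
public class SlaveReachability {
    public static void main(String[] args) {
        // Hypothetical slave names - substitute the real entries from conf/slaves.
        String[] slaves = { "slave1", "slave2", "slave3", "slave4", "slave5" };
        int httpPort = 50060; // assumed tasktracker HTTP port
        for (String host : slaves) {
            try {
                InetAddress addr = InetAddress.getByName(host);
                Socket s = new Socket();
                s.connect(new InetSocketAddress(addr, httpPort), 5000);
                s.close();
                System.out.println(host + " -> " + addr.getHostAddress() + " : reachable");
            } catch (Exception e) {
                System.out.println(host + " : FAILED (" + e + ")");
            }
        }
    }
}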
>> ============
>> This is the output that I get when running the tasks with 2 nodes in the
>> cluster:
>>
>> 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to process : 1
>> 08/06/20 11:07:45 INFO mapred.JobClient: Running job: job_200806201106_0001
>> 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
>> 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
>> 08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
>> 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
>> 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
>> 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
>> 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
>> 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
>> 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
>> 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
>> 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
>> 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
>> 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
>> 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
>> 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
>> 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
>> 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
>> 08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:09:54 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000002_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:09:56 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000011_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:09:57 INFO mapred.JobClient:  map 95% reduce 9%
>> 08/06/20 11:09:59 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/06/20 11:10:04 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/06/20 11:10:07 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:10:09 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:12 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:10:14 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/06/20 11:10:17 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:10:24 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:10:29 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000001_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000003_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:10:32 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/06/20 11:10:37 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:42 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:10:47 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:10:52 INFO mapred.JobClient:  map 95% reduce 16%
>> 08/06/20 11:10:52 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000020_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:10:54 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:11:02 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000017_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:09 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:24 INFO mapred.JobClient:  map 95% reduce 17%
>> 08/06/20 11:11:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000007_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:27 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:32 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000012_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:34 INFO mapred.JobClient:  map 95% reduce 17%
>> 08/06/20 11:11:34 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:39 INFO mapred.JobClient:  map 91% reduce 18%
>> 08/06/20 11:11:39 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000002_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:41 INFO mapred.JobClient:  map 95% reduce 18%
>> 08/06/20 11:11:42 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:11:42 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000006_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:44 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:44 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_1, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:11:51 INFO mapred.JobClient:  map 100% reduce 18%
>> 08/06/20 11:11:54 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:11:59 INFO mapred.JobClient:  map 95% reduce 19%
>> 08/06/20 11:11:59 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000010_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:12:02 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:12:07 INFO mapred.JobClient:  map 100% reduce 20%
>> 08/06/20 11:12:08 INFO mapred.JobClient:  map 100% reduce 33%
>> 08/06/20 11:12:09 INFO mapred.JobClient:  map 100% reduce 47%
>> 08/06/20 11:12:11 INFO mapred.JobClient:  map 100% reduce 60%
>> 08/06/20 11:12:16 INFO mapred.JobClient:  map 100% reduce 62%
>> 08/06/20 11:12:24 INFO mapred.JobClient:  map 100% reduce 63%
>> 08/06/20 11:12:26 INFO mapred.JobClient:  map 100% reduce 64%
>> 08/06/20 11:12:31 INFO mapred.JobClient:  map 100% reduce 65%
>> 08/06/20 11:12:31 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_1, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:12:36 INFO mapred.JobClient:  map 100% reduce 66%
>> 08/06/20 11:12:38 INFO mapred.JobClient:  map 100% reduce 67%
>> 08/06/20 11:12:39 INFO mapred.JobClient:  map 100% reduce 80%
>>
>> ===============
>>
>>> Are you seeing this for all the maps and reducers?
>>
>> Yes, this happens on all the maps and reducers.
>> I tried to keep just 2 nodes in the cluster, but the problem still exists.
>>
>>> Are the reducers progressing at all?
>>
>> The reducers continue to execute up to a certain point, but after that
>> they just do not proceed at all. They just stop at an average of 16%.
>>
>>> Are all the maps that the reducer is failing to fetch from a remote machine?
>>
>> Yes.
>>
>>> Are all the failed maps/reducers from the same machine?
>>
>> All the maps and reducers are failing anyway.
>>
>> Thanks for the help in advance,
>>
>> Regards,
>> Sayali
>>
>> ---------------------------------
>> Sent from Yahoo! Mail.
>> A Smarter Email.