From: Vinod Kumar Vavilapalli
Subject: Re: Error: Too Many Fetch Failures
Date: Tue, 19 Jun 2012 10:38:33 -0700
To: common-user@hadoop.apache.org

Replies/more questions inline.

> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet and each having solely a single hard disk. I am getting the following error repeatably for the TeraSort benchmark. TeraGen runs without error, but TeraSort runs predictably until this error pops up between 64% and 70% completion.
> This doesn't occur for every execution of the benchmark; about one out of four times that I run the benchmark, it does run to completion (TeraValidate included).

How many containers are you running per node?

> Error at the CLI:
> "12/06/10 11:17:50 INFO mapreduce.Job:  map 100% reduce 64%
> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id : attempt_1339331790635_0002_m_004337_0, Status : FAILED
> Container killed by the ApplicationMaster.
>
> Too Many fetch failures.Failing the attempt

Clearly maps are getting killed because of fetch failures. Can you look at the logs of the NodeManager where this particular map task ran? Those may show why reducers are not able to fetch map outputs. Because you have only one disk per node, it is possible that some of these nodes have bad or nonfunctional disks, and that is what is causing the fetch failures.

If that is the case, you can either take those nodes offline or bump up mapreduce.reduce.shuffle.maxfetchfailures to tolerate the failures; the default is 10 (see the sketch below). There are some other tweaks I can suggest if you can find more details in your logs.

HTH,
+Vinod
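For reference, a minimal sketch of the tweak suggested above as a cluster-wide mapred-site.xml entry, assuming the property name and default of 10 quoted in the reply; the value 30 is only an illustrative bump, not a recommended setting:

    <!-- mapred-site.xml: tolerate more shuffle fetch failures per map
         before the map attempt is declared failed (default is 10). -->
    <property>
      <name>mapreduce.reduce.shuffle.maxfetchfailures</name>
      <value>30</value>
    </property>

It should also be possible to pass the same setting per job through Hadoop's generic options (e.g. -Dmapreduce.reduce.shuffle.maxfetchfailures=30 on the TeraSort command line), since the example jobs are run through ToolRunner.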