Message-ID: <44183E80.7090604@archive.org>
Date: Wed, 15 Mar 2006 08:19:12 -0800
From: Michael Stack
To: hadoop-dev@lucene.apache.org
Subject: Re: Hung job
In-Reply-To: <44161245.8060301@apache.org>

I ran overnight with the patch submitted to this list yesterday that adds a LogFormatter.resetLoggedSevere. Twice during the night the TaskTracker was restarted because map outputs failed checksum when a reducer came in to pick up map output parts. Each time the TaskTracker came back up... eventually. The interesting thing was that it took 9 and 12 restarts respectively, as the TaskTracker would restart anew because we didn't have the map output an incoming reducer was asking for (I'm assuming the incoming reducer had not yet been updated by the jobtracker about the new state of affairs).

This situation is a big improvement over how things used to work, but it seems as though we should try to avoid the TaskTracker start/stop churn. Possibilities:

1. Add a damper so the TaskTracker keeps its head down for a while, so it's not around when reducers come looking for missing map outputs, or
2. Don't have the map output file log severe if the taskid of the map part being requested is not one the TaskTracker knows about.

Neither of the above is very pretty. Any other suggestions? Otherwise I'll look into a patch to do a variation on 2. above.

Thanks,
St.Ack

Doug Cutting wrote:
> stack wrote:
>> Yes. Sounds like the right thing to do. Minor comments in the below.
>> Meantime, let me try it.
>
> Great. Please report on whether this works for you.
>
>> Should there be a 'throw e;' after TaskTracker.LOG.log above?
>
> Yes. You're right, there should be.
>
> Cheers,
>
> Doug
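
P.S. For what it's worth, here is a rough sketch of the shape 2. above might take. The class, field, and method names below are illustrative only (they are not the actual TaskTracker code); the point is just that a request for a map output the TaskTracker has no record of gets an ordinary failure at WARNING, rather than a SEVERE log entry that trips the logged-severe check and forces yet another restart:

  import java.io.IOException;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.logging.Level;
  import java.util.logging.Logger;

  // Illustrative sketch only -- not the real TaskTracker map output path.
  class MapOutputService {
    private static final Logger LOG = Logger.getLogger("MapOutputService");

    // Map output parts this tracker actually knows about (taskid -> file path).
    private final Map<String, String> knownMapOutputs =
        new ConcurrentHashMap<String, String>();

    String getMapOutput(String mapTaskId) throws IOException {
      String path = knownMapOutputs.get(mapTaskId);
      if (path == null) {
        // The reducer is working from stale jobtracker state; this isn't a
        // severe, restart-worthy condition on our side, so log a warning
        // and fail the fetch instead of logging SEVERE.
        LOG.log(Level.WARNING, "Request for unknown map output " + mapTaskId);
        throw new IOException("Unknown map task id " + mapTaskId);
      }
      return path;
    }
  }

That would keep the severe log reserved for conditions that genuinely warrant a restart, while reducers holding stale state just see a failed fetch they can retry once the jobtracker catches them up.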