Message-ID: <44183E80.7090604@archive.org>
Date: Wed, 15 Mar 2006 08:19:12 -0800
From: Michael Stack
To: hadoop-dev@lucene.apache.org
Subject: Re: Hung job
In-Reply-To: <44161245.8060301@apache.org>

I ran overnight with the patch submitted to this list yesterday that adds a LogFormatter.resetLoggedSevere. Twice during the night the TaskTracker was restarted because map outputs failed checksum when a reducer came in to pick up map output parts. Each time the TaskTracker came back up... eventually. The interesting thing was that it took 9 and 12 restarts respectively, as the TaskTracker would restart anew because we didn't have the map output an incoming reducer was asking for (I'm assuming the incoming reducer had not yet been updated by the jobtracker about the new state of affairs).

This situation is a big improvement over how things used to work, but it seems as though we should try to avoid the TaskTracker start/stop churn. Possibilities:

1. Add a damper so the TaskTracker keeps its head down for a while, so it's not around when reducers come looking for missing map outputs, or
2. Don't have the map output file log severe if the taskid of the map part being requested is not one the TaskTracker knows about.

Neither of the above is very pretty. Any other suggestions? Otherwise I'll look into a patch to do a variation on 2. above.

Thanks,
St.Ack

Doug Cutting wrote:
> stack wrote:
>> Yes. Sounds like the right thing to do. Minor comments in the below.
>> Meantime, let me try it.
>
> Great. Please report on whether this works for you.
>
>> Should there be a 'throw e;' after TaskTracker.LOG.log above?
>
> Yes. You're right, there should be.
>
> Cheers,
>
> Doug
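
P.S. For what it's worth, here is a rough sketch of the shape 2. above might take. The class, field, and method names below are illustrative only (they are not the actual TaskTracker code); the point is just that a request for a map output the TaskTracker has no record of gets an ordinary failure at WARNING, rather than a SEVERE log entry that trips the logged-severe check and forces yet another restart:

  import java.io.IOException;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.logging.Level;
  import java.util.logging.Logger;

  // Illustrative sketch only -- not the real TaskTracker map output path.
  class MapOutputService {
    private static final Logger LOG = Logger.getLogger("MapOutputService");

    // Map output parts this tracker actually knows about (taskid -> file path).
    private final Map<String, String> knownMapOutputs =
        new ConcurrentHashMap<String, String>();

    String getMapOutput(String mapTaskId) throws IOException {
      String path = knownMapOutputs.get(mapTaskId);
      if (path == null) {
        // The reducer is working from stale jobtracker state; this isn't a
        // severe, restart-worthy condition on our side, so log a warning
        // and fail the fetch instead of logging SEVERE.
        LOG.log(Level.WARNING, "Request for unknown map output " + mapTaskId);
        throw new IOException("Unknown map task id " + mapTaskId);
      }
      return path;
    }
  }

That would keep the severe log reserved for conditions that genuinely warrant a restart, while reducers holding stale state just see a failed fetch they can retry once the jobtracker catches them up.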