From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Thu, 16 Mar 2006 22:30:12 +0000 (GMT)
Subject: [jira] Updated: (HADOOP-86) If corrupted map outputs, reducers get stuck fetching forever
Message-ID: <460543637.1142548212937.JavaMail.jira@ajax>
In-Reply-To: <1757119049.1142531938115.JavaMail.jira@ajax>

     [ http://issues.apache.org/jira/browse/HADOOP-86?page=all ]

Doug Cutting updated HADOOP-86:
-------------------------------

    Attachment: mapout.patch

Here's a completely
untested patch. It does compile! I don't think we need to add a new method to the InterTrackerProtocol; rather, we just need to get the failure propagated in the next heartbeat to the TaskTracker.

> If corrupted map outputs, reducers get stuck fetching forever
> -------------------------------------------------------------
>
>          Key: HADOOP-86
>          URL: http://issues.apache.org/jira/browse/HADOOP-86
>      Project: Hadoop
>         Type: Bug
>     Reporter: stack@archive.org
>  Attachments: mapout.patch
>
> In our rack, there is a machine that reliably corrupts map output parts. When reducers try to pick up the map output, Server#Handler checks the checksum, notices corruption, moves the bad map output part aside and throws a ChecksumException. Undeterred, the reducer comes back minutes later, only this time it gets a FileNotFoundException out of Server#Handler (because the part was moved aside). And so it goes till the cows come home.
> Doug applied a patch so that the map output file code, when it notices a fatal exception, logs a severe error on TaskTracker#LOG. Then in the TT, if a severe error has been logged, the TT does a soft restart (the TT stays up but closes down all services and then goes through init again). This patch was committed (after I suggested it was working); only later did I notice that the severe log flag is not cleared across TT restarts, so the TT goes into a cycle of continuous restarts.
> A further patch that clears the severe flag was posted to the list. This improves things but has issues too: on revival, the TT continues to be plagued, for a period of ten minutes or so, by reducers looking for parts that are no longer available, until the JobTracker gets around to updating them about the change in where to go get map outputs. During this period, the TT gets restarted 5-10 times, but eventually comes back on line (there may have been too much damage done during this period of flux, making it so the job will fail).
> This issue covers implementing a better solution.
> Suggestions include having the TT stay down for a period to avoid the incoming reducers, or somehow examining the incoming reducer request, checking its own list of tasks to see if it knows anything of the reducer's request, and rejecting it with a non-severe error if it is not for a map of the currently running TT. A little birdie (named DC) suggests a better soln. is probably an addition to InterTrackerProtocol so either the TT or the reducer updates the JT when it finds corrupted map output.
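
To make the discussion above concrete, here is a minimal sketch, in Java, of one possible reading of the "report it in the next heartbeat" idea combined with the "reject unknown parts with a non-severe error" suggestion. It is a toy model, not the contents of mapout.patch and not the Hadoop API: the names MapOutputFailureSketch, Tracker, Coordinator, FetchResult, fetch, heartbeat and onHeartbeat are all hypothetical, and the checksum test is reduced to a boolean.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only; none of these types are Hadoop classes. */
public class MapOutputFailureSketch {

    enum FetchResult { OK, CORRUPT, NOT_FOUND }

    /** Hypothetical stand-in for a TaskTracker serving map outputs. */
    static class Tracker {
        // map task id -> whether the stored output passes its checksum
        private final Map<String, Boolean> outputs = new HashMap<>();
        // failures waiting to be reported on the next heartbeat
        private final List<String> pendingFailures = new ArrayList<>();

        void addOutput(String mapId, boolean checksumOk) {
            outputs.put(mapId, checksumOk);
        }

        FetchResult fetch(String mapId) {
            Boolean ok = outputs.get(mapId);
            if (ok == null) {
                // Unknown or already moved aside: answer with a
                // non-severe result instead of triggering a restart.
                return FetchResult.NOT_FOUND;
            }
            if (!ok) {
                outputs.remove(mapId);      // "move the bad part aside"
                pendingFailures.add(mapId); // remember it for the heartbeat
                return FetchResult.CORRUPT;
            }
            return FetchResult.OK;
        }

        /** Returns and clears the failures collected since the last heartbeat. */
        List<String> heartbeat() {
            List<String> report = new ArrayList<>(pendingFailures);
            pendingFailures.clear();
            return report;
        }
    }

    /** Hypothetical stand-in for the JobTracker's bookkeeping. */
    static class Coordinator {
        final Set<String> toRerun = new HashSet<>();

        void onHeartbeat(List<String> failedMaps) {
            // Maps whose output was lost are scheduled to run again.
            toRerun.addAll(failedMaps);
        }
    }

    public static void main(String[] args) {
        Tracker tt = new Tracker();
        Coordinator jt = new Coordinator();
        tt.addOutput("map_0001", false);          // a corrupted part

        System.out.println(tt.fetch("map_0001")); // CORRUPT
        System.out.println(tt.fetch("map_0001")); // NOT_FOUND
        jt.onHeartbeat(tt.heartbeat());
        System.out.println(jt.toRerun);           // [map_0001]
    }
}
```

In this sketch no new protocol method is needed: the failed map id simply rides along with the existing heartbeat, and a repeated request for the missing part gets a quiet NOT_FOUND rather than a severe error that would restart the tracker.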