Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <19994055.1187878230758.JavaMail.jira@brutus>
Date: Thu, 23 Aug 2007 07:10:30 -0700 (PDT)
From: "Enis Soztutar (JIRA)" <jira@apache.org>
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1158) JobTracker should collect
 statistics of failed map output fetches, and take decisions to reexecute
 map tasks and/or restart the (possibly faulty) Jetty server on the
 TaskTracker
In-Reply-To: <7318675.1174848452133.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522150 ] 

Enis Soztutar commented on HADOOP-1158:
---------------------------------------

The patch looks good, but I would like to mention another major issue here. 
There are some cases where the TaskTracker sends heartbeats, but the Jetty server cannot serve the map outputs. Recently we have seen the Jetty servers on some of the TaskTrackers fail to allocate new threads from the thread pool, emitting logs like:
{noformat}
  2007-08-23 09:31:46,378 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060
  2007-08-23 09:31:46,379 WARN org.mortbay.http.SocketListener: OUT OF THREADS: SocketListener0@0.0.0.0:50060
{noformat}

Moreover, HADOOP-1179 mentions OOM exceptions related to Jetty. We will try to find and eliminate the sources of Jetty-related leaks and bugs, but it is not likely that all of them will be resolved. There will be cases such as the one above where RPC responds but HTTP does not, so taking a "computer engineering approach" and solving the problem by restarting seems appropriate. 

Long story short, I think it would be great to do some bookkeeping in the JT about failed fetches per TT, and to send a reinit action to a TT whose count goes above some threshold. 
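
A minimal sketch of what such per-TT bookkeeping could look like, under stated assumptions. The class name FetchFailureTracker, its method names, and the threshold semantics are hypothetical and are not taken from the HADOOP-1158 patch; how the JT would actually deliver the reinit action is left out.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not from the HADOOP-1158 patch): counts failed
 * map-output fetches reported against each TaskTracker and signals when
 * a tracker's count goes above a configured threshold.
 */
class FetchFailureTracker {
    // Assumed policy: a reinit is warranted once the count is strictly above this value.
    private final int threshold;
    // Tracker name -> failed fetches reported since the last reinit signal.
    private final Map<String, Integer> failedFetches = new HashMap<>();

    FetchFailureTracker(int threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.threshold = threshold;
    }

    /**
     * Records one failed fetch against the given tracker.
     * Returns true if the caller should send a reinit action to that
     * tracker; in that case the tracker's count is reset to zero.
     */
    synchronized boolean recordFailedFetch(String trackerName) {
        int count = failedFetches.merge(trackerName, 1, Integer::sum);
        if (count > threshold) {
            failedFetches.remove(trackerName);
            return true;
        }
        return false;
    }

    /** Current failed-fetch count for the tracker (0 if none recorded). */
    synchronized int failedFetchCount(String trackerName) {
        return failedFetches.getOrDefault(trackerName, 0);
    }
}
```

In this sketch the caller would invoke recordFailedFetch each time a reducer reports a failed fetch from a given TT, and send the reinit action when it returns true. Resetting the count after signalling is one possible choice; a real implementation might also want to age out old failures.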

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch, HADOOP-1158_5_20070823.patch
>
>
> The JobTracker should keep track (with feedback from Reducers) of how many times a fetch for a particular map output has failed. If this exceeds a certain threshold, then that map should be declared lost and reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework more reliable: it will take care of (faulty) TaskTrackers that sometimes consistently fail to serve up map outputs (for which exceptions are not properly raised/handled, e.g., if the exception/problem happens in the Jetty server).
