Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <176492628.1212733245225.JavaMail.jira@brutus>
Date: Thu, 5 Jun 2008 23:20:45 -0700 (PDT)
From: "Amareshwari Sriramadasu (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-2393) TaskTracker locks up removing job
 files within a synchronized method
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-2393:
--------------------------------------------

    Status: Open  (was: Patch Available)

Trying Hudson again.

> TaskTracker locks up removing job files within a synchronized method 
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-2393
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2393
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.4
>         Environment: 0.13.1, quad-core x86-64, FC-linux. -xmx2048
> ipc.client.timeout = 10000
>            Reporter: Joydeep Sen Sarma
>            Assignee: Amareshwari Sriramadasu
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: patch-2393.txt, patch-2393.txt, patch-2393.txt, patch-2393.txt
>
>
> We have some bad jobs whose reduces stall for an unknown reason. The task tracker kills these processes from time to time.
> Every time this happens, other (healthy) map tasks on the same node are killed as well. Judging from the logs and the code up to 0.14.3, the child tasks' pings to the task tracker time out and the child tasks self-terminate.
> tasktracker log:
> // note the 10+ second gap in the logs on an otherwise busy node:
> 2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.                                       
> 2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.                                        
> 2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941 
> 24 active threads                                                                                                                                   
> ... huge stack trace dump in logfile ...
> Something was going on at this time that caused the tasktracker to essentially stall, and all the pings were discarded. After the stack trace dump:
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (21380)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (21380)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (10367)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (10360)                                                                                                                
> 2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error     
> Looking at the code, a client's failure to ping causes termination:
>               else {                                                                                                                                
>                 // send ping                                                                                                                        
>                 taskFound = umbilical.ping(taskId);                                                                                                 
>               }                                                                                                                                     
> ...
>             catch (Throwable t) {                                                                                                                   
>               LOG.info("Communication exception: " + StringUtils.stringifyException(t));                                                            
>               remainingRetries -=1;                                                                                                                 
>               if (remainingRetries == 0) {                                                                                                          
>                 ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);                                                                   
>                 LOG.warn("Last retry, killing "+taskId);                                                                                            
>                 System.exit(65);                                                                                                                    
> The exit code is 65, as reported by the task tracker.
> I don't see an option to turn off the stack trace dump (which could be a likely cause), and I would hate to bump up the timeout because of this. Crap.
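
For illustration only, here is a minimal sketch of the general problem named in the issue title: slow file removal done while holding a lock that ping handlers also need. It is not the attached patch and does not use Hadoop's classes. TrackerSketch, addJob, purgeJob, and the "cleanup-sketch" thread are hypothetical names. The assumption is that moving the deletion to a background thread keeps the synchronized section short enough for pings to be answered in time.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for a tracker; NOT Hadoop's TaskTracker.
class TrackerSketch {
    private final Map<String, File> jobDirs = new HashMap<>();
    private final BlockingQueue<File> toDelete = new LinkedBlockingQueue<>();

    TrackerSketch() {
        // Background thread performs the slow disk I/O without the tracker lock.
        Thread cleaner = new Thread(() -> {
            try {
                while (true) {
                    deleteRecursively(toDelete.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "cleanup-sketch");
        cleaner.setDaemon(true);
        cleaner.start();
    }

    synchronized void addJob(String jobId, File dir) {
        jobDirs.put(jobId, dir);
    }

    // Shares the tracker lock with purgeJob, so it must never wait on disk I/O.
    synchronized boolean ping(String jobId) {
        return jobDirs.containsKey(jobId);
    }

    // Only bookkeeping happens under the lock; deletion is handed off.
    synchronized void purgeJob(String jobId) {
        File dir = jobDirs.remove(jobId);
        if (dir != null) {
            toDelete.add(dir);
        }
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles(); // null for plain files
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }

    public static void main(String[] args) throws Exception {
        TrackerSketch t = new TrackerSketch();
        File dir = Files.createTempDirectory("job_sketch").toFile();
        new File(dir, "part-0").createNewFile();
        t.addJob("job_1", dir);
        t.purgeJob("job_1"); // returns without waiting for the deletion
        System.out.println("job known after purge: " + t.ping("job_1"));
    }
}
```

With this shape, a purge of a large job directory delays only the cleanup thread. The trade-off is that files may linger briefly after purgeJob returns.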

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.