hadoop-common-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2393) TaskTracker locks up removing job files within a synchronized method
Date Tue, 08 Apr 2008 02:57:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586632#action_12586632 ]

Joydeep Sen Sarma commented on HADOOP-2393:
-------------------------------------------

ping.

we see this every time a large job with a lot of mappers is killed (this one had 20K mappers
on 100 nodes, so about 200 map tasks per node): all other jobs start timing out. of course,
a legitimate question is whether we need so many mappers, but in this case the input
was arranged as a large number of files (and it's hard to train all users to use MultiFileInputFormat).

looking at the code, i can patch this for the most part. what i don't entirely understand
is:

purgeJob():

        for (TaskInProgress tip : rjob.tasks) {
          tip.jobHasFinished(false);
        }
        // Delete the job directory for this
        // task if the job is done/failed
        if (!rjob.keepJobFiles){
          fConf.deleteLocalFiles(SUBDIR + Path.SEPARATOR + JOBCACHE +
                                 Path.SEPARATOR +  rjob.getJobId());
        }


- is there a dependency requiring that deleteLocalFiles() happen only after the task cleanup?
(it's easy to make the task cleanup happen in the taskrunner thread itself by setting the
kill bit and turning the wait for process status into one with timeouts.)
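
For illustration, the "kill bit plus timed wait" idea could look roughly like the
following. This is a minimal sketch under stated assumptions, not the TaskRunner code:
the class KillableWait and its method names are hypothetical, and it assumes Java 8 or
later for Process.waitFor(long, TimeUnit).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: the thread that owns the child process also kills it. */
public class KillableWait {
    // The "kill bit": set by another thread, read by the runner thread.
    private final AtomicBoolean killRequested = new AtomicBoolean(false);

    /** Called by whoever wants the task gone; returns immediately, does no I/O. */
    public void requestKill() {
        killRequested.set(true);
    }

    /**
     * Runs in the runner thread. Instead of blocking indefinitely on the child,
     * it wakes up every pollMillis to check the kill bit.
     */
    public int waitForExit(Process child, long pollMillis) throws InterruptedException {
        boolean destroyed = false;
        while (!child.waitFor(pollMillis, TimeUnit.MILLISECONDS)) {
            if (killRequested.get() && !destroyed) {
                child.destroy();   // ask the child to terminate, once
                destroyed = true;
            }
        }
        // The child has exited; per-task cleanup could run here, in this thread.
        return child.exitValue();
    }
}
```

With this shape the caller that requests the kill never waits on the process; only the
runner thread does.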

any pointers appreciated ..

(changing the purgeJob routine into a non-synchronized method seems far riskier and harder
to me ..)
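
One way to keep the method synchronized while moving the slow part out of it is to have
it only queue the directory and let a background thread do the deletion. The sketch
below is an assumption-laden illustration, not a patch: DeferredCleanup, purge() and
deleteRecursively() are hypothetical names, it uses java.nio.file rather than the
Hadoop file APIs, and it does not settle whether deletion must wait for task cleanup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Stream;

/** Hypothetical sketch: defer directory deletion to a worker thread. */
public class DeferredCleanup {
    private final BlockingQueue<Path> pending = new LinkedBlockingQueue<>();

    public DeferredCleanup() {
        Thread worker = new Thread(this::drain, "deferred-cleanup");
        worker.setDaemon(true);
        worker.start();
    }

    /** Stand-in for the synchronized purge path: it only enqueues, no disk I/O. */
    public synchronized void purge(Path jobDir) {
        pending.add(jobDir);
    }

    /** Worker loop: takes queued directories and deletes them outside any lock. */
    private void drain() {
        try {
            while (true) {
                deleteRecursively(pending.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order visits children before their parent directories.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    // A real implementation would log this and carry on.
                }
            });
        } catch (IOException e) {
            // Same here: log and leave the directory for a later attempt.
        }
    }
}
```

The lock is then held only for the enqueue, so a job with a few hundred task
directories per node would not stall other callers while the files are removed.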




> TaskTracker locks up removing job files within a synchronized method 
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-2393
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2393
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.4
>         Environment: 0.13.1, quad-core x86-64, FC-linux. -xmx2048
> ipc.client.timeout = 10000
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> we have some bad jobs where the reduces are getting stalled (for unknown reasons). The
> task tracker kills these processes from time to time.
> Every time one of these events happens, other (healthy) map tasks on the same node are
> also killed. Looking at the logs and code up to 0.14.3, it seems that the child tasks' pings
> to the task tracker time out and the child task self-terminates.
> tasktracker log:
> // notice the good 10+ second gap in logs on otherwise busy node:
> 2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.
> 2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.
> 2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
> 24 active threads
> ... huge stack trace dump in logfile ...
> something was going on at this time which caused the tasktracker to essentially stall.
> all the pings are discarded. after the stack trace dump:
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (21380)
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (21380)
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (10367)
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (10360)
> 2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error
> looking at code, failure of client to ping causes termination:
>               else {
>                 // send ping
>                 taskFound = umbilical.ping(taskId);
>               }
> ...
>             catch (Throwable t) {
>               LOG.info("Communication exception: " + StringUtils.stringifyException(t));
>               remainingRetries -=1;
>               if (remainingRetries == 0) {
>                 ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);
>                 LOG.warn("Last retry, killing "+taskId);
>                 System.exit(65);
> exit code is 65 as reported by task tracker.
> i don't see an option to turn off the stack trace dump (which could be a likely cause),
> and i would hate to bump up the timeout because of this. Crap.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

