hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-134) JobTracker trapped in a loop if it fails to localize a task
Date Thu, 13 Apr 2006 23:59:00 GMT
JobTracker trapped in a loop if it fails to localize a task
-----------------------------------------------------------

         Key: HADOOP-134
         URL: http://issues.apache.org/jira/browse/HADOOP-134
     Project: Hadoop
        Type: Bug

  Components: mapred  
    Versions: 0.1.0    
    Reporter: Runping Qi



The symptoms:

    When I ran jobs on a big cluster, I noticed that some jobs got stuck: some map tasks
never started. When I looked at the log of the task tracker responsible for those tasks, I
saw the following exception:

060413 160702 Lost connection to JobTracker [kry1040/72.30.116.100:50020].  Retrying...
java.io.IOException: No valid local directories in property: mapred.local.dir
        at org.apache.hadoop.conf.Configuration.getFile(Configuration.java:282)
        at org.apache.hadoop.mapred.JobConf.getLocalFile(JobConf.java:127)
        at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:391)
        at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:383)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:270)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:336)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:756)

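The exception occurs because the directory hadoop/mapred/local has the "wrong" owner,
so the task tracker cannot access it.
As a result, the task tracker gets stuck in the following loop, where every exception
thrown by offerService() is logged as a lost connection and retried after a 5-second sleep: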

            while (running) {
                boolean staleState = false;
                try {
                    // This while-loop attempts reconnects if we get network errors
                    while (running && ! staleState) {
                        try {
                            if (offerService() == STALE_STATE) {
                                staleState = true;
                            }
                        } catch (Exception ex) {
                            LOG.log(Level.INFO, "Lost connection to JobTracker ["
                                    + jobTrackAddr + "].  Retrying...", ex);
                            try {
                                Thread.sleep(5000);
                            } catch (InterruptedException ie) {
                            }
                        }
                    }
                } finally {
                    close();
                }
                LOG.info("Reinitializing local state");
                initialize();
            }

Issue 1:
    The offerService() method must catch and handle exceptions thrown by the new
TaskInProgress() call, and report back to the job tracker if it cannot run the task. This
way, the task can be assigned to another task tracker.

Issue 2:
    The task tracker should check that it can access the local dir at initialization
time, before accepting any tasks.
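
A rough sketch of the startup check suggested in Issue 2, using only java.io.File. It
assumes the value of mapred.local.dir is a comma-separated list of paths; the class
LocalDirCheckSketch and its methods are hypothetical and not part of Hadoop. The
readable/writable test is only an approximation of "accessible" (for example, it always
passes when running as root).

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop.
class LocalDirCheckSketch {

    /** Returns the entries of a comma-separated list that are readable, writable directories. */
    static List<File> usableDirs(String localDirs) {
        List<File> usable = new ArrayList<>();
        for (String name : localDirs.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty entries such as "a,,b"
            }
            File dir = new File(trimmed);
            if (dir.isDirectory() && dir.canRead() && dir.canWrite()) {
                usable.add(dir);
            }
        }
        return usable;
    }

    /** Fails fast at startup when none of the configured directories is usable. */
    static List<File> checkLocalDirs(String localDirs) throws IOException {
        List<File> usable = usableDirs(localDirs);
        if (usable.isEmpty()) {
            throw new IOException("No usable local directories in: " + localDirs);
        }
        return usable;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the JVM temp dir when no list is given on the command line.
        String dirs = args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir");
        System.out.println("Usable local dirs: " + checkLocalDirs(dirs));
    }
}
```

Calling something like checkLocalDirs() once during initialization would surface the
ownership problem before any task is accepted, rather than on the first localization.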


Runping


