hadoop-mapreduce-issues mailing list archives

From "Esteban Gutierrez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2592) TT should fail task immediately if userlog dir cannot be created
Date Tue, 14 Jun 2011 18:58:47 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049336#comment-13049336 ]

Esteban Gutierrez commented on MAPREDUCE-2592:
----------------------------------------------

Once a single TaskTracker (TT) has reached that state and more jobs are submitted, the
problem propagates very quickly to all the other nodes. It can bring down the whole cluster,
since all the TTs will be blacklisted.

A sample stacktrace:

11/02/05 10:00:01 WARN mapred.JobClient: Error reading task outputhttp://dn:50060/tasklog?plaintext=true&taskid=attempt_201102050901_1000_m_000001_0&filter=stderr

11/02/05 10:00:02 INFO mapred.JobClient: Task Id : attempt_201102050901_1000_m_000001_0, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:471)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:458)



> TT should fail task immediately if userlog dir cannot be created
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-2592
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2592
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: tasktracker
>    Affects Versions: 0.23.0
>            Reporter: Todd Lipcon
>             Fix For: 0.23.0
>
>
> Currently, TaskRunner will log the message "mkdirs failed. Ignoring" if it fails to mkdir
> the userlog directory for a task. Then, it goes on to spawn taskjvm.sh which tries to redirect
> output into the userlogs dir, thus failing with exit code 1. This leads to error messages
> that are very hard to diagnose ("task failed with exit status 1") in cases where the userlog
> directory has either become inaccessible or has reached the maximum number of dirents (32000
> in ext3).
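
The fail-fast behaviour requested above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the actual TaskRunner code or the
patch for this issue: the class name UserLogDirs, the method createOrFail, and the
wording of the exception message are all hypothetical. The idea is to check the
result of mkdirs() and raise a descriptive IOException before any child process is
spawned, instead of logging "mkdirs failed. Ignoring" and continuing.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of Hadoop: creates a per-attempt log
// directory and fails immediately if that is not possible.
final class UserLogDirs {

    static File createOrFail(File userLogRoot, String attemptId) throws IOException {
        File dir = new File(userLogRoot, attemptId);

        // mkdirs() returns false on failure and also when the directory
        // already exists, so re-check isDirectory() before giving up.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            // list() returns null if the root is missing, unreadable,
            // or not a directory.
            String[] entries = userLogRoot.list();
            String count = (entries == null) ? "unknown" : String.valueOf(entries.length);

            // Report the likely causes up front so the failure is not
            // reduced to a bare nonzero exit status later on.
            throw new IOException("Could not create userlog dir " + dir
                    + " (root is directory=" + userLogRoot.isDirectory()
                    + ", root writable=" + userLogRoot.canWrite()
                    + ", entries in root=" + count + ")");
        }
        return dir;
    }
}
```

A caller would invoke createOrFail before launching the task JVM and mark the attempt
as failed with the exception message. Including the entry count in the message is a
design choice aimed at the ext3 case mentioned in the description, where the number
of entries in the userlog root approaches the filesystem limit.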

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
