hadoop-mapreduce-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2846) approx 10% of all tasks fail with DefaultTaskController
Date Thu, 18 Aug 2011 22:00:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087331#comment-13087331 ]

Allen Wittenauer commented on MAPREDUCE-2846:
---------------------------------------------

Some relevant properties:

  <property>
    <name>mapred.local.dir</name>
    <value>/grid/a/mapred/local,/grid/b/mapred/local,/grid/c/mapred/local,/grid/d/mapred/local,/grid/e/mapred/local,/grid/f/mapred/local</value>
  </property>

  <property>
    <name>hadoop.job.history.user.location</name>
    <value>none</value>
    <final>true</final>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/grid/a/mapred/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
    <final>true</final>
  </property>

The permissions on these dirs are 775.  User and group match the user we run the tasktracker
as.  (So, with DefaultTaskController, this should work just fine.)
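
For anyone who wants to verify the same thing on their own nodes, here is a minimal Python sketch that walks a comma-separated mapred.local.dir value and reports any directory whose mode, owner, or group differs from what is expected. The helper name check_local_dirs and its parameters are made up for illustration and are not part of Hadoop; the uid/gid comparison assumes the script runs as the tasktracker user.

```python
import os
import stat


def check_local_dirs(value, expected_mode=0o775, expected_uid=None, expected_gid=None):
    """Return a list of (path, problem) tuples for a comma-separated dir list.

    Hypothetical helper: 'value' is the raw string from mapred.local.dir.
    """
    problems = []
    for path in (p.strip() for p in value.split(",")):
        if not path:
            continue
        try:
            st = os.stat(path)
        except OSError as exc:
            problems.append((path, "cannot stat: %s" % exc))
            continue
        if not stat.S_ISDIR(st.st_mode):
            problems.append((path, "not a directory"))
            continue
        # S_IMODE strips the file-type bits and leaves the permission bits.
        mode = stat.S_IMODE(st.st_mode)
        if mode != expected_mode:
            problems.append((path, "mode is %o, expected %o" % (mode, expected_mode)))
        if expected_uid is not None and st.st_uid != expected_uid:
            problems.append((path, "owner uid is %d, expected %d" % (st.st_uid, expected_uid)))
        if expected_gid is not None and st.st_gid != expected_gid:
            problems.append((path, "group gid is %d, expected %d" % (st.st_gid, expected_gid)))
    return problems


if __name__ == "__main__":
    # Value copied from the mapred.local.dir property above.
    local_dirs = (
        "/grid/a/mapred/local,/grid/b/mapred/local,/grid/c/mapred/local,"
        "/grid/d/mapred/local,/grid/e/mapred/local,/grid/f/mapred/local"
    )
    # Assumption: this runs as the same user (and primary group) as the tasktracker.
    for path, problem in check_local_dirs(
        local_dirs, expected_uid=os.getuid(), expected_gid=os.getgid()
    ):
        print("%s: %s" % (path, problem))
```

An empty result means every listed directory exists and matches. Running it on each node gives a quick way to rule out permission drift as the cause.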

Answers to some other questions I've been asked over IM:

* Nodes can show failures with one run, be perfectly clean the next, then show failures during
a third run.  Some nodes will throw failures during all three.
* This problem is reflected in both map tasks and reduce tasks.
* The dir permissions really are the same across all dirs and all nodes. :)
* I have not tried LTC because my test grid is not configured to support it yet.
* I've been testing the Apache releases with no custom patches other than including the LZO
bits.
* The number of failures per run is wildly inconsistent.
* Running 203 on the same gear with the same config shows zero failures.  So this is clearly
a result of something added in 204.
* Yes, enough tasks have failed during certain runs that tasktrackers are getting blacklisted
from the job.

I'm currently playing with a debug jar from Owen to try to gather more information.  Part
of the problem is that there clearly isn't enough information on why tasks are failing.  The
tasktracker logs show the symlink error (but see MAPREDUCE-2804).  The child error stack trace:

{code}
java.lang.Throwable: Child Error
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of -1.
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
{code}

is equally unhelpful.

> approx 10% of all tasks fail with DefaultTaskController
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-2846
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2846
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task, task-controller, tasktracker
>    Affects Versions: 0.20.204.0
>            Reporter: Allen Wittenauer
>            Priority: Blocker
>
> After upgrading our test 0.20.203 grid to 0.20.204-rc2, we ran terasort to verify operation.
> While the job completed successfully, approx 10% of the tasks failed with task runner execution
> errors and the inability to create symlinks for attempt logs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
