hadoop-common-dev mailing list archives

From "Billy Pearson (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-5591) mapred.jobtracker.retirejob.interval killing long running reduce task
Date Sun, 29 Mar 2009 05:10:51 GMT
mapred.jobtracker.retirejob.interval killing long running reduce task

                 Key: HADOOP-5591
                 URL: https://issues.apache.org/jira/browse/HADOOP-5591
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.19.2
         Environment: 0.19.2-dev, r753365 
            Reporter: Billy Pearson

I have long-running jobs, 30-50 hours each, that I run from time to time. I noticed the reduce
tasks getting a WARN Child Error and failing every 24 hours while in the Shuffle stage.
Per a suggestion on the user list, I changed mapred.jobtracker.retirejob.interval
from 24 hours to 72 hours, and the problem went away on the next 30-hour job.
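
For reference, a minimal sketch of that override. It assumes the value is given in milliseconds and that the override goes in a site config file such as hadoop-site.xml; neither is confirmed in this report. The class and method names are hypothetical and only print a property fragment.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Hadoop: builds the override described above.
public class RetireIntervalSketch {
    // Property name taken from this report.
    static final String KEY = "mapred.jobtracker.retirejob.interval";

    // Assumption: the JobTracker reads this value in milliseconds.
    static long hoursToMillis(long hours) {
        return TimeUnit.HOURS.toMillis(hours);
    }

    // Renders a <property> fragment in the usual Hadoop site-config shape.
    static String toPropertyXml(String name, long value) {
        return "<property>\n"
             + "  <name>" + name + "</name>\n"
             + "  <value>" + value + "</value>\n"
             + "</property>";
    }

    public static void main(String[] args) {
        // 72 hours, the value that made the problem go away for the 30-hour job.
        System.out.println(toPropertyXml(KEY, hoursToMillis(72)));
    }
}
```

Under the millisecond assumption, 72 hours works out to 259200000 and the 24-hour value to 86400000.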

I have seen a reduce task run for longer than 24 hours, but only if it does not stay in the
Shuffle stage or the Sort stage for longer than 24 hours.
I have seen the same error from failed tasks that remain in the Shuffle or Sort stage for longer
than 24 hours.

The error I get from the JobTracker GUI is this:
java.io.IOException: Task process exit with nonzero status of 255.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

The error I get in the TaskTracker logs is this:
2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 Child Error

Then cleanup happens and the reduce task is launched again for another attempt.

I am not 100% sure what the setting mapred.jobtracker.retirejob.interval does, but I would
not think any setting should kill an active, non-idle task in the Sort or Shuffle stage.
Also, someone on the list asked whether my maps were long running too. They are not: the
average completion time is 4 minutes per map.

Also, mapred.jobtracker.retirejob.interval is not in the default config, but the code
reads it from the configuration when setting its value.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
