hadoop-common-dev mailing list archives

From "Billy Pearson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5591) mapred.jobtracker.retirejob.interval killing long running reduce task
Date Mon, 30 Mar 2009 14:14:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693780#action_12693780 ]

Billy Pearson commented on HADOOP-5591:

Thanks for that. I forgot that I did change that setting too. Should I create a new issue
or edit this one?

> mapred.jobtracker.retirejob.interval killing long running reduce task
> ---------------------------------------------------------------------
>                 Key: HADOOP-5591
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5591
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.2
>         Environment: 0.19.2-dev, r753365 
>            Reporter: Billy Pearson
> I have long-running jobs (30-50 hours) that I run from time to time. I noticed the
> reduce tasks getting a WARN child error and failing every 24 hours while in the Shuffle stage.
> Following a suggestion on the user list, I changed mapred.jobtracker.retirejob.interval
> from 24 hours to 72, and the problem went away on the next 30-hour job.
> I have seen a reduce task run for longer than 24 hours, but only when it does not stay in
> the Shuffle stage or the Sort stage for longer than 24 hours.
> I have seen the same error from failed tasks that remain in the Shuffle or Sort stage for
> longer than 24 hours.
> The error I get from the JobTracker GUI is this:
> java.io.IOException: Task process exit with nonzero status of 255.
>  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> The error I get in the TaskTracker logs is this:
> 2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner: 
> attempt_200903212204_0005_r_000001_1 Child Error
> Then cleanup happens and the reduce task is launched again to retry.
> I am not 100% sure what the setting mapred.jobtracker.retirejob.interval does, but I would
> not think any setting should kill an active (not idle) Sort or Shuffle task.
> Also, someone on the list asked whether my maps were long running too. They are not;
> they average 4 minutes completion time per map.
> Also, mapred.jobtracker.retirejob.interval is not in the default config, but the code looks
> for it there when setting the value.
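
For reference, a minimal sketch of the workaround described in the report above (raising the interval from 24 to 72 hours). Two things here are assumptions, not taken from the report: that the override belongs in the site configuration file (hadoop-site.xml on a 0.19-era install), and that the value is in milliseconds, so 72 hours is 72 * 60 * 60 * 1000 = 259200000.

```xml
<!-- Assumed location: conf/hadoop-site.xml on the JobTracker node. -->
<!-- Assumed unit: milliseconds (72 h = 259200000 ms). -->
<property>
  <name>mapred.jobtracker.retirejob.interval</name>
  <value>259200000</value>
</property>
```

The JobTracker would likely need a restart to pick up the change, and this only sidesteps the problem for jobs shorter than the new interval.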

