hadoop-common-dev mailing list archives

From "Amareshwari Sriramadasu (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5591) mapred.jobtracker.retirejob.interval killing long running reduce task
Date Mon, 30 Mar 2009 09:34:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693703#action_12693703 ]

Amareshwari Sriramadasu commented on HADOOP-5591:
-------------------------------------------------

I don't think mapred.jobtracker.retirejob.interval is killing the long-running reduce task, since
mapred.jobtracker.retirejob.interval retires only completed jobs. It is the mapred.userlog.retain.hours
setting, whose default is 24 hours, that is killing the long-running task. I could reproduce
the scenario you described by setting mapred.userlog.retain.hours to 1 hour and running
tasks for more than an hour.
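
For illustration, the reproduction described above could be configured roughly as follows.
This is only a sketch: the property name and the value of 1 hour come from the comment, while
the file name (hadoop-site.xml, the single site file used by 0.19-era releases) is an
assumption about the cluster setup.

```xml
<!-- hadoop-site.xml (assumed location for a 0.19.x cluster) -->
<configuration>
  <property>
    <!-- Value is in hours; the default is 24. Set to 1 here only to reproduce the failure quickly. -->
    <name>mapred.userlog.retain.hours</name>
    <value>1</value>
  </property>
</configuration>
```

With this in place, a task that runs for more than an hour should presumably fail the same
way the 24-hour tasks in the report did.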

> mapred.jobtracker.retirejob.interval killing long running reduce task
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-5591
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5591
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.2
>         Environment: 0.19.2-dev, r753365 
>            Reporter: Billy Pearson
>
> I have long-running jobs, taking 30-50 hours, that I run from time to time. I noticed the
> reduce tasks getting a WARN Child Error and failing every 24 hours while in the Shuffle stage.
> Following a suggestion on the user list, I changed mapred.jobtracker.retirejob.interval
> from 24 hours to 72, and the problem went away on the next 30-hour job.
> I have seen a reduce task run for longer than 24 hours, but only if it does not stay in
> the Shuffle stage or the Sort stage for longer than 24 hours.
> I have seen the same error from failed tasks that remain in the Shuffle or Sort stage for
> longer than 24 hours.
> The error I get from the jobtracker GUI is this:
> java.io.IOException: Task process exit with nonzero status of 255.
>  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> The error I get in the tasktracker logs is this:
> 2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner: 
> attempt_200903212204_0005_r_000001_1 Child Error
> Then cleanup happens and a new reduce task attempt is launched.
> I am not 100% sure what the mapred.jobtracker.retirejob.interval setting does, but I would
> not think any setting should kill an active, non-idle Sort or Shuffle task.
> Someone on the list also asked whether my maps were long-running. They are not;
> they average 4 minutes to complete per map.
> Also, mapred.jobtracker.retirejob.interval is not in the default config, but the code looks
> for it there when setting the value.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

