hadoop-common-user mailing list archives

From "Billy Pearson" <sa...@pearsonwholesale.com>
Subject Re: reduce task failing after 24 hours waiting
Date Fri, 27 Mar 2009 05:44:00 GMT
mapred.jobtracker.retirejob.interval
is not in the default config.

Shouldn't it be listed there?

Billy
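
For reference, a minimal sketch of how the two properties named in this thread might be overridden in the site configuration. The file name (hadoop-site.xml, as used by releases of this era) and the 72-hour values are assumptions chosen to cover a 50-60 hour job; they are not settings confirmed by anyone in the thread. mapred.jobtracker.retirejob.interval is expressed in milliseconds and mapred.userlog.retain.hours in hours, so the 24-hour defaults correspond to 86400000 and 24.

```xml
<!-- hadoop-site.xml (assumed file name): hypothetical overrides, not tested values -->
<configuration>
  <property>
    <!-- How long a completed job is kept before the JobTracker retires it. -->
    <!-- 72 hours * 3600 s * 1000 ms = 259200000 ms (default is 24 hours).  -->
    <name>mapred.jobtracker.retirejob.interval</name>
    <value>259200000</value>
  </property>
  <property>
    <!-- How many hours user logs are kept before being discarded (default 24). -->
    <name>mapred.userlog.retain.hours</name>
    <value>72</value>
  </property>
</configuration>
```

The affected daemons would most likely need a restart to pick up the change; whether this actually stops the 24-hour failures is still the open question in the quoted messages below.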



"Amar Kamat" <amarrk@yahoo-inc.com> wrote in 
message news:49CAFF11.8070400@yahoo-inc.com...
> Amar Kamat wrote:
>> Amareshwari Sriramadasu wrote:
>>> Set mapred.jobtracker.retirejob.interval
>> This is used to retire completed jobs.
>>> and mapred.userlog.retain.hours to a higher value.
>> This is used to discard user logs.
> As Amareshwari pointed out, this might be the cause. Can you increase this 
> value and try?
> Amar
>>> By default, their values are 24 hours. These might be the reason for 
>>> failure, though I'm not sure.
>>>
>>> Thanks
>>> Amareshwari
>>>
>>> Billy Pearson wrote:
>>>> I am seeing on one of my long-running jobs (about 50-60 hours) that after 
>>>> 24 hours all
>>>> active reduce tasks fail with the error message
>>>>
>>>> java.io.IOException: Task process exit with nonzero status of 255.
>>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>>>
>>>> Is there something in the config that I can change to stop this?
>>>>
>>>> Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
>>>> This wastes a lot of resources downloading the map outputs and merging them 
>>>> again.
>> What is the state of the reducer (copy or sort)? Check 
>> jobtracker/task-tracker logs to see what is the state of these reducers 
>> and whether it issued a kill signal. Either jobtracker/tasktracker is 
>> issuing a kill signal or the reducers are committing suicide. Were there 
>> any failures on the reducer side while pulling the map output? Also what 
>> is the nature of the job? How fast do the maps finish?
>> Amar
>>>>
>>>> Billy
>>>>
>>>>
>>>
>>
>
> 


