hadoop-mapreduce-user mailing list archives

From Hemanth Yamijala <yhema...@thoughtworks.com>
Subject Re: JobCache directory cleanup
Date Thu, 10 Jan 2013 12:37:06 GMT
Hi,

On Thu, Jan 10, 2013 at 5:17 PM, Ivan Tretyakov <itretyakov@griddynamics.com
> wrote:

> Thanks for replies!
>
> Hemanth,
> I could see the following exception in the TaskTracker log:
> https://issues.apache.org/jira/browse/MAPREDUCE-5
> But I'm not sure if it is related to this issue.
>
> > Now, when a job completes, the directories under the jobCache must get
> automatically cleaned up. However, it doesn't look like this is happening in
> your case.
>
> So, if I have no running jobs, the jobcache directory should be empty. Is
> that correct?
>
>
That is correct. I just verified this with my Hadoop 1.0.2 installation.
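For anyone who wants to run the same check on their own TaskTracker nodes, here is a minimal sketch. The `/data?` paths are taken from the thread and are only illustrative; adjust them to your `mapred.local.dir` setting.

```shell
# Count leftover job directories under each TaskTracker jobcache.
# With no jobs running, every count (and the total) should be 0.
total=0
for d in /data?/mapred/local/taskTracker/*/jobcache; do
  [ -d "$d" ] || continue                      # skip non-matching globs
  count=$(find "$d" -mindepth 1 -maxdepth 1 -type d | wc -l)
  echo "$d: $count job dir(s)"
  total=$((total + count))
done
echo "total leftover job dirs: $total"
```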

Thanks
Hemanth


>
>
> On Thu, Jan 10, 2013 at 8:18 AM, Hemanth Yamijala <
> yhemanth@thoughtworks.com> wrote:
>
>> Hi,
>>
>> The directory name you have provided is /data?/mapred/local/taskTracker/persona/jobcache/.
>> This directory is used by the TaskTracker (slave) daemons to localize job
>> files when the tasks are run on the slaves.
>>
>> Hence, I don't think this is related to the parameter "
>> mapreduce.jobtracker.retiredjobs.cache.size", which is a parameter
>> related to the jobtracker process.
>>
>> Now, when a job completes, the directories under the jobCache must get
>> automatically cleaned up. However, it doesn't look like this is happening in
>> your case.
>>
>> Could you please look at the logs of the tasktracker machine where this
>> has gotten filled up to see if there are any errors that could give clues?
>> Also, since this is a CDH release, it could be a problem specific to that
>> distribution, and reaching out on the CDH mailing lists may also help.
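As a reference point, a sketch of what that log inspection might look like. The log location and the search terms are assumptions, not from the thread; in Hadoop 1.x MRv1, `CleanupQueue` is the class that asynchronously deletes job directories, so errors around it are a reasonable thing to look for.

```shell
# Search TaskTracker logs for errors around jobcache cleanup.
# The log path is an assumption; on CDH it is often under
# /var/log/hadoop-0.20-mapreduce/. The `|| true` keeps the command
# from failing when no log files match the glob.
grep -iE "jobcache|CleanupQueue|error deleting" \
  /var/log/hadoop*/hadoop-*tasktracker*.log* 2>/dev/null || true
```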
>>
>> Thanks
>> hemanth
>>
>> On Wed, Jan 9, 2013 at 8:11 PM, Ivan Tretyakov <
>> itretyakov@griddynamics.com> wrote:
>>
>>> Hello!
>>>
>>> I've found that the jobcache directory has become very large on our
>>> cluster, e.g.:
>>>
>>> # du -sh /data?/mapred/local/taskTracker/user/jobcache
>>> 465G    /data1/mapred/local/taskTracker/user/jobcache
>>> 464G    /data2/mapred/local/taskTracker/user/jobcache
>>> 454G    /data3/mapred/local/taskTracker/user/jobcache
>>>
>>> And it stores information for about 100 jobs:
>>>
>>> # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/  | sort | uniq
>>> | wc -l
>>>
>>> I've found that there is the following parameter:
>>>
>>> <property>
>>>   <name>mapreduce.jobtracker.retiredjobs.cache.size</name>
>>>   <value>1000</value>
>>>   <description>The number of retired job status to keep in the cache.
>>>   </description>
>>> </property>
>>>
>>> So, if I got it right, it is intended to control the job cache size by
>>> limiting the number of jobs to keep cache entries for.
>>>
>>> Also, I've seen that some Hadoop users use a cron-based approach to clean
>>> up the jobcache:
>>> http://grokbase.com/t/hadoop/common-user/102ax9bze1/cleaning-jobcache-manually
>>>  (
>>> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3C99484d561002100143s4404df98qead8f2cf687a76d0@mail.gmail.com%3E
>>> )
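A minimal sketch of the cron-style cleanup mentioned above. The directory layout and the 2-day retention window are assumptions, not from the linked threads, and deleting directories of still-running jobs would break those jobs, so a real script must first exclude active job IDs and should be verified with `-print` before switching to deletion.

```shell
# List (not delete) jobcache directories untouched for more than 2 days.
# The /data? layout and the retention window are assumptions.
for base in /data?/mapred/local/taskTracker/*/jobcache; do
  [ -d "$base" ] || continue                   # skip non-matching globs
  find "$base" -mindepth 1 -maxdepth 1 -type d -mtime +2 -print
done
# Once the listing looks right, change -print to: -exec rm -rf {} +
```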
>>>
>>> Are there other approaches to control the jobcache size?
>>> What is the correct way to do it?
>>>
>>> Thanks in advance!
>>>
>>> P.S. We are using CDH 4.1.1.
>>>
>>> --
>>> Best Regards
>>> Ivan Tretyakov
>>>
>>> Deployment Engineer
>>> Grid Dynamics
>>> +7 812 640 38 76
>>> Skype: ivan.tretyakov
>>> www.griddynamics.com
>>> itretyakov@griddynamics.com
>>>
>>
>>
>
>
> --
> Best Regards
> Ivan Tretyakov
>
> Deployment Engineer
> Grid Dynamics
> +7 812 640 38 76
> Skype: ivan.tretyakov
> www.griddynamics.com
> itretyakov@griddynamics.com
>
