hadoop-mapreduce-user mailing list archives

From Robert Molina <rmol...@hortonworks.com>
Subject Re: JobCache directory cleanup
Date Wed, 09 Jan 2013 21:57:29 GMT
Hi Ivan,
Regarding the mapreduce.jobtracker.retiredjobs.cache.size property: the
jobtracker keeps information about recently completed jobs in memory.
There are two limits on this - a retention time, which is a single day by
default, and a cap on the number of completed jobs kept per user. Once
either limit is hit, the job is moved into the retired job cache. Both
caches are used for the UI as well as to answer RPC requests from the
client, like getJobStatus or getCounters. Once a job ages out of the
retired job cache, it is no longer available via RPC.
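
For reference, the limits described above correspond to configuration
properties that can be overridden in mapred-site.xml. A sketch of the
relevant entries - the MR1-style property names and default values shown
here are from memory, so verify them against the mapred-default.xml that
ships with your CDH version:

```xml
<!-- retire completed jobs from jobtracker memory after 24h (milliseconds) -->
<property>
  <name>mapred.jobtracker.retirejob.interval</name>
  <value>86400000</value>
</property>
<!-- per-user cap on completed jobs kept in jobtracker memory -->
<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>100</value>
</property>
<!-- retired job statuses kept around for the UI and RPC after retirement -->
<property>
  <name>mapreduce.jobtracker.retiredjobs.cache.size</name>
  <value>1000</value>
</property>
```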

Hope that helps.
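
On the cron cleanup mentioned further down the thread: a minimal sketch,
assuming the directory layout from Ivan's mail. The helper name
prune_jobcache, the 7-day window, and the data paths are all assumptions,
not anything from the linked threads. Be careful: deleting the jobcache of
a still-running job will fail its tasks, so keep the window well beyond
your longest job runtime.

```shell
#!/bin/sh
# Sketch only: prune per-job jobcache directories whose mtime is older
# than N days. Run it as the mapred user (e.g. from a daily cron entry).
prune_jobcache() {
  days="$1"; shift
  for base in "$@"; do
    # each per-user jobcache dir holds one subdirectory per job id;
    # -mindepth/-maxdepth 1 selects exactly those job directories
    find "$base"/mapred/local/taskTracker/*/jobcache \
         -mindepth 1 -maxdepth 1 -type d -mtime +"$days" \
         -exec rm -rf {} + 2>/dev/null
  done
}

# example invocation (paths are assumptions, matching Ivan's layout):
# prune_jobcache 7 /data1 /data2 /data3
```

A crontab line such as `0 3 * * * /usr/local/sbin/prune_jobcache.sh` would
then run it nightly, but again, whether this is safe depends entirely on
the retention window versus your job durations.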

Regards,
Robert

On Wed, Jan 9, 2013 at 7:22 AM, Ivan Tretyakov
<itretyakov@griddynamics.com>wrote:

> Thanks a lot Alexander!
>
> What is mapreduce.jobtracker.retiredjobs.cache.size for?
> Is the cron approach safe for Hadoop? Is that the only way at the moment?
>
>
> On Wed, Jan 9, 2013 at 6:50 PM, Alexander Alten-Lorenz <
> wget.null@gmail.com> wrote:
>
>> Hi,
>>
>> By default (and not configurable) the logs are persisted for 30 days.
>> This will be configurable in the future (
>> https://issues.apache.org/jira/browse/MAPREDUCE-4643).
>>
>> - Alex
>>
>> On Jan 9, 2013, at 3:41 PM, Ivan Tretyakov <itretyakov@griddynamics.com>
>> wrote:
>>
>> > Hello!
>> >
>> > I've found that jobcache directory became very large on our cluster,
>> e.g.:
>> >
>> > # du -sh /data?/mapred/local/taskTracker/user/jobcache
>> > 465G    /data1/mapred/local/taskTracker/user/jobcache
>> > 464G    /data2/mapred/local/taskTracker/user/jobcache
>> > 454G    /data3/mapred/local/taskTracker/user/jobcache
>> >
>> > And it stores information for about 100 jobs:
>> >
>> > # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/ | sort | uniq | wc -l
>> >
>> > I've found that there is following parameter:
>> >
>> > <property>
>> >  <name>mapreduce.jobtracker.retiredjobs.cache.size</name>
>> >  <value>1000</value>
>> >  <description>The number of retired job status to keep in the cache.
>> >  </description>
>> > </property>
>> >
>> > So, if I got it right, it is intended to control the job cache size by
>> > limiting the number of jobs to keep cache for.
>> >
>> > Also, I've seen that some Hadoop users use a cron approach to clean up
>> > the jobcache:
>> >
>> http://grokbase.com/t/hadoop/common-user/102ax9bze1/cleaning-jobcache-manually
>> > (
>> >
>> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3C99484d561002100143s4404df98qead8f2cf687a76d0@mail.gmail.com%3E
>> > )
>> >
>> > Are there other approaches to control jobcache size?
>> > What is more correct way to do it?
>> >
>> > Thanks in advance!
>> >
>> > P.S. We are using CDH 4.1.1.
>> >
>> > --
>> > Best Regards
>> > Ivan Tretyakov
>> >
>> > Deployment Engineer
>> > Grid Dynamics
>> > +7 812 640 38 76
>> > Skype: ivan.tretyakov
>> > www.griddynamics.com
>> > itretyakov@griddynamics.com
>>
>> --
>> Alexander Alten-Lorenz
>> http://mapredit.blogspot.com
>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>
>>
>
>
> --
> Best Regards
> Ivan Tretyakov
>
> Deployment Engineer
> Grid Dynamics
> +7 812 640 38 76
> Skype: ivan.tretyakov
> www.griddynamics.com
> itretyakov@griddynamics.com
>
