hadoop-user mailing list archives

From "Artem Ervits" <are9...@nyp.org>
Subject Re: JobCache directory cleanup
Date Thu, 10 Jan 2013 14:38:52 GMT
As soon as a job completes, its jobcache should be cleared. Check the mapred.local.dir setting
in your mapred-site.xml and make sure the job cleanup step succeeds in the web UI. Setting your
job's intermediate output setting to true will keep the jobcache folder smaller.
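
To make the mapred.local.dir check above concrete, here is a minimal Python sketch that reads a property from a Hadoop-style configuration file. The function names are hypothetical, and the only assumption about the file is the usual <property>/<name>/<value> layout, with mapred.local.dir holding a comma-separated list of directories.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(config_path, name):
    """Return the <value> text of the named <property>, or None if absent."""
    root = ET.parse(config_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def local_dirs(config_path):
    """Split the comma-separated mapred.local.dir value into a list of paths."""
    value = read_hadoop_property(config_path, "mapred.local.dir")
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]
```

Calling local_dirs() on the mapred-site.xml used by the TaskTracker (its location depends on the installation) should list the parent directories under which taskTracker/<user>/jobcache lives.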



Artem Ervits
Data Analyst
New York Presbyterian Hospital

From: Hemanth Yamijala [mailto:yhemanth@thoughtworks.com]
Sent: Thursday, January 10, 2013 07:37 AM
To: user@hadoop.apache.org <user@hadoop.apache.org>
Subject: Re: JobCache directory cleanup

Hi,

On Thu, Jan 10, 2013 at 5:17 PM, Ivan Tretyakov <itretyakov@griddynamics.com> wrote:
Thanks for the replies!

Hemanth,
I can see the following exception in the TaskTracker log: https://issues.apache.org/jira/browse/MAPREDUCE-5
But I'm not sure if it is related to this issue.

> Now, when a job completes, the directories under the jobCache must get automatically
cleaned up. However it doesn't look like this is happening in your case.

So, if I have no running jobs, the jobcache directory should be empty. Is that correct?


That is correct. I just verified it with my Hadoop 1.0.2 version.

Thanks
Hemanth



On Thu, Jan 10, 2013 at 8:18 AM, Hemanth Yamijala <yhemanth@thoughtworks.com> wrote:
Hi,

The directory name you have provided is /data?/mapred/local/taskTracker/persona/jobcache/.
This directory is used by the TaskTracker (slave) daemons to localize job files when the tasks
are run on the slaves.

Hence, I don't think this is related to the parameter "mapreduce.jobtracker.retiredjobs.cache.size",
which is a parameter related to the jobtracker process.

Now, when a job completes, the directories under the jobCache must get automatically cleaned
up. However it doesn't look like this is happening in your case.

Could you please look at the logs of the TaskTracker machine where this has filled up, to see
if there are any errors that could give clues?
Also, since this is a CDH release, it could be a problem specific to that, so reaching out on
the CDH mailing lists may also help.
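
As a rough aid for the log check suggested above, the following Python sketch lists lines in a log file that contain common error markers. It assumes nothing about the TaskTracker log format beyond plain text; the function name and the default keywords are illustrative choices, not anything defined by Hadoop.

```python
def find_error_lines(log_path, keywords=("ERROR", "Exception")):
    """Return (line_number, text) pairs for lines containing any keyword."""
    hits = []
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(log_path, errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if any(keyword in line for keyword in keywords):
                hits.append((number, line.rstrip("\n")))
    return hits
```

Passing extra keywords such as "jobcache" narrows the output to lines that mention the directory in question.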

Thanks
hemanth

On Wed, Jan 9, 2013 at 8:11 PM, Ivan Tretyakov <itretyakov@griddynamics.com> wrote:
Hello!

I've found that the jobcache directory has become very large on our cluster, e.g.:

# du -sh /data?/mapred/local/taskTracker/user/jobcache
465G    /data1/mapred/local/taskTracker/user/jobcache
464G    /data2/mapred/local/taskTracker/user/jobcache
454G    /data3/mapred/local/taskTracker/user/jobcache

And it stores information for about 100 jobs:

# ls -1 /data?/mapred/local/taskTracker/persona/jobcache/  | sort | uniq | wc -l
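
The two commands above can be combined into one per-job summary. This is a minimal Python sketch with hypothetical function names; it sums apparent file sizes, so totals will differ somewhat from what du reports for on-disk usage.

```python
import glob
import os


def dir_size_bytes(path):
    """Sum the sizes of files under path, skipping ones that vanish mid-walk."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file was removed while walking
    return total


def summarize_jobcache(pattern):
    """Map each job directory name to its total bytes across all matching jobcache dirs."""
    usage = {}
    for jobcache in glob.glob(pattern):
        if not os.path.isdir(jobcache):
            continue
        for entry in os.listdir(jobcache):
            job_dir = os.path.join(jobcache, entry)
            if os.path.isdir(job_dir):
                usage[entry] = usage.get(entry, 0) + dir_size_bytes(job_dir)
    return usage
```

The pattern argument takes the same glob as the shell commands (for example a path containing data?), len() of the result gives the number of distinct jobs, and sorting it by value shows which jobs hold the most space.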

I've found that there is the following parameter:

<property>
  <name>mapreduce.jobtracker.retiredjobs.cache.size</name>
  <value>1000</value>
  <description>The number of retired job status to keep in the cache.
  </description>
</property>

So, if I understand it correctly, it is intended to control the job cache size by limiting the
number of jobs the cache is kept for.

Also, I've seen that some Hadoop users use a cron approach to clean up the jobcache: http://grokbase.com/t/hadoop/common-user/102ax9bze1/cleaning-jobcache-manually
(http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3C99484d561002100143s4404df98qead8f2cf687a76d0@mail.gmail.com%3E)
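
A cron-style cleanup of this kind could start from a dry run like the Python sketch below, which only lists candidate directories and deletes nothing. The function name and the age threshold are hypothetical and not taken from the linked thread. Note the assumption that directory modification time is a usable staleness signal: it changes only when entries directly inside the directory are added or removed, so a long-running job can look old, and the result should be checked against running jobs before anything is removed.

```python
import os
import time


def stale_job_dirs(jobcache, max_age_days, now=None):
    """Return job directories under jobcache whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in sorted(os.listdir(jobcache)):
        path = os.path.join(jobcache, entry)
        # Only directories count; stray files in jobcache are ignored.
        if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
            stale.append(path)
    return stale
```

Printing the returned paths from a scheduled job gives a review list; the actual removal step is deliberately left out of the sketch.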

Are there other approaches to control jobcache size?
What is the more correct way to do it?

Thanks in advance!

P.S. We are using CDH 4.1.1.

--
Best Regards
Ivan Tretyakov

Deployment Engineer
Grid Dynamics
+7 812 640 38 76
Skype: ivan.tretyakov
www.griddynamics.com
itretyakov@griddynamics.com







--------------------

This electronic message is intended to be for the use only of the named recipient, and may
contain information that is confidential or privileged.  If you are not the intended recipient,
you are hereby notified that any disclosure, copying, distribution or use of the contents
of this message is strictly prohibited.  If you have received this message in error or are
not the named recipient, please notify us immediately by contacting the sender at the electronic
mail address noted above, and delete and destroy all copies of this message.  Thank you.



