hadoop-mapreduce-user mailing list archives

From "agile.java@gmail.com" <agile.j...@gmail.com>
Subject Re: Jobtracker memory issues due to FileSystem$Cache
Date Sat, 27 Apr 2013 03:07:20 GMT
We have hit the same problem. I haven't found the cause yet; I'm still debugging it.


On Wed, Apr 17, 2013 at 11:14 PM, Marcin Mejran <marcin.mejran@hooklogic.com> wrote:

> In case anyone is wondering, I tracked this down to a race condition in
> JobInProgress or failure to clean up FileSystems in CleanupQueue (depending
> on how you look at it).
>
> FileSystem.closeAllForUGI is what keeps the cache from leaking memory,
> but the calls are not confined to a single thread: JobInProgress calls
> closeAllForUGI on a UGI that was also passed to the CleanupQueue thread.
> If JobInProgress calls closeAllForUGI before CleanupQueue calls
> FileSystem.get with that UGI, there is a leak. Since CleanupQueue never
> calls closeAllForUGI itself, the filesystem stays cached indefinitely.
>
> Setting, for example, keep.failed.task.files=true or
> keep.task.files.pattern=<dummy text> prevents CleanupQueue from getting
> called which seems to solve my issues. You get junk left in .staging but
> that can be dealt with.
>
> -Marcin
>
> *From:* Marcin Mejran [mailto:marcin.mejran@hooklogic.com]
> *Sent:* Tuesday, April 16, 2013 1:47 PM
> *To:* user@hadoop.apache.org
> *Subject:* Jobtracker memory issues due to FileSystem$Cache
>
> We've recently run into JobTracker memory issues on our new Hadoop
> cluster. A heap dump shows thousands of DistributedFileSystem instances
> held in FileSystem$Cache, a bit over one for each job run on the cluster;
> the JobConf objects they reference support this view. I believe these are
> created when the .staging directories get cleaned up, but I may be wrong
> on that.
>
> From what I can tell in the dump, the username (probably not ugi, hard to
> tell), scheme and authority parts of the Cache$Key are the same across
> multiple objects in FileSystem$Cache. I can only assume that the
> UserGroupInformation piece differs somehow every time it's created.
>
> We're using CDH4.2, MR1, CentOS 6.3 and Java 1.6_31. Kerberos, LDAP and so
> on are not enabled.
>
> Is there any known reason for this type of behavior?
>
> Thanks,
>
> -Marcin
>
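
For anyone else debugging this, here is a minimal sketch of the ordering described in the quoted message, in plain Java with no Hadoop dependency. Everything in it is a hypothetical stand-in: Ugi, Key, Cache and runJobs are made-up names, "namenode:8020" is a placeholder authority, and the two threads are reduced to two fixed call orders. It assumes the cache key compares the UGI by identity, which would be consistent with the observation that keys with the same username, scheme and authority still end up as separate entries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class FsCacheLeakSketch {

    /** Hypothetical stand-in for a UGI; no equals/hashCode, so compared by identity. */
    static final class Ugi {
        final String user;
        Ugi(String user) { this.user = user; }
    }

    /** Hypothetical cache key: scheme + authority + UGI identity (an assumption). */
    static final class Key {
        final String scheme;
        final String authority;
        final Ugi ugi;

        Key(String scheme, String authority, Ugi ugi) {
            this.scheme = scheme;
            this.authority = authority;
            this.ugi = ugi;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            return scheme.equals(other.scheme)
                    && authority.equals(other.authority)
                    && ugi == other.ugi; // identity, not user name
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority) + System.identityHashCode(ugi);
        }
    }

    /** Hypothetical stand-in for the filesystem cache. */
    static final class Cache {
        private final Map<Key, Object> map = new HashMap<>();

        synchronized Object get(String scheme, String authority, Ugi ugi) {
            // Creates and caches an entry on a miss, like a cached filesystem lookup.
            return map.computeIfAbsent(new Key(scheme, authority, ugi), k -> new Object());
        }

        synchronized void closeAllForUgi(Ugi ugi) {
            // Drops every entry that was created for this exact UGI object.
            map.keySet().removeIf(k -> k.ugi == ugi);
        }

        synchronized int size() { return map.size(); }
    }

    /** Runs the given number of jobs and returns how many entries remain cached. */
    static int runJobs(int jobs, boolean cleanupRunsAfterClose) {
        Cache cache = new Cache();
        for (int i = 0; i < jobs; i++) {
            Ugi ugi = new Ugi("alice");               // a fresh UGI object per job
            cache.get("hdfs", "namenode:8020", ugi);  // job-side lookup
            if (cleanupRunsAfterClose) {
                cache.closeAllForUgi(ugi);               // job side closes first
                cache.get("hdfs", "namenode:8020", ugi); // late cleanup re-creates the entry
            } else {
                cache.get("hdfs", "namenode:8020", ugi); // cleanup reuses the cached entry
                cache.closeAllForUgi(ugi);               // close removes it afterwards
            }
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println("cleanup before close: " + runJobs(1000, false));
        System.out.println("cleanup after close:  " + runJobs(1000, true));
    }
}
```

In this model, the "cleanup before close" order leaves nothing behind, while the "cleanup after close" order leaves one entry per job, because nothing ever closes the entry that the late lookup re-creates.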



-- 
d0ngd0ng
