hadoop-user mailing list archives

From "agile.java@gmail.com" <agile.j...@gmail.com>
Subject Re: Jobtracker memory issues due to FileSystem$Cache
Date Sat, 27 Apr 2013 10:40:06 GMT
The reason you described is correct; I verified it in our environment. Thank
you very much.
I tried setting keep.failed.task.files=true, but all jobs failed due to
MAPREDUCE-5047 <https://issues.apache.org/jira/browse/MAPREDUCE-5047>, because
our Hadoop cluster has Kerberos turned on. :(
The only thing we can do for now is to restart the jobtracker periodically
until the CDH 4.3 release.
Do you know of any other solution?
Thanks.

On Sat, Apr 27, 2013 at 11:07 AM, agile.java@gmail.com <agile.java@gmail.com
> wrote:

> We are hitting the same problem. I haven't found the cause yet; I'm debugging it.
>
>
> On Wed, Apr 17, 2013 at 11:14 PM, Marcin Mejran <
> marcin.mejran@hooklogic.com> wrote:
>
>> In case anyone is wondering, I tracked this down to a race condition in
>> JobInProgress, or a failure to clean up FileSystems in CleanupQueue,
>> depending on how you look at it.
>>
>> FileSystem.closeAllForUGI is what keeps the cache from leaking memory,
>> but it is not called from a single thread. JobInProgress calls
>> closeAllForUGI on a UGI that was also passed to the CleanupQueue thread.
>> If closeAllForUGI is called by JobInProgress before CleanupQueue calls
>> FileSystem.get with that UGI, there is a leak: since CleanupQueue never
>> calls closeAllForUGI itself, the FileSystem it creates stays cached
>> perpetually.
>>
>>
>> Setting, for example, keep.failed.task.files=true or
>> keep.task.files.pattern=<dummy text> prevents CleanupQueue from being
>> called, which seems to solve my issue. You get junk left in .staging, but
>> that can be dealt with.
>>
>> -Marcin
>>
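For reference, the first of those workarounds can be expressed as a mapred-site.xml (MR1) snippet; the property name comes from the message above, and whether to set it cluster-wide or per-job is a deployment choice (note also the Kerberos caveat, MAPREDUCE-5047, from the reply at the top of this thread):

```xml
<!-- mapred-site.xml (MR1): keep per-task files so CleanupQueue is not invoked.
     Side effect: .staging directories accumulate and must be purged separately. -->
<property>
  <name>keep.failed.task.files</name>
  <value>true</value>
</property>
```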
>>
>> *From:* Marcin Mejran [mailto:marcin.mejran@hooklogic.com]
>> *Sent:* Tuesday, April 16, 2013 1:47 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Jobtracker memory issues due to FileSystem$Cache
>>
>>
>> We’ve recently run into jobtracker memory issues on our new Hadoop
>> cluster. A heap dump shows thousands of copies of DistributedFileSystem
>> kept in FileSystem$Cache, slightly more than one for each job run on the
>> cluster, and their JobConf objects support this view. I believe these are
>> created when the .staging directories get cleaned up, but I may be wrong
>> about that.
>>
>>
>> From what I can tell in the dump, the username (probably not the UGI,
>> hard to tell), scheme, and authority parts of the Cache$Key are the same
>> across multiple objects in FileSystem$Cache. I can only assume that the
>> UserGroupInformation piece differs somehow each time it is created.
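One plausible mechanism for that (an assumption on my part, not confirmed anywhere in this thread) is identity-based equals()/hashCode() on the UGI: every freshly constructed UGI then produces a distinct cache key even for the same user. A minimal sketch with a hypothetical `IdentityUgi` class:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a key class that does NOT override equals()/hashCode()
// inherits Object's identity semantics, so two keys for the same user never
// collide in a HashMap-backed cache.
class IdentityUgi {
    final String user;
    IdentityUgi(String user) { this.user = user; }
    // No equals()/hashCode() override: identity comparison applies.
}

public class KeyDemo {
    public static void main(String[] args) {
        Map<IdentityUgi, String> cache = new HashMap<>();
        cache.put(new IdentityUgi("mapred"), "fs-1");
        cache.put(new IdentityUgi("mapred"), "fs-2");
        // Same username, two distinct entries: the cache only grows.
        System.out.println("entries: " + cache.size());
    }
}
```

If the real Cache$Key behaves this way for the UGI component, each job-cleanup path that constructs a fresh UGI would add a new DistributedFileSystem entry, consistent with the heap dump.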
>>
>>
>> We’re using CDH 4.2, MR1, CentOS 6.3, and Java 1.6_31. Kerberos, LDAP,
>> and so on are not enabled.
>>
>> Is there any known reason for this type of behavior?
>>
>> Thanks,
>>
>> -Marcin
>>
>
>
>
> --
> d0ngd0ng
>



-- 
d0ngd0ng
