hadoop-mapreduce-user mailing list archives

From Marcin Mejran <marcin.mej...@hooklogic.com>
Subject RE: Jobtracker memory issues due to FileSystem$Cache
Date Wed, 17 Apr 2013 15:14:18 GMT
In case anyone is wondering, I tracked this down to a race condition in JobInProgress or failure
to clean up FileSystems in CleanupQueue (depending on how you look at it).

FileSystem.closeAllForUGI is what keeps the cache from leaking memory, but it is not called
from every thread that uses a given UGI. JobInProgress calls closeAllForUGI on a UGI that was
also passed to the CleanupQueue thread. If JobInProgress calls closeAllForUGI before CleanupQueue
calls FileSystem.get with that UGI, there's a leak: CleanupQueue never calls closeAllForUGI itself,
so the filesystem it creates is left cached perpetually.
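
To make the ordering problem concrete, here is a rough toy model of it in plain Java. It is not
Hadoop's code: the User, Key and Cache classes and the runJobs helper are made up for illustration,
and it assumes that each job gets a fresh user object that is compared by identity (which would
match the heap dump, but I have not confirmed it). The two threads are replayed sequentially so the
result is deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Toy model only: hypothetical stand-ins, not Hadoop's FileSystem or UGI classes.
class FsCacheLeakSketch {

    // Stand-in for a UGI. Assumption: equality is by object identity,
    // so two users with the same name are still different cache owners.
    static final class User {
        final String name;
        User(String name) { this.name = name; }
    }

    // Stand-in for a cache key: scheme + authority + user (user compared by identity).
    static final class Key {
        final String scheme;
        final String authority;
        final User user;
        Key(String scheme, String authority, User user) {
            this.scheme = scheme;
            this.authority = authority;
            this.user = user;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return scheme.equals(k.scheme) && authority.equals(k.authority) && user == k.user;
        }
        @Override public int hashCode() {
            return 31 * Objects.hash(scheme, authority) + System.identityHashCode(user);
        }
    }

    // Stand-in for the filesystem cache.
    static final class Cache {
        private final Map<Key, Object> map = new HashMap<>();
        synchronized Object get(String scheme, String authority, User user) {
            // Creates and caches a new "filesystem" if this user has none yet.
            return map.computeIfAbsent(new Key(scheme, authority, user), k -> new Object());
        }
        synchronized void closeAllForUser(User user) {
            map.keySet().removeIf(k -> k.user == user);
        }
        synchronized int size() { return map.size(); }
    }

    // Replays `jobs` jobs and returns how many cache entries are left behind.
    static int runJobs(int jobs, boolean cleanupRunsAfterClose) {
        Cache cache = new Cache();
        for (int i = 0; i < jobs; i++) {
            User user = new User("alice");             // same name, new object per job
            cache.get("hdfs", "namenode:8020", user);  // job thread uses the filesystem
            if (cleanupRunsAfterClose) {
                cache.closeAllForUser(user);               // job thread closes first
                cache.get("hdfs", "namenode:8020", user);  // cleanup thread re-creates; nobody closes
            } else {
                cache.get("hdfs", "namenode:8020", user);  // cleanup thread reuses cached entry
                cache.closeAllForUser(user);               // job thread closes afterwards
            }
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println("cleanup before close: " + runJobs(1000, false)); // prints 0
        System.out.println("cleanup after close:  " + runJobs(1000, true));  // prints 1000
    }
}
```

In the losing order the cache ends up with one stale entry per job, which is roughly the shape of
what the heap dump showed.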

Setting, for example, keep.failed.task.files=true or keep.task.files.pattern=<dummy text>
prevents CleanupQueue from being called, which seems to solve my issue. You get junk left
in .staging, but that can be dealt with.


From: Marcin Mejran [mailto:marcin.mejran@hooklogic.com]
Sent: Tuesday, April 16, 2013 1:47 PM
To: user@hadoop.apache.org
Subject: Jobtracker memory issues due to FileSystem$Cache

We've recently run into jobtracker memory issues on our new Hadoop cluster. A heap dump shows
that there are thousands of copies of DistributedFileSystem kept in FileSystem$Cache, a bit
over one for each job run on the cluster, and their jobconf objects support this view. I believe
these are created when the .staging directories get cleaned up, but I may be wrong on that.

From what I can tell in the dump, the username (probably not ugi, hard to tell), scheme
and authority parts of the Cache$Key are the same across multiple objects in FileSystem$Cache.
I can only assume that the UserGroupInformation piece differs somehow every time it's created.

We're using CDH4.2, MR1, CentOS 6.3 and Java 1.6_31. Kerberos, LDAP and so on are not enabled.

Is there any known reason for this type of behavior?

