hadoop-common-user mailing list archives

From Joep Rottinghuis <jrottingh...@gmail.com>
Subject Re: NN Memory Jumps every 1 1/2 hours
Date Sat, 22 Dec 2012 17:17:31 GMT
Do your OOMs correlate with the secondary checkpointing?


Sent from my iPhone
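For checking that correlation: on 0.20.x the secondary NameNode's checkpoint cadence is set in core-site.xml. The values below are, to my understanding, the stock defaults; compare them against the actual cluster config and the OOM timestamps:

```xml
<!-- core-site.xml (0.20.x); the values shown are believed to be the defaults -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- seconds between checkpoints -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value> <!-- edits file size (bytes) that also triggers a checkpoint -->
</property>
```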

On Dec 22, 2012, at 7:42 AM, Michael Segel <michael_segel@hotmail.com> wrote:

> Hey Silly question... 
> How long have you had 27 million files? 
> I mean, can you correlate the number of files to the spate of OOMs? 
> Even without problems, I'd say it would be a good idea to upgrade, given the number of code fixes since that release... 
> If you're running anything pre-1.x, going to Java 1.7 wouldn't be a good idea. Having said that... outside of MapR, have any of the distros certified themselves on 1.7 yet? 
> On Dec 22, 2012, at 6:54 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>> I will give this a go. I actually went into JMX and manually triggered a
>> GC; no memory was returned, so I assumed something was leaking.
>> On Fri, Dec 21, 2012 at 11:59 PM, Adam Faris <afaris@linkedin.com> wrote:
>>> I know this will sound odd, but try reducing your heap size. We had an
>>> issue like this where GC kept falling behind and we either ran out of heap
>>> or would be stuck in full GC. By reducing the heap, we forced concurrent
>>> mark sweep to occur, avoiding both full GCs and running out of heap space,
>>> since the JVM would collect objects more frequently.
>>> On Dec 21, 2012, at 8:24 PM, Edward Capriolo <edlinuxguru@gmail.com>
>>> wrote:
>>>> I have an old Hadoop 0.20.2 cluster that had not had any issues for a
>>>> while (which is why I never bothered with an upgrade).
>>>> Suddenly it OOMed last week, and now the OOMs happen periodically. We have
>>>> a fairly large NameNode heap (-Xmx 17GB) and a fairly large FS, about
>>>> 27,000,000 files.
>>>> The strangest thing is that every 1 1/2 hours the NN memory usage
>>>> increases until the heap is full.
>>>> http://imagebin.org/240287
>>>> We tried failing over the NN to another machine. We changed the Java
>>>> version from 1.6_23 to 1.7.0.
>>>> I have set the NameNode logs to DEBUG and ALL, and I have done the same
>>>> with the DataNodes.
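One way to get that verbosity without drowning in output: scope the DEBUG level to the HDFS server packages in log4j.properties rather than raising the root logger. The logger names below match the 0.20.x package layout as I recall it; verify them against your tree:

```properties
# conf/log4j.properties -- DEBUG for the NN and DN packages only,
# leaving the root logger at its existing level.
log4j.logger.org.apache.hadoop.hdfs.server.namenode=DEBUG
log4j.logger.org.apache.hadoop.hdfs.server.datanode=DEBUG
```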
>>>> Secondary NN is running and shipping edits and making new images.
>>>> I am thinking something has corrupted the NN metadata and that after
>>>> enough time it becomes a time bomb, but this is just a total shot in the
>>>> dark. Does anyone have any interesting troubleshooting ideas?
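The 17GB heap can be sanity-checked against a common community rule of thumb of very roughly 150-200 bytes of NameNode heap per namespace object (file, directory, or block). The per-object size and the blocks-per-file ratio below are assumptions for illustration, not measurements from this cluster:

```java
public class NnHeapEstimate {
    public static void main(String[] args) {
        long files = 27_000_000L;      // from the report above
        long blocksPerFile = 2;        // assumption: average blocks per file
        long bytesPerObject = 200;     // rough rule-of-thumb upper end

        // Every file and every block is a live object on the NameNode heap.
        long objects = files + files * blocksPerFile;
        long gb = objects * bytesPerObject / 1_000_000_000L;

        System.out.println("~" + gb + " GB of heap for " + objects + " namespace objects");
    }
}
```

Under these assumptions the estimate lands at about 16 GB, uncomfortably close to the 17GB -Xmx, so steady namespace growth alone could push a previously healthy heap over the edge even without a leak.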
