hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: running out of memory. Reasons?
Date Mon, 27 Sep 2010 21:24:45 GMT
The short answer is: because the oldest HLogs still hold edits from
regions that haven't flushed yet. Having 532 regions is part of the
reason, and I guess you are doing some importing, so the updates must
be spread across a lot of them.

But let's do some math. 32 HLogs of ~64MB each is about 2GB, whereas
each region flushes when its memstore reaches 64MB. Since you have 532
regions, and guessing that your loading pattern is a bit random, it
would take about 33GB of RAM (532 x 64MB) to hold everything before
it all starts flushing. There's also a global memstore max size, which
is 40% of the heap by default, so since you gave 5000MB of heap you
cannot have more than 2000MB of data in all the memstores of a region
server combined. This is actually great, because 32 HLogs together are
about that same size, but where everything gets screwed up is with the
total number of regions getting loaded.
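
For anyone following along, the arithmetic above can be written out as
a small Java sketch. The class and method names are made up for
illustration and are not HBase APIs; the inputs are just the figures
quoted in this thread (532 regions, 64MB flush size, 32 HLogs of ~64MB,
5000MB heap, 40% global memstore limit).

```java
// Illustrative only: hypothetical helper, not part of HBase.
// All sizes are in MB and come from the figures quoted in the thread.
class MemstoreMath {

    // Memory needed if every region fills its memstore before flushing.
    static long worstCaseMemstoreMb(int regions, long flushSizeMb) {
        return regions * flushSizeMb;
    }

    // Total edits the HLogs can hold before the max-HLogs limit is hit.
    static long hlogCapacityMb(int maxHlogs, long hlogSizeMb) {
        return maxHlogs * hlogSizeMb;
    }

    // Global memstore ceiling as a percentage of the heap.
    static long globalMemstoreLimitMb(long heapMb, int percent) {
        return heapMb * percent / 100;
    }

    public static void main(String[] args) {
        System.out.println("worst-case memstores: "
                + worstCaseMemstoreMb(532, 64) + " MB");
        System.out.println("HLog capacity:        "
                + hlogCapacityMb(32, 64) + " MB");
        System.out.println("global memstore max:  "
                + globalMemstoreLimitMb(5000, 40) + " MB");
    }
}
```

The first figure works out to 34048MB (about 33GB), against roughly
2GB for both of the other two limits.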

So, you can set the max number of HLogs higher, but you still have the
same amount of memory, so you'll run into the max global memstore size
instead of the max number of HLog files. That still has the effect of
force flushing small regions (which triggers compactions, and
everything becomes way less efficient than it is designed to be). I
still cannot explain how you ended up with 934 store files to compact,
but you should definitely take great care to get the number of regions
per region server down to a more manageable level. Did you play with
MAX_FILESIZE on that table?
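
A rough sketch of why raising the HLog limit alone doesn't buy much.
It assumes edits are spread evenly across all regions, which is a
simplification, and the value 64 for a raised HLog limit is only a
hypothetical example. The class and method names are invented for
illustration and are not HBase APIs.

```java
// Illustrative only: hypothetical helper, not part of HBase.
// Sizes are in MB; assumes edits are spread evenly across regions.
class FlushPressure {

    // Whichever limit is smaller is the one that forces flushes first.
    static long bindingLimitMb(long hlogCapacityMb, long globalMemstoreLimitMb) {
        return Math.min(hlogCapacityMb, globalMemstoreLimitMb);
    }

    // Average memstore size per region when forced flushing kicks in.
    static double avgMemstoreAtForcedFlushMb(long limitMb, int regions) {
        return (double) limitMb / regions;
    }

    public static void main(String[] args) {
        long globalLimitMb = 5000L * 40 / 100; // 40% of a 5000MB heap
        int regions = 532;
        // 32 is the limit discussed in the thread; 64 is a hypothetical raise.
        for (int maxHlogs : new int[] {32, 64}) {
            long limitMb = bindingLimitMb(maxHlogs * 64L, globalLimitMb);
            System.out.printf("maxHlogs=%d: binding limit %d MB, ~%.1f MB per region%n",
                    maxHlogs, limitMb, avgMemstoreAtForcedFlushMb(limitMb, regions));
        }
    }
}
```

Under these assumptions the binding limit stays at 2000MB either way,
so regions get flushed at only a few MB each, far below the 64MB
flush size.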


On Mon, Sep 27, 2010 at 1:46 PM, Jack Levin <magnito@gmail.com> wrote:
> http://pastebin.com/S7ETUpSb
> and
> Too many hlogs files:
> http://pastebin.com/j3GMynww
> Why do I have so many hlogs?
> -Jack
> On Mon, Sep 27, 2010 at 1:33 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> You could set the blocking store files setting higher (we have it at
>> 17 here), but looking at the log I see it was blocking for 90secs only
>> to flush a 1MB file. Why was that flush requested? Global memstore
>> size reached? The log from a few lines before should tell.
>> J-D
>> On Mon, Sep 27, 2010 at 1:18 PM, Jack Levin <magnito@gmail.com> wrote:
>>> I see it:  http://pastebin.com/tgQHBSLj
>>> Interesting situation indeed.  Any thoughts on how to avoid it?  Have
>>> compaction running more aggressively?
>>> -Jack
>>> On Mon, Sep 27, 2010 at 1:00 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>> Can you grep around the region server log files to see what was going
>>>> on with that region during the previous run? There's only 1 way I see
>>>> this happening, and it would require that your region server would be
>>>> serving thousands of regions and that this region was in queue to be
>>>> compacted behind all those thousands of regions, and in the mean time
>>>> the flush blocker of 90 seconds would timeout at least enough times so
>>>> that you would end up with all those store files (which according to
>>>> my quick calculation, would mean that it took about 23 hours before
>>>> the region server was able to compact that region which is something
>>>> I've never seen, and it would have killed your region server with
>>>> OOME). Do you see this message often?
>>>>       LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +
>>>>          "ms on a compaction to clean up 'too many store files'; waited " +
>>>>          "long enough... proceeding with flush of " +
>>>>          region.getRegionNameAsString());
>>>> Thx,
>>>> J-D
>>>> On Mon, Sep 27, 2010 at 12:54 PM, Jack Levin <magnito@gmail.com> wrote:
>>>>> Strange: this is what I have:
>>>>>  <property>
>>>>>    <name>hbase.hstore.blockingStoreFiles</name>
>>>>>    <value>7</value>
>>>>>    <description>
>>>>>    If more than this number of StoreFiles in any one Store
>>>>>    (one StoreFile is written per flush of MemStore) then updates are
>>>>>    blocked for this HRegion until a compaction is completed, or
>>>>>    until hbase.hstore.blockingWaitTime has been exceeded.
>>>>>    </description>
>>>>>  </property>
>>>>> I wonder how it got there, I've deleted the files.
>>>>> -jack
>>>>> On Mon, Sep 27, 2010 at 12:42 PM, Jean-Daniel Cryans
>>>>> <jdcryans@apache.org> wrote:
>>>>>> I'd say it's the:
>>>>>> 2010-09-27 12:16:15,291 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.Store: Started compaction of
>>>>>> file(s) in att of
>>>>>> img833,dsc03711s.jpg,1285493435306.da57612ee69d7baaefe84eeb0e49f240. into
>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img833/da57612ee69d7baaefe84eeb0e49f240/.tmp,
>>>>>> sequenceid=618626242
>>>>>> That killed you. I wonder how it was able to get there since the
>>>>>> Memstore blocks flushing if the upper threshold for compactions was
>>>>>> reached (default is 7, did you set it to 1000 by any chance?).
>>>>>> J-D
>>>>>> On Mon, Sep 27, 2010 at 12:29 PM, Jack Levin <magnito@gmail.com> wrote:
>>>>>>> Strange situation: after a cold start of the cluster, one of the
>>>>>>> servers started consuming more and more RAM; you can see it in
>>>>>>> the screenshot I am attaching.  Here is the log:
>>>>>>> http://pastebin.com/MDPJzLQJ
>>>>>>> Nothing seems to be happening, and then it just runs out of
>>>>>>> memory and of course shuts down.
>>>>>>> Here is the GC log before the crash:  http://pastebin.com/GwdC3nhx
>>>>>>> Strange that the other region servers stay up and consume little
>>>>>>> memory (or rather stay stable).
>>>>>>> Any ideas?
>>>>>>> -Jack
