hbase-issues mailing list archives

From "Marc Limotte (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3551) Loaded hfile indexes occupy a good chunk of heap; look into shrinking the amount used and/or evicting unused indices
Date Thu, 10 Mar 2011 19:28:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005272#comment-13005272 ]

Marc Limotte commented on HBASE-3551:
-------------------------------------

I understand this better now.  I did some poking around with the HFile tool.  Average key
length does seem to be around 150 bytes, as I estimated.
 
For one hfile /hbase/foo/fb820ae7002fc96f78165802a0b05e63/metrics/14129209576094096, metadata
is:

avgKeyLen=159, avgValueLen=7, entries=49285512, length=615516343
fileinfoOffset=592314718, dataIndexOffset=592315104, dataIndexCount=131869, metaIndexOffset=0,
metaIndexCount=0, totalBytes=8653853680, entryCount=49285512, version=1

Size of index = length - dataIndexOffset = 615516343 - 592315104 = 23201239 bytes, or about 22 MB

Index data per Region Server = 22 MB * 180 regions = almost 4 GB.  Adding the other column
family, this does seem to add up to the 5 to 6 GB of heap we are seeing.

# of entries per dataindex entry = 49285512 / 131869 = 374
Multiplied by the average key size (159 bytes for this file), that is about 59k per block,
close to the block size of 64k.  So it seems to make sense.
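
The arithmetic above can be rechecked with a short script.  This is only a sketch: the
numbers are copied from the metadata quoted in this message, the variable names are made up
for illustration, and it treats everything from dataIndexOffset to the end of the file as
index, as the estimate above does.  The last check assumes totalBytes is the uncompressed
size of the data blocks, which is an assumption and not something the HFile tool output states.

```python
# Figures copied from the HFile metadata quoted in this message.
length = 615_516_343             # file length on disk, in bytes
data_index_offset = 592_315_104  # where the data index starts
data_index_count = 131_869       # number of index entries (one per data block)
entries = 49_285_512             # number of KeyValue entries in the file
avg_key_len = 159                # average key length, in bytes
total_bytes = 8_653_853_680      # assumed: uncompressed size of the data blocks

REGIONS_PER_SERVER = 180         # taken from the estimate in this message

# Everything after dataIndexOffset is counted as index (a slight overestimate).
index_bytes = length - data_index_offset
index_mib = index_bytes / 2**20
per_server_gib = index_bytes * REGIONS_PER_SERVER / 2**30

# How many entries, and how many key bytes, one index entry covers.
entries_per_block = entries / data_index_count
key_bytes_per_block = entries_per_block * avg_key_len

# Average cost of one index entry, and average uncompressed block size.
index_bytes_per_entry = index_bytes / data_index_count
avg_block_bytes = total_bytes / data_index_count

print(f"index size per file:     {index_bytes} bytes ({index_mib:.1f} MiB)")
print(f"index per region server: {per_server_gib:.2f} GiB for one family")
print(f"entries per index entry: {entries_per_block:.0f}")
print(f"key bytes per block:     {key_bytes_per_block:.0f}")
print(f"bytes per index entry:   {index_bytes_per_entry:.0f}")
print(f"avg block size:          {avg_block_bytes:.0f}")
```

Each index entry costs a little more than one average key, which is what one would expect if
the index stores the first key of every block plus a small amount of bookkeeping.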

I also looked at the keyvalue pairs using the HFile tool (a section of output is below).

We have a few billion rows (2 - 4 billion).  I haven't done a full row count.

What I didn't understand previously is that it's not 374 rows, but 374 "entries".  An entry
means a single column entry and the key is repeated for each column value.  Given our fairly
large key, that would add up quickly.
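
To illustrate why the key is repeated per column, here is a rough sketch based on my
understanding of the KeyValue key layout in HBase of this era: a 2-byte row length, the row,
a 1-byte family length, the family, the qualifier, an 8-byte timestamp, and a 1-byte type.
The row key, qualifier names, and column count below are hypothetical and chosen only to
land near the observed average key length; only the family name "metrics" comes from the
file path above.

```python
def keyvalue_key_len(row: bytes, family: bytes, qualifier: bytes) -> int:
    """Length of the key part of one KeyValue, under the layout assumed above."""
    return (
        2 + len(row)        # 2-byte row length, then the row key
        + 1 + len(family)   # 1-byte family length, then the family name
        + len(qualifier)    # column qualifier
        + 8                 # timestamp
        + 1                 # key type (Put, Delete, ...)
    )


# Hypothetical row: a 120-byte string row key with 10 columns in one family.
row = b"x" * 120
family = b"metrics"
qualifiers = [f"metric_name_{i:02d}".encode() for i in range(10)]

per_entry = [keyvalue_key_len(row, family, q) for q in qualifiers]
total_key_bytes = sum(per_entry)
row_key_bytes_stored = len(row) * len(qualifiers)

print(f"key bytes per entry:        {per_entry[0]}")
print(f"key bytes for the whole row: {total_key_bytes}")
print(f"of which repeated row key:   {row_key_bytes_stored}")
```

In this made-up case the 120-byte row key is written ten times, once per column, so most of
the key bytes in a block are copies of the same row key.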

Solutions
1) Increase the HBase block size (I did this, and it resolved our situation for now)
2) Modify our schema to use smaller keys - perhaps IDs instead of string names.
3) Modify our schema to have fewer columns - we could combine several related columns into
one compound value.
4) Add an LRU cache for storefile indexes

Given the other options, #4 may not be warranted, so I think we can close this issue.
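
For option 1, a rough way to estimate the effect is to assume one index entry per data block
and unchanged key sizes, so that index size scales inversely with block size.  The function
below is a hypothetical helper for that back-of-the-envelope model, not anything from HBase;
the starting figures are the ones from the file examined above, and real results depend on
the data once files are rewritten with the new block size.

```python
def estimated_index_bytes(current_index_bytes, current_block_size, new_block_size):
    """Scale the index size by the ratio of block sizes (one index entry per block)."""
    return current_index_bytes * current_block_size / new_block_size


CURRENT_INDEX_BYTES = 615_516_343 - 592_315_104  # length - dataIndexOffset
CURRENT_BLOCK_SIZE = 64 * 1024                   # the 64k block size mentioned above

for new_kib in (64, 128, 256):
    est = estimated_index_bytes(CURRENT_INDEX_BYTES, CURRENT_BLOCK_SIZE, new_kib * 1024)
    print(f"{new_kib:>3}k blocks: ~{est / 2**20:.1f} MiB of index per file")
```

The trade-off is that a larger block means more data is read and scanned for each random
lookup, so the smaller index comes at some cost to point reads.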


> Loaded hfile indexes occupy a good chunk of heap; look into shrinking the amount used
and/or evicting unused indices
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3551
>                 URL: https://issues.apache.org/jira/browse/HBASE-3551
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: stack
>
> I hung with a user Marc and we were looking over configs and his cluster profile up on
ec2.  One thing we noticed was that his 100+ 1G regions of two families had ~2.5G of heap
resident.  We did a bit of math and couldn't get to 2.5G so that needs looking into.  Even
still, 2.5G is a bunch of heap to give over to indices (He actually OOME'd when he had his
RS heap set to just 3G; we shouldn't OOME, we should just run slower).  It sounds like he
needs the indices loaded but still, for some cases we should drop indices for unaccessed files.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
