hbase-user mailing list archives

From "Billy Pearson" <sa...@pearsonwholesale.com>
Subject Re: Map File index bug?
Date Thu, 06 Nov 2008 16:08:30 GMT
Yes, I just hacked it so I could see what size it was without compression, so I
could compare it with what it's taking in memory.

Billy

"stack" <stack@duboce.net> wrote in message 
news:7c962aed0811060802k645d8ff0r1ddc7be94b74dba8@mail.gmail.com...
> But it's decompressed when it's read into memory, right?  So the size will be
> the same in memory whether it was compressed in the filesystem or not?  Or am
> I missing something, Billy?
> St.Ack
>
> On Thu, Nov 6, 2008 at 7:55 AM, Billy Pearson 
> <sales@pearsonwholesale.com>wrote:
>
>> There is no method to change the compression of the index; it's just always
>> block compressed.
>> I hacked the code and changed it to non-compressed so I could get a size of
>> the index without compression.
>> Opening all 80 mapfiles took 4x the memory of the uncompressed size of all
>> the index files.
>>
>>
>> "stack" <stack@duboce.net> wrote in message
>> news:7c962aed0811060026j660e4d87hfe3fc0ce7895ff7e@mail.gmail.com...
>>
>>  On Wed, Nov 5, 2008 at 11:52 PM, Billy Pearson
>>> <sales@pearsonwholesale.com>wrote:
>>>
>>>>
>>>>
>>>> I ran a job on 80 mapfiles to write 80 new files with non-compressed
>>>> indexes, and it still took ~4X the memory of the uncompressed index files
>>>> to load into memory
>>>>
>>>
>>>
>>> Sorry Billy, how did you specify non-compressed indices?  What took 4X
>>> memory?  The non-compressed index?
>>>
>>>
>>>> Could have to do with the way they grow the arrays storing the positions of
>>>> the keys, starting on line 333.
>>>> Looks like they are copying arrays and making a new one 150% bigger than
>>>> the last as needed.
>>>> Not sure about Java and how long before the old array will be reclaimed
>>>> from memory.
>>>>
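A minimal sketch of the grow-by-150%-and-copy pattern described above (illustrative names, not the actual MapFile code); note that both the old and the new array stay reachable until the copy finishes and the old reference is dropped:

    class GrowingPositions {
      private long[] positions = new long[1024];
      private int count = 0;

      void add(long pos) {
        if (count == positions.length) {
          // grow by 50% and copy; two arrays are live during this window
          long[] bigger = new long[(positions.length * 3) / 2];
          System.arraycopy(positions, 0, bigger, 0, count);
          positions = bigger;      // the old array only becomes collectible here
        }
        positions[count++] = pos;
      }
    }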
>>>
>>>> I have seen it a few times recover down to about ~2x the size of the
>>>> uncompressed index files, but only twice.
>>>>
>>>>
>>>
>>> Unreferenced Java objects will be let go variously.  It depends on your JVM
>>> configuration.  Usually they'll be let go when the JVM needs the memory (links
>>> like this may be of help:
>>> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
>>> )
>>>
>>>
>>>
>>>> I am testing by creating the files with a MR job and then loading the map
>>>> files in a simple program that opens the files and finds the midkey, so the
>>>> index gets read into memory, and watching the top command.
>>>> I also added -Xloggc:/tmp/gc.log and watched the memory usage go up; it
>>>> matches for the most part with top.
>>>>
>>>> I tried running System.gc() to force a cleanup of the memory, but it did not
>>>> seem to help any.
>>>>
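Roughly what such a test looks like against the Hadoop MapFile API of that era; the path, class name, and file count here are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;

    public class IndexMemoryTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        List<MapFile.Reader> readers = new ArrayList<MapFile.Reader>();
        for (int i = 0; i < 80; i++) {
          MapFile.Reader reader = new MapFile.Reader(fs, "/test/mapfile-" + i, conf);
          reader.midKey();             // forces the index to be read into memory
          readers.add(reader);         // hold the reader so the index stays live
        }
        Thread.sleep(Long.MAX_VALUE);  // park so heap can be watched in top/jconsole
      }
    }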
>>>>
>>> Yeah, it's just a suggestion.  The gc.log should give you a better clue of
>>> what's going on.  What's it saying?  Lots of small GCs and then a full GC
>>> every so often?  Is the heap discernibly growing?  You could enable JMX
>>> for the JVM and connect with jconsole.  This can give you a more detailed
>>> picture of the heap.
>>>
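For reference, the relevant switches on a Java 6 HotSpot JVM look roughly like this (the port number is arbitrary, authentication is disabled only because this is a local test, and IndexMemoryTest is the hypothetical test class sketched above):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log \
         -Dcom.sun.management.jmxremote \
         -Dcom.sun.management.jmxremote.port=10102 \
         -Dcom.sun.management.jmxremote.authenticate=false \
         -Dcom.sun.management.jmxremote.ssl=false \
         IndexMemoryTest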
>>> St.Ack
>>> P.S. Check out HBASE-722 if you have a sec.
>>>
>>>
>>>
>>>> Billy
>>>>
>>>>
>>>> "Billy Pearson" <sales@pearsonwholesale.com> wrote in message
>>>> news:ger0jq$800$1@ger.gmane.org...
>>>>
>>>>> I've been looking over the MapFile class in Hadoop for memory problems and
>>>>> think I might have found an index bug.
>>>>>
>>>>> org.apache.hadoop.io.MapFile
>>>>> line 202
>>>>> if (size % indexInterval == 0) {            // add an index entry
>>>>>
>>>>> This is where it's writing the index, adding an entry only every
>>>>> indexInterval rows.
>>>>>
>>>>> Then on the loading of the index,
>>>>> line 335:
>>>>>
>>>>>        if (skip > 0) {
>>>>>          skip--;
>>>>>          continue;                             // skip this entry
>>>>>
>>>>> we are only reading in every skip-th entry.
>>>>>
>>>>> So with the default of 32, I think in HBase we are only writing an index
>>>>> entry to the index file every 32 rows, and then only reading back every
>>>>> 32nd of those on load,
>>>>>
>>>>> so we only get an index entry every 1024 rows.
>>>>>
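If the write-side interval and the read-side skip really do compound like that, the arithmetic is easy to see in a toy example (both values assumed to be 32 purely for illustration; this is not the MapFile code):

    public class IndexStride {
      public static void main(String[] args) {
        int indexInterval = 32;      // write side: one index entry per 32 rows
        int skip = 32;               // read side: keep only one entry in 32
        long rows = 1000000L;

        long written = rows / indexInterval;   // entries in the index file
        long kept = written / skip;            // entries left in memory
        System.out.println("written=" + written + ", kept=" + kept);
        // prints written=31250, kept=976 -- roughly one entry per 1024 rows
      }
    }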
>>>>> Take a look and confirm, and we can open a bug on Hadoop about it.
>>>>>
>>>>> Billy
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
> 


