hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: prefix compression
Date Sat, 04 Jun 2011 07:18:19 GMT
I varied the ms increment randomly between 1-20, then created 10 mil
dates.  The FST was then 58,481,582 bytes, eg, 57 MB.  Guess it's not
perfect!  19,739,994 bytes, eg, 18.8 MB for random 1-5 increments.  I
think that's still pretty good.  I need to try varying the long value
stored alongside to see how much that affects the total size.

How many keys are there for a typical region, and total per region
server?  One could try MMap'ing the FST however because access to it
is completely random, there's a high risk of page faults.

If a key set is sequentially incremented by 1, then that's clearly
optimal.  I'd guess that's a fairly common case though?

> How would you search a 'key' in the FST?

Searching on a key is via BytesRefFSTEnum.seekCeil method, which'll
match anything greater than or equal to the given parameter.

Clearly more benchmarking is required, however this would seem to be a
highly promising avenue of research for HBase.

On Fri, Jun 3, 2011 at 10:57 PM, Stack <stack@duboce.net> wrote:
> That can't be true?  (smile)  How would you search a 'key' in the FST?
> St.Ack
> On Fri, Jun 3, 2011 at 9:01 PM, Jason Rutherglen
> <jason.rutherglen@gmail.com> wrote:
>> Yeah it's truly super wild!  Here's the code: http://pastebin.com/bnB53UQz
>> You can see the line that's adding the string:
>> fstBuilder.add(new BytesRef(date), new Long(x));
>> On Fri, Jun 3, 2011 at 8:56 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
>>> Jason - are you feeding it that whole string for each date?  Input data is
>>> 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes?
>>>  Is it possible to compress by that much?  Maybe I'm missing something about
>>> how the FST works.
>>> Matt
>>> On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen <jason.rutherglen@gmail.com
>>>> wrote:
>>>> Also the next thing to measure with the FST is the key lookup speed.
>>>> I'm not sure what that'd look like, or how to compare with HBase right
>>>> now?
>>>> On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen
>>>> <jason.rutherglen@gmail.com> wrote:
>>>> > Here's a nice preliminary number with the FST, 50 million dates of the
>>>> > form yyyyMMddHHmmssSSS, with each incremented by one millisecond.  The
>>>> > FST is 984 bytes, with an incrementing long to point to the presumably
>>>> > MMap'd value data.  This's a bit crazy.
>>>> >
>>>> > Perhaps we should try other increments as well?  Given that HBase keys
>>>> > especially are probably close increments of each other, I think the
>>>> > FST can always be loaded into RAM with pointers out to the actual
>>>> > values.
>>>> >

View raw message