hbase-dev mailing list archives

From Jacek Migdal <ja...@fb.com>
Subject Re: prefix compression implementation
Date Wed, 14 Sep 2011 22:43:00 GMT

Thanks a lot for the code. Great job!

As I mentioned in JIRA, I work full time on the delta encoding [1]. Right
now the code and integration are almost done. Most of the parts are under
review. Since it is a big change, we plan to test it very carefully.
After that, it will be ported to trunk and open sourced.

From a quick glimpse, I have taken a different approach. I implemented
a few different algorithms which are simpler. They also aim mostly to
save space while having fast decompress/compress code. However, the access
is still sequential. The goal of my project is to save some RAM by having
a compressed BlockCache in memory.

On the other hand, it seems that you are most concerned about seeking.

I will read your code more carefully. From a quick glimpse: we both implemented
some routines (like vint), but apart from that I expect there is no overlap.
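Since both implementations mention vint routines, here is a minimal sketch of such a variable-length int codec. The class and method names are made up for illustration, and this is the common 7-bits-per-byte scheme with the high bit as a continuation flag, not necessarily the exact encoding either patch uses:

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {

    // Write the int in 1-5 bytes: low 7 bits per byte, high bit set
    // while more bytes follow.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Read the int back from the start of the array.
    static int decode(byte[] bytes) {
        int value = 0, shift = 0, i = 0;
        byte b;
        do {
            b = bytes[i++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[]{0, 1, 127, 128, 300, Integer.MAX_VALUE}) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + enc.length + " byte(s), round-trips: " + (decode(enc) == v));
        }
    }
}
```

The payoff is that small ints (the common case for intra-block pointers) take one or two bytes instead of four.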

I have also seen that you spent some time investigating ByteBuffer vs. byte[].
I experienced a significant negative performance impact when I switched to
ByteBuffer. However, I have postponed this optimization.

Right now I think the easiest way to go would be for you to implement the
DeltaEncoder interface after my change:
(note: there might be some minor changes)

That way, you will reuse my integration with existing code for "free".
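As a rough illustration of the plug-in idea, such an interface might look like the following. Only the name DeltaEncoder comes from the mail; the method names, signatures, and the pass-through implementation are assumptions, not the actual interface from the patch:

```java
import java.util.Arrays;

public class DeltaEncoderSketch {

    interface DeltaEncoder {
        // Compress a block of serialized KeyValues into the on-disk form.
        byte[] compressKeyValues(byte[] uncompressedBlock);

        // Restore the original serialized KeyValues from the on-disk form.
        byte[] uncompressKeyValues(byte[] compressedBlock);
    }

    // Trivial pass-through implementation, useful as a baseline and for
    // exercising the surrounding integration code.
    static class CopyDeltaEncoder implements DeltaEncoder {
        public byte[] compressKeyValues(byte[] block) {
            return Arrays.copyOf(block, block.length);
        }
        public byte[] uncompressKeyValues(byte[] block) {
            return Arrays.copyOf(block, block.length);
        }
    }

    public static void main(String[] args) {
        DeltaEncoder encoder = new CopyDeltaEncoder();
        byte[] block = {1, 2, 3, 4};
        byte[] roundTripped = encoder.uncompressKeyValues(encoder.compressKeyValues(block));
        System.out.println(Arrays.equals(block, roundTripped)); // true
    }
}
```

A prefix-trie encoder would then be just another implementation of the same interface, picking up the HFile integration without touching it.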


[1] - I prefer to call it that way. Prefix is one of the algorithms, but
there are also different approaches.

On 9/13/11 1:36 AM, "Ted Yu" <yuzhihong@gmail.com> wrote:

>Thanks for the update.
>Cacheable interface is defined in:
>You can find the implementation at:
>I will browse your code later.
>On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan <mcorgan@hotpads.com> wrote:
>> Hi devs,
>> I put a "developer preview" of a prefix compression algorithm on github.
>>  It
>> still needs some details worked out, a full set of iterators, about 200
>> optimizations, and a bunch of other stuff...  but, it successfully passed
>> some preliminary tests so I thought I'd get it in front of more eyeballs
>> sooner than later.
>> https://github.com/hotpads/hbase-prefix-trie
>> It depends on HBase's Bytes.java and KeyValue.java, which depend on
>> hadoop.
>>  Those jars are in there, along with some others for full HFile testing in
>> the near future.
>> A good place to start looking at the code
>> is org.apache.hadoop.hbase.keyvalue.trie.builder.KeyValuePtBuilder.
>> It is the main driver of the compaction side of things, taking KeyValues (in
>> sorted order) and generating a byte[] to be saved as a disk block.  Then
>> for
>> reading it back in, there is trie.compact.read.PtIterator which takes a
>> byte[] and spits out KeyValues.  The main test class is
>> trie.test.row.RowBuilderTests which round-trips a bunch of KeyValues to
>> make
>> sure the outputs are the same as the inputs.
>>  trie.compact.row.node.PtRowNode is the format of a single trie node in the
>> underlying byte[].
>> The current version generates a lot of little objects (garbage) while
>> writing and reading.  I plan to optimize it to the point where most
>> variables are primitives on the stack (at least when reading), but I think
>> these intermediate classes are important for debugging.  I'll probably want
>> to keep them around going forward and develop a more compact version in
>> parallel.
>> It uses trie style compression for the row keys and column qualifiers,
>> where
>> pointers between nodes are compacted ints.  It keeps a list of
>> de-duped deltas for timestamps, and if they're all the same, it stores
>> one (long) per block.  If all KeyValue.TYPE operations are the same,
>> it
>> only stores one (byte) per block.
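The timestamp trick described above (one base long per block plus a de-duped delta list, collapsing to a single long when all timestamps match) could look roughly like this sketch; the class and field names are made up for illustration:

```java
import java.util.LinkedHashSet;

public class TimestampDeltaSketch {

    final long base;     // minimum timestamp in the block
    final long[] deltas; // de-duped deltas from the base, in first-seen order

    TimestampDeltaSketch(long[] timestamps) {
        long min = Long.MAX_VALUE;
        for (long t : timestamps) min = Math.min(min, t);
        base = min;
        // LinkedHashSet de-dupes while preserving first-seen order.
        LinkedHashSet<Long> unique = new LinkedHashSet<>();
        for (long t : timestamps) unique.add(t - min);
        deltas = unique.stream().mapToLong(Long::longValue).toArray();
    }

    // When every KeyValue shares one timestamp, only the base long
    // needs to be stored for the whole block.
    boolean singleValue() {
        return deltas.length == 1 && deltas[0] == 0;
    }

    public static void main(String[] args) {
        TimestampDeltaSketch same = new TimestampDeltaSketch(new long[]{100, 100, 100});
        System.out.println(same.singleValue()); // true: store only the base long
        TimestampDeltaSketch mixed = new TimestampDeltaSketch(new long[]{105, 100, 107});
        System.out.println(mixed.base + " " + mixed.deltas.length); // 100 3
    }
}
```

Each KeyValue then only needs a small index into the delta list instead of a full 8-byte timestamp; the same all-identical shortcut applies to the KeyValue.TYPE byte.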
>> It's designed around efficient cpu cache usage and elimination of 8 byte
>> pointers, so it should be fast.  Get calls can traverse the trie nodes to
>> look up a row key while barely loading anything from memory to cache, as
>> opposed to current hbase, which may load the better part of a block into cache.
>>  Scanners/Filters/Comparators can all be designed to be trie-aware so they
>> can iterate through 20 columnQualifiers in the same row without
>> re-scanning the rowKey bytes... etc...
>> Here are a few questions we'll need to answer at some point:
>> * where should this code go?
>>    - i'd vote for keeping it isolated and limiting references back to the
>> main hbase project.  sort of like the gzip/lzo algorithms.
>>    - if it's strictly isolated, it'll be easier to keep it well tested for
>> correctness/performance and let more people tinker with it to make it
>> faster.  it'll also force us to come up with the right interfaces to support
>> other compression implementations.
>> * current HFileWriter sends KeyValues to the OutputStream as soon as
>> they're
>> processed, but would it be ok to queue up a whole block in memory and write
>> it all at once?
>>    - i'd vote for yes.  it makes it easier to arrange the data to be
>> read-friendly.  also, we're only talking about one block of data, which is
>> presumably a fraction of the block cache's size
>> * should the block bytes be accessed as a byte[] or ByteBuffer?  i know
>> there's been work on off-heap cache, but i've read the blocks are pulled
>> on-heap before they're parsed (??)
>>    - see 
>> for a comparison of byte[] vs ByteBuffer speed tests.  ByteBuffer looks
>> ~2-10x slower, but some people need the off-heap cache
>>    - i'll propose maintaining separate reading algorithms for each, given
>> that the underlying bytes are in the exact same format.  should be easy
>> to copy/paste the code and replace bytes[i] with ByteBuffer.get(i), and
>> create parallel versions of static util methods
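The "parallel readers over one byte format" idea above can be sketched as the same decode routine written twice, once against byte[] and once against ByteBuffer, exactly as the copy/paste-and-replace approach would produce. The class and method names here are illustrative, not from the actual code:

```java
import java.nio.ByteBuffer;

public class ParallelReadSketch {

    // Decode a 4-byte big-endian int from a byte[].
    static int readInt(byte[] bytes, int offset) {
        return ((bytes[offset] & 0xFF) << 24)
             | ((bytes[offset + 1] & 0xFF) << 16)
             | ((bytes[offset + 2] & 0xFF) << 8)
             |  (bytes[offset + 3] & 0xFF);
    }

    // Identical logic with bytes[i] replaced by buffer.get(i), so both
    // paths read the exact same on-disk format.
    static int readInt(ByteBuffer buffer, int offset) {
        return ((buffer.get(offset) & 0xFF) << 24)
             | ((buffer.get(offset + 1) & 0xFF) << 16)
             | ((buffer.get(offset + 2) & 0xFF) << 8)
             |  (buffer.get(offset + 3) & 0xFF);
    }

    public static void main(String[] args) {
        byte[] raw = {0, 0, 1, 44}; // 300 in big-endian
        System.out.println(readInt(raw, 0));                  // 300
        System.out.println(readInt(ByteBuffer.wrap(raw), 0)); // 300
    }
}
```

The duplication is the price for keeping the byte[] fast path free of ByteBuffer's bounds-checked virtual calls while still supporting off-heap buffers.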
>> * each block has some metadata wrapped in a class called PtBlockMeta.  does
>> HBase currently have a way to store its values as java primitives on the
>> heap rather than parsing them out of the byte[]/ByteBuffer on every access?
>>  if they have to be parsed out on every block access, that could take more
>> time than the Get query it's trying to perform.
>>    - I know there's a new Cacheable interface or something like that.  maybe
>> that already supports it or could be enhanced
>> What jiras do you think i should make?
>> Look forward to hearing people's feedback,
>> Matt
