From: Matt Corgan <mcorgan@hotpads.com>
To: dev@hbase.apache.org
Date: Tue, 13 Sep 2011 00:44:15 -0700
Subject: prefix compression implementation

Hi devs,

I put a "developer preview" of a prefix compression algorithm on github.  It still needs some details worked out, a full set of iterators, about 200 optimizations, and a bunch of other stuff... but it successfully passes some preliminary tests, so I thought I'd get it in front of more eyeballs sooner rather than later.

https://github.com/hotpads/hbase-prefix-trie

It depends on HBase's Bytes.java and KeyValue.java, which in turn depend on hadoop.  Those jars are in there, along with some others for full HFile testing in the near future.

A good place to start looking at the code is org.apache.hadoop.hbase.keyvalue.trie.builder.KeyValuePtBuilder.  It's the main driver of the compaction side of things: it takes KeyValues (in sorted order) and generates a byte[] to be saved as a disk block.  For reading it back in, there is trie.compact.read.PtIterator, which takes the byte[] and spits out KeyValues.  The main test class is trie.test.row.RowBuilderTests, which round-trips a bunch of KeyValues to make sure the outputs are the same as the inputs.  trie.compact.row.node.PtRowNode is the format of a single trie node in the underlying byte[].

The current version generates a lot of little objects (garbage) while writing and reading.  I plan to optimize it to the point where most variables are primitives on the stack (at least when reading), but I think these intermediate classes are important for debugging.  I'll probably keep them around going forward and develop a more compact version in parallel.

It uses trie-style compression for the row keys and column qualifiers, where pointers between nodes are compacted ints (rough sketch of the prefix-sharing idea below).
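To make the prefix-sharing idea concrete, here's a tiny self-contained sketch.  It is illustration only -- it front-codes each sorted row key against the previous one and uses variable-length "compacted" ints, whereas the repo builds real trie nodes that many keys can share -- and the class and method names below are made up for the example, not taken from the repo:

import java.util.ArrayList;
import java.util.List;

// Illustration only -- NOT the PtRowNode/KeyValuePtBuilder code.
// Front-codes sorted row keys and stores lengths as variable-length
// ("compacted") ints; a real trie factors shared prefixes into nodes
// that many keys point at, instead of coding against the previous key.
public class PrefixSharingSketch {

    // length of the common prefix of two keys
    static int commonPrefixLength(byte[] a, byte[] b) {
        int max = Math.min(a.length, b.length);
        int i = 0;
        while (i < max && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    // encode a non-negative int in as few bytes as possible, 7 bits per byte
    static void writeCompactedInt(List<Byte> out, int value) {
        while ((value & ~0x7F) != 0) {
            out.add((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.add((byte) value);
    }

    public static void main(String[] args) {
        byte[][] sortedRowKeys = {
            "user1234/profile".getBytes(),
            "user1234/settings".getBytes(),
            "user1235/profile".getBytes(),
        };
        List<Byte> block = new ArrayList<Byte>();
        int rawSize = 0;
        byte[] previous = new byte[0];
        for (byte[] key : sortedRowKeys) {
            rawSize += key.length;
            int shared = commonPrefixLength(previous, key);
            writeCompactedInt(block, shared);               // bytes reused from the previous key
            writeCompactedInt(block, key.length - shared);  // length of the new suffix
            for (int i = shared; i < key.length; i++) {
                block.add(key[i]);                          // only the suffix bytes are stored
            }
            previous = key;
        }
        System.out.println("raw key bytes: " + rawSize + ", encoded block: " + block.size());
    }
}

The real trie goes further: many keys can point at one shared prefix node, which is what lets Gets and Scanners walk the block without re-reading rowKey bytes.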
The format also keeps a list of compacted, de-duped deltas for the timestamps, and if they're all the same it stores only one long per block.  If all of the KeyValue type operations are the same, it stores only one byte per block.  It's designed around efficient cpu cache usage and the elimination of 8-byte pointers, so it should be fast.  Get calls can traverse the trie nodes to dig up a row key while barely loading anything from memory into cache, as opposed to current hbase, which may load the better part of a block into cache.  Scanners/Filters/Comparators can all be designed to be trie-aware so they can iterate through 20 columnQualifiers in the same row without constantly re-scanning the rowKey bytes... etc...

Here are a few questions we'll need to answer at some point:

* Where should this code go?
  - I'd vote for keeping it isolated and limiting references back to the main hbase project, sort of like the gzip/lzo algorithms.
  - If it's strictly isolated, it'll be easier to keep it well tested for correctness/performance and let more people tinker with it to make it faster.  It'll also force us to come up with the right interfaces to allow other compression implementations.

* The current HFileWriter sends KeyValues to the OutputStream as soon as they're processed, but would it be ok to queue up a whole block in memory and write it all at once?
  - I'd vote for yes.  It makes it easier to arrange the data to be more read-friendly.  Also, we're only talking about one block of data, which is presumably a fraction of the block cache's size.

* Should the block bytes be accessed as a byte[] or a ByteBuffer?  I know there's been work on an off-heap cache, but I've read that blocks are pulled on-heap before they're parsed (??)
  - See org.apache.hadoop.hbase.keyvalue.trie.profile.MemoryAccessProfiler for a comparison of byte[] vs ByteBuffer speed tests.  ByteBuffer looks ~2-10x slower, but some people need the off-heap cache.
  - I'll propose maintaining separate reading algorithms for each, given that the underlying bytes are in the exact same format.  It should be possible to copy/paste the code and replace bytes[i] with ByteBuffer.get(i), and create parallel versions of the static util methods (rough illustration at the end of this message).

* Each block has some metadata wrapped in a class called PtBlockMeta.  Does HBase currently have a way to store those values as java primitives on the heap rather than parsing them out of the byte[]/ByteBuffer on every access?  If they have to be parsed out on every block access, that could take more time than the Get query it's trying to perform.
  - I know there's a new Cacheable interface or something like that.  Maybe that already supports it or could be enhanced.

What jiras do you think I should make?

Look forward to hearing people's feedback,
Matt
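P.S.  Regarding the byte[] vs ByteBuffer question above, this is roughly what I mean by parallel read methods over the exact same byte format.  It's illustration only -- the class and method names are made up for the example and it isn't code from the repo:

import java.nio.ByteBuffer;

// Illustration only: the same fixed-width int decode, once over byte[]
// (on-heap blocks) and once over ByteBuffer (e.g. an off-heap cache).
// The logic is identical -- bytes[i] is simply replaced by buffer.get(i).
public class ParallelReadSketch {

    static int readInt(byte[] block, int offset) {
        return ((block[offset]     & 0xFF) << 24)
             | ((block[offset + 1] & 0xFF) << 16)
             | ((block[offset + 2] & 0xFF) << 8)
             |  (block[offset + 3] & 0xFF);
    }

    static int readInt(ByteBuffer block, int offset) {
        // absolute gets only, so the buffer's position is never touched
        return ((block.get(offset)     & 0xFF) << 24)
             | ((block.get(offset + 1) & 0xFF) << 16)
             | ((block.get(offset + 2) & 0xFF) << 8)
             |  (block.get(offset + 3) & 0xFF);
    }

    public static void main(String[] args) {
        byte[] onHeap = {0, 0, 1, 42};
        ByteBuffer offHeap = ByteBuffer.allocateDirect(4);
        offHeap.put(onHeap);
        System.out.println(readInt(onHeap, 0));   // 298
        System.out.println(readInt(offHeap, 0));  // 298 -- same bytes, same answer
    }
}

The static util methods could get the same copy/paste treatment, so the byte[] path never pays for an abstraction over the two access styles.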