hbase-dev mailing list archives

From Matt Corgan <mcor...@hotpads.com>
Subject Re: HBase as a large, auto-partitioned, sorted, *in-memory* database (was: Re: prefix compression implementation)
Date Mon, 19 Sep 2011 01:43:39 GMT
I think there are two major issues right now, both related to the data and index
block formats.

First problem is the amount of duplicate information in the uncompressed
block format.  Taking 100GB of hot data from mysql and serving it out of
hbase might require 1-2TB of memory.  The main culprit there is the fact
that the row key is repeated for every cell.  The prefix trie not only
eliminates that penalty, but also removes the duplicate information between
consecutive row keys.  Then it builds a reverse trie of column qualifiers so
each cell can reference its qualifier with a 1-2 byte index.  Timestamps
and op types can often be eliminated, and most integers inside the cell can
be represented in 1-2 bytes.  All of this is relatively easy for the data
and index blocks because they're immutable.  Doing it for the memstore is
another story...
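To make the row-key savings concrete, here's a minimal sketch of the basic idea
(not the actual HFile format or the patch's trie structure): since keys in a
block are sorted, each key only needs to store the suffix it doesn't share with
the previous key, plus a small shared-prefix length.  The class and key names
are made up for illustration.

```java
import java.util.List;

// Hypothetical sketch: encode each sorted key as (shared-prefix length
// with the previous key, remaining suffix).
public class PrefixEncoder {

    // Total bytes needed if each key stores only its unshared suffix
    // plus a 1-byte shared-prefix length.
    public static int encodedSize(List<String> sortedKeys) {
        int total = 0;
        String prev = "";
        for (String key : sortedKeys) {
            int shared = sharedPrefixLength(prev, key);
            total += 1 + (key.length() - shared); // 1 byte for prefix length
            prev = key;
        }
        return total;
    }

    static int sharedPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        // Reversed-domain url keys, like the contrived example below.
        List<String> keys = List.of(
            "com.example.www/listings/1001",
            "com.example.www/listings/1002",
            "com.example.www/listings/1003");
        int flat = keys.stream().mapToInt(String::length).sum();
        System.out.println("flat: " + flat + " bytes, prefix-encoded: "
            + encodedSize(keys) + " bytes");
    }
}
```

With 100GB of hot data, that kind of per-key redundancy is where most of the
1-2TB blowup comes from.  The real trie goes further by sharing structure
across all keys, not just adjacent ones.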

Here's some contrived data to illustrate.  Row key is a (reversed
domain) url.  Column family is called "familyName".  ColQualifier is a
browser type.  Value is the number of views from that browser type.  Here's
the current representation <http://pastebin.com/7ks8kzJ2>, the prefix trie's
toString output <http://pastebin.com/cL4AkCPC> for the row keys, and removing
the whitespace <http://pastebin.com/4qiSXNh9> from the toString output.  The
current version is really wasteful.  Another example is storing counters
with long names.  OpenTSDB does this elegantly, but with a lot of work to
get around these issues.  It would be nice to not have to do all that.

The second problem is the lack of random access to cell offsets within the
data block.  (I'm not 100% sure on this one, so please correct me if I'm
wrong).  I noticed how bad this problem is when I was storing historical
event logs with 8-byte keys and small values (so ~30 bytes per cell).  I had to
increase block size to 256KB because the block indexes were too big to fit
in memory.  Then I needed fast random access to these events.  The problem
is that there are ~10,000 cells per block, so without random lookups inside
the block, it's seeking through ~5,000 keys for the average lookup.  That's
a crazy amount of overhead to retrieve a single cell.  Probably the quickest
solution to this problem is to store the offset of each key in a list of
integers at the end of the block so that it's possible to do a binary search
inside the block.  That would reduce it to ~14 avg memory accesses to find
the cell.
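A rough sketch of that offset-list idea (assumed layout, not HBase's actual
on-disk format): if the block trailer carries a sorted array of cell offsets,
the reader can binary-search the keys instead of scanning, turning ~5,000
sequential key reads into ~log2(10,000) ≈ 14 probes.  Names here are
illustrative only.

```java
// Sketch: with a per-cell offset array at the end of the block,
// a lookup becomes a binary search over the keyed cells.
public class BlockIndex {

    // keys[i] is the row key of the cell starting at offsets[i].
    // Returns the index of the first cell whose key is >= target,
    // i.e. where a scanner should seek to.
    public static int seek(String[] keys, String target) {
        int lo = 0, hi = keys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[mid].compareTo(target) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```

The offset array costs 4 bytes per cell, but for small cells that's a cheap
trade against reading half the block per lookup.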

But the prefix trie should actually be way faster than a binary search
because you don't have to keep re-comparing the beginning of the key to the one
you're looking for.  I'll save the details for another email or for code
comments.  In general, with a smarter block/index format we could be much
more efficient than the current method of comparing full key byte[] over and
over again.
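The intuition, sketched with a toy character trie (hypothetical structure, not
the patch's byte-level encoding): each byte of the sought key is examined
exactly once on the way down the trie, whereas every probe of a binary search
re-compares the key from the front.

```java
import java.util.TreeMap;

// Toy trie illustrating why descent avoids repeated prefix comparison.
public class TrieLookup {

    static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        boolean isKey; // true if a stored key ends at this node
    }

    final Node root = new Node();

    public void add(String key) {
        Node n = root;
        for (char c : key.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new Node());
        n.isKey = true;
    }

    public boolean contains(String key) {
        Node n = root;
        for (char c : key.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return false; // diverged; each char read only once
        }
        return n.isKey;
    }
}
```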

The same random access problem also applies to the block index, I think
(correct me if I'm wrong here too).

On Sat, Sep 17, 2011 at 9:57 AM, Andrew Purtell <apurtell@apache.org> wrote:

> Hi Matt,
> > My motivation for doing this is to make hbase a viable candidate for a
> > large, auto-partitioned, sorted, *in-memory* database.  Not the usual
> > analytics use case, but i think hbase would be great for this.
> Really interested in hearing your thoughts as to why HBase currently is an
> -- whether or not "viable" -- at least suboptimal candidate for that
> purpose. It has been moving in the direction of being better for that
> purpose ever since 0.89. Where we can further improve would be a good
> discussion to have, the HBase constituency is not only analytics use cases
> as you point out.
> Best regards,
>     - Andy
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
> >________________________________
> >From: Matt Corgan <mcorgan@hotpads.com>
> >To: dev@hbase.apache.org
> >Sent: Friday, September 16, 2011 7:29 PM
> >Subject: Re: prefix compression implementation
> >
> >Ryan - thanks for the feedback.  The situation I'm thinking of where it's
> >useful to parse DirectBB without copying to heap is when you are serving
> >small random values out of the block cache.  At HotPads, we'd like to
> store
> >hundreds of GB of real estate listing data in memory so it can be quickly
> >served up at random.  We want to access many small values that are already
> >in memory, so basically skipping step 1 of 3 because values are already in
> >memory.  That being said, the DirectBB are not essential for us since we
> >haven't run into gb problems, i just figured it would be nice to support
> >them since they seem to be important to other people.
> >
> >My motivation for doing this is to make hbase a viable candidate for a
> >large, auto-partitioned, sorted, *in-memory* database.  Not the usual
> >analytics use case, but i think hbase would be great for this.
