hbase-user mailing list archives

From Dan Washusen <...@reactive.org>
Subject Re: Some performance questions about Indexing and Table schemas
Date Wed, 03 Feb 2010 20:31:28 GMT
Hi Chris,
You could take a look at the new Indexed HBase (IHBase) contrib package.
 It's currently immature and requires considerably more resources than
vanilla HBase but it can make a massive difference to scan times.

There is a Performance Evaluation (PE) attached to HBASE-2167
(https://issues.apache.org/jira/browse/HBASE-2167). If you are OK with
building HBase from source and using patch files you can run the
Performance Evaluation yourself.  The PE does a good job of showing what
IHBase is good and bad at.  Otherwise, the Jira issue includes the "indexed"
scan run time (20 sequential scans for random values):
Without an index: 732989ms (about 36.6 seconds per scan)
With an index: 2160ms (108 milliseconds per scan)
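
For anyone who wants to check the per-scan figures, here is a quick sketch of
the arithmetic. The two totals and the scan count are the ones quoted above;
the variable names are just mine for illustration.

```python
# Totals quoted from the Jira issue: 20 sequential scans in each run.
SCANS = 20
without_index_ms = 732_989
with_index_ms = 2_160

# Average time per scan in each configuration.
per_scan_without_s = without_index_ms / SCANS / 1000
per_scan_with_ms = with_index_ms / SCANS

# Ratio of the two totals (same scan count, so also the per-scan ratio).
speedup = without_index_ms / with_index_ms

print(f"without index: {per_scan_without_s:.1f} s per scan")
print(f"with index:    {per_scan_with_ms:.0f} ms per scan")
print(f"ratio:         {speedup:.0f}x")
```

That ratio only describes this one PE run on one setup, so treat it as a
rough indication rather than something you should expect on your own data.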

At the moment there is a memory issue that causes the indexing process to
take considerably longer than is necessary.  This issue may prevent your use
case (loading lots of data per day) from working smoothly but I'm sure that
issue will be fixed soon.

If you want to give it a test run, check out this post I sent to another
user on this list:

Let me know how you go...


On 4 February 2010 06:27, Chris Bates <christopher.andrew.bates@gmail.com> wrote:

> Also to note: I followed these guidelines
> (http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html)
> and did not observe the same speedup. I'm not quite sure what sort of
> hardware is required to get there.
> On Wed, Feb 3, 2010 at 1:40 PM, Chris Bates <
> christopher.andrew.bates@gmail.com> wrote:
> > Hello all,
> >
> > I've read around about doing table indexing and it seems like there are a
> > couple of approaches on which I'd like clarification.
> >
> > What I have been doing is near-full table scans, because I'm doing a lot
> > of aggregation and statistics for our analytics project.  We have nearly
> > 7.5GB of data per day to load into HBase.  My schema has been Row:
> > Timestamp, ColFam1: col1..., ColFam2: col1...  It takes close to 5 hours
> > to load all the data I need from HDFS using MapReduce.  We are currently
> > running HBase on only 3 machines, with about 1.5 - 2GB of RAM each.
> > We're going to scale out in the next month to 2 8-core machines with
> > 30GB of RAM each.
> >
> > With this in mind, I'm now focusing on performance.  I'm working on
> > getting LZO compression enabled on all the machines, but I was more
> > curious as to the best way to index.
> >
> > It seems like there are two strategies: use the tableindexed package, or
> > roll my own, where I'd create a new table with the row IDs as the values
> > from the primary table column lookup. Then when I do a scan on the main
> > table, I would grab one value that satisfies my filters, and use that
> > value to scan over the index table to grab all the rows that satisfy it.
> >
> > Does anyone know about the performance of these two approaches, or if
> > there are others?  How does it affect loading?  I'd like to load my
> > 7.5GB of data per day in a matter of minutes, not hours, and then be
> > able to query columns in seconds, not tens of minutes.
> >
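
The "roll my own" index table described in the quoted message can be
sketched roughly as below. This is only a toy model in plain Python, not
HBase client code: SortedTable, put_with_index, lookup, the separator and
the sample rows are all made up for illustration. It assumes one common
layout, where the index row key is the indexed column value followed by the
primary row key, so that a prefix scan over the index yields the matching
primary rows.

```python
from bisect import bisect_left, insort


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> stored value

    def put(self, key, value):
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows[key]

    def scan_prefix(self, prefix):
        # Jump to the first key >= prefix, then walk while the prefix matches.
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1


SEP = "\x00"  # hypothetical separator between indexed value and primary key


def put_with_index(main, index, row_key, columns, indexed_col):
    # Every load now costs two writes: the data row and its index row.
    main.put(row_key, columns)
    index.put(columns[indexed_col] + SEP + row_key, row_key)


def lookup(main, index, value):
    # Prefix-scan the index for the value, then fetch each primary row.
    return [(rk, main.get(rk)) for _, rk in index.scan_prefix(value + SEP)]


main, index = SortedTable(), SortedTable()
put_with_index(main, index, "20100203T1340", {"cf1:page": "/home"}, "cf1:page")
put_with_index(main, index, "20100203T1345", {"cf1:page": "/about"}, "cf1:page")
put_with_index(main, index, "20100203T1350", {"cf1:page": "/home"}, "cf1:page")

matches = lookup(main, index, "/home")
print([row_key for row_key, _ in matches])
```

The trade-off this makes visible is the one asked about: lookups by value
no longer touch every row of the main table, but each indexed column adds
an extra write per row at load time.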
