hbase-user mailing list archives

From Chris Bates <christopher.andrew.ba...@gmail.com>
Subject Re: Some performance questions about Indexing and Table schemas
Date Wed, 03 Feb 2010 19:27:32 GMT
Also worth noting: I followed these guidelines (
http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html)
and did not observe the same speedup. I'm not sure what kind of hardware is
needed to reach those numbers.

On Wed, Feb 3, 2010 at 1:40 PM, Chris Bates <
christopher.andrew.bates@gmail.com> wrote:

> Hello all,
>
> I've been reading about table indexing, and there seem to be a couple of
> approaches on which I'd like clarification.
>
> What I have been doing is near-full table scans, because I'm doing a lot of
> aggregation and statistics for our analytics project.  We have nearly 7.5 GB
> of data per day to load into HBase.  My schema has been Row: Timestamp,
> ColFam1: col1...  , ColFam2: col1 ....  It takes close to 5 hours
> to load all the data I need from HDFS via MapReduce.  We are currently only
> running HBase on 3 machines, with about 1.5 - 2 GB of RAM each.  We're going
> to scale out in the next month to two 8-core machines with 30 GB of RAM each.
>
> With this in mind, I'm now focusing on performance.  I'm working on getting
> LZO compression enabled on all the machines, but I was more curious as to
> the best way to index.
>
> It seems like there are two strategies: use the tableindexed package, or
> roll my own, where I'd create a new table whose row IDs are the values of
> the primary table column being looked up. Then when I do a scan on the main
> table, I would grab one value that satisfies my filters, and use that value
> to scan over the index table to grab all the rows that satisfy it.
>
> Does anyone know about the performance of these two approaches, or if there
> are others?  How does indexing affect loading?  I'd like to load my 7.5 GB of
> data per day in a matter of minutes, not hours, and then be able to query
> columns in seconds, not tens of minutes.
>
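
For readers following the thread, the "roll my own" index described above can be sketched roughly as follows. This is a toy in-memory model in Python, not HBase client code: SortedTable, put_with_index, lookup, the "ref" column, and the separator are all hypothetical names chosen for illustration. It assumes the index table's row key is the indexed column value followed by the primary row key, so that a prefix scan on the value returns every matching primary row.

```python
import bisect

SEP = "\x00"  # assumption: this separator never occurs inside an indexed value


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []  # row keys in sorted order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, columns):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        return self._rows.get(row_key)

    def prefix_scan(self, prefix):
        """Yield (row_key, columns) for keys starting with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._rows[key]


def put_with_index(primary, index, row_key, columns, indexed_column):
    """Write the row, then write one index entry pointing back at it.

    Stale index entries are not cleaned up if the indexed value later changes.
    """
    primary.put(row_key, columns)
    value = columns.get(indexed_column)
    if value is not None:
        index.put(value + SEP + row_key, {"ref": row_key})


def lookup(primary, index, value):
    """Return (row_key, columns) for every primary row whose indexed column equals value."""
    return [
        (cols["ref"], primary.get(cols["ref"]))
        for _, cols in index.prefix_scan(value + SEP)
    ]
```

In this sketch every indexed write becomes two puts, which is one way a hand-rolled index adds to load time; in exchange, a lookup touches only the matching index rows instead of scanning the whole primary table.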
