hbase-user mailing list archives

From Luke Forehand <luke.foreh...@networkedinsights.com>
Subject Secondary Index versus Full Table Scan
Date Tue, 03 Aug 2010 15:40:49 GMT
Thanks to the help of people on this mailing list and Cloudera, our team has
managed to get our 3-data-node cluster with HBase running like a top.  Our
import rate is now around 3 GB per job, with each job taking about 10 minutes.
This is great.  Now we are trying to tackle reading.

With our current setup, a MapReduce job with 24 mappers performing a full table
scan of ~150 million records takes ~1 hour.  This won't work for our use case,
because not only are we continuing to add more data to this table, we are also
asking many more questions per day.  To improve performance, our first thought
was to use a secondary index table: do range scans of the secondary index
table, then iteratively perform GET operations against the master table.
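The access pattern I mean is roughly the following; this is a minimal sketch in Python using plain dicts in place of HBase tables (the table contents, the "timestamp|rowkey" index-key layout, and the range bounds are all hypothetical), just to illustrate the index-scan-then-GET shape:

```python
# Sketch of the secondary-index access pattern: plain dicts stand in for
# HBase tables; keys and values here are hypothetical examples.

# Master table: row key -> record
master = {
    "row-001": {"text": "alpha"},
    "row-002": {"text": "bravo"},
    "row-003": {"text": "charlie"},
}

# Secondary index table: index key ("<timestamp>|<rowkey>") -> master row key
index = {
    "20100801|row-003": "row-003",
    "20100802|row-001": "row-001",
    "20100803|row-002": "row-002",
}

def range_scan_then_get(start_key, stop_key):
    """Range-scan the index, then GET each matching row from the master table."""
    results = []
    for ikey in sorted(index):               # index rows are sorted by key
        if start_key <= ikey < stop_key:     # range predicate on the index key
            master_key = index[ikey]
            results.append(master[master_key])   # one GET per index hit
    return results

rows = range_scan_then_get("20100801", "20100803")
# rows holds the records for row-003 and row-001, in index-key order
```

The point of the sketch is that every index hit costs one random GET against the master table, which is where the per-GET latency below comes in.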

In testing, the average GET operation took 37 milliseconds.  At that rate, with 24
mappers, it would take ~1.5 hours to fetch 3 million rows.  This still seems like
a lot of time.  37 milliseconds per GET is fine for "real time" access from a
client, but not for massive GETs of data in a MapReduce job.
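As a sanity check on that estimate, here is the back-of-the-envelope arithmetic, assuming the GETs are spread evenly across the 24 mappers and are not batched:

```python
# Back-of-the-envelope check of the ~1.5 hour figure above.
rows = 3_000_000        # rows to fetch via GET
mappers = 24            # concurrent mappers
get_ms = 37             # average latency per GET, in milliseconds

gets_per_mapper = rows / mappers              # 125,000 GETs per mapper
seconds = gets_per_mapper * get_ms / 1000     # 4,625 s per mapper
hours = seconds / 3600
print(f"{hours:.2f} hours")                   # -> 1.28 hours
```

For comparison, the full scan moves 150 million rows through 24 mappers in an hour, i.e. roughly 1,700 rows/s per mapper, while 37 ms per GET is only ~27 rows/s per mapper; so the per-GET latency, not the row count, dominates here.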

My question is: does it make sense to use secondary index tables in a MapReduce
job at this scale?  Should we skip HBase as input for these MapReduce jobs and
use raw SequenceFiles instead?  Or do we simply need more nodes?

Here are the specs for each of our 3 data nodes:
2x CPU (2.5 GHz Nehalem-EP quad core)
24 GB RAM (4 GB per region server)
4x 1 TB hard drives

Region size: 1 GB


Luke Forehand
Software Engineer
