hbase-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Secondary Index versus Full Table Scan
Date Tue, 03 Aug 2010 16:14:51 GMT
On Tue, Aug 3, 2010 at 11:40 AM, Luke Forehand
<luke.forehand@networkedinsights.com> wrote:
> Thanks to the help of people on this mailing list and Cloudera, our team has
> managed to get our 3 data node cluster with HBase running like a top.  Our
> import rate is now around 3 GB per job which takes about 10 minutes.  This is
> great.  Now we are trying to tackle reading.
>
> With our current setup, a map reduce job with 24 mappers performing a full table
> scan of ~150 million records takes ~1 hour.  This won't work for our use case,
> because not only are we continuing to add more data to this table, but we are
> asking many more questions in a day.  To increase performance, the first thought
> was to use a secondary index table, and do range scans of the secondary index
> table, iteratively performing GET operations of the master table.
>
> In testing the average GET operation took 37 milliseconds.  At that rate with 24
> mappers it would take ~1.5 hours to scan 3 million rows.  This still seems like
> a lot of time.  37 milliseconds per GET is nice for "real time" access from a
> client, but not during massive GETs of data in a map reduce job.
>
> My question is, does it make sense to use secondary index tables in a map reduce
> job of this scale?  Should we not be using HBase for input in these map reduce
> jobs and go with raw SequenceFile?  Do we simply need more nodes?
>
> Here are the specs for each of our 3 data nodes:
> 2x CPU (2.5 GHZ nehalem ep quad core)
> 24 GB RAM (4gb / region server )
> 4x 1tb hard drives
>
> Region size: 1GB
>
> Thanks,
>
> Luke Forehand
> Software Engineer
> http://www.networkedinsights.com
>
>

Generally speaking, if you are doing full range scans of a table,
indexes will not help. Adding indexes will make performance worse:
loading your data will take longer, and fetching it will now involve
two lookups instead of one.

If you are doing full range scans, adding more nodes should result in
linear scale-up.
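
To put rough numbers on the trade-off, here is a back-of-envelope sketch
using only the figures quoted above (150 million rows, ~1 hour full scan,
24 mappers, 37 ms per GET). It assumes the GETs are spread evenly across
the mappers and issued one at a time, and it ignores the cost of scanning
the index table itself, so treat it as an estimate and not a benchmark.
The function names are made up for illustration.

```python
# Back-of-envelope comparison using the figures quoted in this thread.
TOTAL_ROWS = 150_000_000      # rows in the table (from the original post)
FULL_SCAN_SECONDS = 3600      # ~1 hour for a full scan with 24 mappers
MAPPERS = 24
GET_SECONDS = 0.037           # measured average latency per GET


def get_path_seconds(rows, mappers=MAPPERS, get_seconds=GET_SECONDS):
    """Wall-clock time to fetch `rows` rows by individual GETs.

    Assumes GETs are split evenly over the mappers and run serially
    within each mapper; the index range scan itself is not counted.
    """
    return rows * get_seconds / mappers


def break_even_rows(scan_seconds=FULL_SCAN_SECONDS, mappers=MAPPERS,
                    get_seconds=GET_SECONDS):
    """Row count at which GET-per-row takes as long as one full scan."""
    return scan_seconds * mappers / get_seconds


if __name__ == "__main__":
    hours = get_path_seconds(3_000_000) / 3600
    rows = break_even_rows()
    print(f"3M GETs: {hours:.2f} hours")
    print(f"break-even: {rows:,.0f} rows ({rows / TOTAL_ROWS:.2%} of table)")
```

With these inputs, 3 million GETs come to roughly 1.3 hours, in the same
range as the ~1.5 hours estimated above, and the break-even point is near
2.3 million rows, or about 1.5% of the table. Under these assumptions the
index only pays off when a query touches a smaller fraction than that.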
