hbase-user mailing list archives

From "Hegner, Travis" <THeg...@trilliumit.com>
Subject RE: Secondary Index versus Full Table Scan
Date Tue, 03 Aug 2010 18:06:50 GMT
Going out on a limb, I think it will perform MUCH faster with multiple copies, since the data
is already sitting in each mapper's memory, ready to be accessed locally. The processing time
per mapper should drop dramatically. With that in mind, you only have to scale up as disk
space requires, and disk space is cheap.

With your current method, adding three more identical data nodes is only going to cut your
time in half. So unless you have the budget for the number of machines required, it's at
least worth trying multiple copies; that only costs you time.
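
To put rough numbers on the scale-out option, here is a minimal sketch that assumes
perfectly linear scaling. The baseline of 3 nodes and the 60-minute scan are illustrative
assumptions, not measurements from this thread, and scan_time is a hypothetical helper.

```python
def scan_time(baseline_minutes, baseline_nodes, nodes):
    """Estimated scan time, assuming perfectly linear scale-out (an idealization)."""
    return baseline_minutes * baseline_nodes / nodes

baseline_nodes = 3        # assumed cluster size; not stated in the thread
baseline_minutes = 60.0   # assumed scan time; illustrative only

# Each doubling of the cluster halves the estimated time, so big speedups
# need many more machines.
for nodes in (3, 6, 12, 24):
    print(nodes, "nodes:", scan_time(baseline_minutes, baseline_nodes, nodes), "minutes")
```

Under these assumptions, doubling the hardware buys only a 2x speedup, which is why a
narrower scan over a differently sorted copy may be the cheaper experiment.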


Travis Hegner

-----Original Message-----
From: Luke Forehand [mailto:luke.forehand@networkedinsights.com]
Sent: Tuesday, August 03, 2010 12:37 PM
To: user@hbase.apache.org
Subject: Re: Secondary Index versus Full Table Scan

Edward Capriolo <edlinuxguru@...> writes:

> Generally speaking: If you are doing full range scans of a table
> indexes will not help. Adding indexes will make the performance worse,
> it will take longer to load your data and now fetching the data will
> involve two lookups instead of one.
> If you are doing full range scans adding more nodes should result in
> linear scale up.


Can you clarify what "full range scan" means?  I am not doing "full" range
scans, but I am doing relatively large range scans (3 million records), so I
think what you are saying applies.  Thanks for the insight.

We initially implemented the secondary index out of a need to have our main data
sorted by multiple dimensions for various use cases.  Now I'm thinking it may be
better to have multiple copies of our main data, sorted in multiple ways, to
avoid the two lookups.  So I'm faced with two options right now: multiple copies
of the data sorted in multiple ways to do range scans, or buy a lot more servers
and do full scans.  Given these two choices, do people have general
recommendations on which makes the most sense?
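
The difference between the two access patterns discussed above can be sketched in plain
Python. This is an in-memory analogy, not HBase client code: the records, the field names,
and both helper functions are hypothetical, and sorted lists stand in for tables sorted by
row key.

```python
from bisect import bisect_left, bisect_right

# Hypothetical records: (record_id, timestamp, topic)
records = [
    ("r1", 105, "sports"),
    ("r2", 101, "music"),
    ("r3", 103, "sports"),
    ("r4", 102, "news"),
]

# Option A: main table keyed by record_id, plus a secondary index sorted by topic.
main_by_id = {rid: (rid, ts, topic) for rid, ts, topic in records}
index_by_topic = sorted((topic, rid) for rid, ts, topic in records)

def scan_via_index(topic):
    """Range scan over the index, then one point lookup per hit (two lookups)."""
    lo = bisect_left(index_by_topic, (topic, ""))
    hi = bisect_right(index_by_topic, (topic, "\uffff"))
    return [main_by_id[rid] for _, rid in index_by_topic[lo:hi]]

# Option B: a second full copy of the data, sorted by (topic, record_id).
copy_by_topic = sorted((topic, rid, ts) for rid, ts, topic in records)

def scan_copy(topic):
    """Single contiguous range scan; the rows already carry all their fields."""
    lo = bisect_left(copy_by_topic, (topic, "", 0))
    hi = bisect_right(copy_by_topic, (topic, "\uffff", 0))
    return [(rid, ts, t) for t, rid, ts in copy_by_topic[lo:hi]]

print(scan_via_index("sports"))
print(scan_copy("sports"))
```

Both functions return the same rows. Option B trades extra storage and extra write work for
reads that touch one sorted range instead of an index range plus a lookup per row.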


