hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: IHBase indexes persistence
Date Sat, 20 Mar 2010 23:08:10 GMT
Another way to think about it is that IHBase helps when the data is
neither dense (i.e. every row has the column you may be looking for)
nor sparse (where one column in millions or billions matches), but
somewhere in between: that sweet spot where you might return anywhere
between 10-30% of the rows from a region.

Of course, these are just suggestions and recommendations, not hard
and fast rules.

You might also want to look at THBase - it uses a transactional
add-on to maintain a secondary index (i.e. another table that is an
index of a primary table).  It has different performance
characteristics (one write is translated into many writes and
involves an extra RPC), but it is an option to consider.

Finally, you can always maintain secondary indexes yourself in your
app.  Write and update two tables (the primary and the index).  This
is obviously less integrated and less simple, but it also works.
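That app-maintained pattern can be sketched schematically like this
(plain Python dicts stand in for the two HBase tables; the table and
column names are made up for illustration - this is not the HBase
client API):

```python
# Schematic sketch of an app-maintained secondary index: two "tables"
# (plain dicts standing in for HBase tables), with the application
# responsible for keeping the index row in sync with the primary row.

users = {}        # primary table: username -> row data
email_index = {}  # index table:   email -> username

def put_user(username, email, name, password):
    """Write the primary row, then the index row (two writes, app-managed)."""
    # If the email changes, the stale index entry must be cleaned up too.
    old = users.get(username)
    if old and old["email"] != email:
        email_index.pop(old["email"], None)
    users[username] = {"email": email, "name": name, "password": password}
    email_index[email] = username

def get_user_by_email(email):
    """Two point gets: index table first, then the primary table."""
    username = email_index.get(email)
    return users.get(username) if username else None
```

Note that the two writes are not atomic - a crash between them leaves
the index and primary table inconsistent, which is exactly the gap a
transactional approach like THBase's is meant to close.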


On Sat, Mar 20, 2010 at 4:00 PM, Dan Washusen <dan@reactive.org> wrote:
> Hey Tux,
> I've put some comments inline...
> On 21 March 2010 09:13, TuX RaceR <tuxracer69@gmail.com> wrote:
>> Hello Hbase user List!
>> The feature provided by IHbase is very appealing. It seems to correspond to
>> a use case very common in applications (at least in mine ;) )
> The functionality of IHBase might not be as useful as you think.  Take
> the following very basic user table layout:
> username (key) | email | name | password
> That table layout works great when you want to find a user by
> username, for example, when the user logs in.  You can simply do a get
> on the table with the username.  Now you need to add functionality to
> enable a user to retrieve their forgotten password.  The seemingly
> obvious solution with IHBase would be to add a secondary index to the
> email column.  You could then perform a scan on the table with the
> appropriate index hint to fetch the user by their email address.  That
> solution would work while your dataset is small (one or two regions),
> but as your dataset grows and spans many hundreds of regions it's no
> longer a viable option.  The reason is that IHBase maintains its index
> on the email column per region.  To find the row with the email
> address you are looking for, the scan must contact every region.  The
> scan would still return reasonably quickly (say each region responded
> in a few milliseconds), but it's still far too resource-intensive...
> The way to make scans fast in HBase is to provide a start row and stop
> row and the same rule applies to IHBase.  It's just that with IHBase
> the scan will return much faster if the start and stop rows span a
> large range...
>> Dan Washusen wrote:
>>> Not at the moment.  It currently keeps a copy of each unique indexed
>>> value and each row key in memory...
>> Is there a more robust indexing on the roadmap?
> In IHbase yes, but probably not soon.
>> HBase, if I understand well, provides an open-source version of Google's
>> Bigtable.
>> To me the most striking difference between HBase and Bigtable is in
>> narrowing searches; the example below shows what I mean by narrowing:
>> If in Google you search for the word
>> hbase:
>> (i.e using:
>> http://www.google.com/search?q=hbase
>> )
>> you get a fast answer
>> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
>> seconds))
>> Now if you search all pages coming from the hadoop.apache.org host name (or
>> base URL), that is with the query:
>> hbase +site:hadoop.apache.org
>> (i.e using the URL:
>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>> )
>> you get a pretty fast answer to:
>> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* for
>> *hbase*. (*0.12* seconds) )
>> It seems to me that the second search uses a secondary index on a column
>> named 'site' to narrow the scan of the 'hbase' keys. Obviously Google found
>> a good way to implement this (good = fast and scalable).
>> Is this secondary indexing of Google's documented somewhere? Is it
>> implemented using something like IHBase, or more like THBase, or something
>> else?
> What Ryan said.
>> Also, why does IHBase stay in the 'contrib' tree? Is that because the code
>> is not at the same level as the main HBase code (not as tested, not as
>> robust, etc.)?
> IHBase is still very young (it was first released along with 0.20.3).
> As you can see from this email thread, it's not as robust as it should
> be... :)
>> Thanks
>> TuX
