hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: IHBase indexes persistence
Date Sat, 20 Mar 2010 23:26:24 GMT
So it will be difficult to make generic index joins that are
efficient.  Since the index and the main table reside on different
machines, the RPC calls involved can quickly destroy any notion of
doing these kinds of things fast.  Denormalization, i.e. copying data
into other tables for faster access, is the likely candidate.
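
A minimal sketch of that denormalized approach, using the user table
layout quoted further down (username as key, plus email and name).  The
two dicts are in-memory stand-ins for HBase tables, not a real client,
and the names users_by_email, put_user and get_user_by_email are made up
for illustration:

```python
# In-memory stand-ins for two HBase tables; plain dicts, not a real client.
users = {}           # primary table: row key = username
users_by_email = {}  # index table:   row key = email, value = username


def put_user(username, email, name):
    """Write the primary row and the index row (two separate writes)."""
    old = users.get(username)
    if old is not None and old["email"] != email:
        # Drop the stale index entry when the email changes.
        users_by_email.pop(old["email"], None)
    users[username] = {"email": email, "name": name}
    users_by_email[email] = username


def get_user_by_email(email):
    """Two point lookups: the index row first, then the primary row."""
    username = users_by_email.get(email)
    if username is None:
        return None
    return users.get(username)


put_user("tux", "tux@example.org", "TuX")
put_user("tux", "tux@example.com", "TuX")  # email change rewrites the index
```

A lookup by email then costs two gets on known row keys instead of a
scan that touches every region.  The trade-off is that the two writes
in put_user are not atomic in this sketch, so the application has to
cope with the tables getting out of sync.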

Another thing to ask yourself is - should my highly relational data
belong in a non-relational data store?  If you have small amounts of
highly relational data, and big stores of non-relational data, perhaps a
hybrid approach might be appropriate?

-ryan

On Sat, Mar 20, 2010 at 4:23 PM, Andrey Kolyadenko <crypto5@mailx.ru> wrote:
> The problem comes when you are trying to filter based on a number of columns
> in OLAP-like queries (i.e. you want to retrieve the count of transactions for
> some customer and some date range). It's not so easy to implement such logic
> efficiently; some index join algorithm would have to be implemented, and
> since HBase is supposed to deal with very large data sets, it could be tricky.
>
> On Sat, 20 Mar 2010 16:08:10 -0700
>  Ryan Rawson <ryanobjc@gmail.com> wrote:
>>
>> Another way to think about it is that IHBase helps when the data is
>> not dense (ie: every row has the column you may be looking for), and
>> not sparse (where 1 column in millions or billions match) but
>> somewhere in between.  The sweet spot is where you might return
>> anywhere between 10-30% of the rows from a region.
>>
>> Of course these are just suggestions and recommendations not hard and
>> fast rules.
>>
>> You might also want to look at THBase - it uses a transactional add-on
>> to maintain a secondary index (ie: another table that is an index of a
>> primary table).  It has different performance characteristics (one
>> write is translated into many writes and involves an RPC), but it is
>> an option to consider.
>>
>> Finally, you can always maintain secondary indexes by yourself in your
>> app.  Write and update 2 tables (the primary, the index).  This is
>> obviously less integrated and simple but also works.
>>
>> -ryan
>>
>> On Sat, Mar 20, 2010 at 4:00 PM, Dan Washusen <dan@reactive.org> wrote:
>>>
>>> Hey Tux,
>>> I've put some comments inline...
>>>
>>> On 21 March 2010 09:13, TuX RaceR <tuxracer69@gmail.com> wrote:
>>>>
>>>> Hello Hbase user List!
>>>>
>>>> The feature provided by IHbase is very appealing. It seems to correspond to
>>>> a use case very common in applications (at least in mine ;) )
>>>
>>> The functionality of IHBase might not be as useful as you think.  Take
>>> the following very basic user table layout:
>>>
>>> username (key) | email | name | password
>>>
>>> That table layout works great when you want to find a user by
>>> username, for example, when the user logs in.  You can simply do a get
>>> on the table with the username.  Now you need to add functionality to
>>> enable a user to retrieve their forgotten password.  The seemingly
>>> obvious solution with IHBase would be to add a secondary index to the
>>> email column.  You could then perform a scan on the table with the
>>> appropriate index hint to fetch the user by their email address.  That
>>> solution would work while your dataset is small (one or two regions)
>>> but as your dataset grows and spans many hundreds of regions it's no
>>> longer a viable option.  The reason it's not a viable option is that
>>> IHbase maintains an index on the email column per region.  In order to
>>> find the row that has the email address you are looking for the scan
>>> must contact every region.  The scan would still return reasonably
>>> quickly (say each region responded in a few milliseconds) but it's
>>> still far too resource intensive...
>>>
>>> The way to make scans fast in HBase is to provide a start row and stop
>>> row and the same rule applies to IHBase.  It's just that with IHBase
>>> the scan will return much faster if the start and stop rows span a
>>> large range...
>>>
>>>>
>>>> Dan Washusen wrote:
>>>>>
>>>>> Not at the moment.  It currently keeps a copy of each unique indexed
>>>>> value and each row key in memory...
>>>>>
>>>>
>>>> Is there a more robust indexing on the roadmap?
>>>
>>> In IHbase yes, but probably not soon.
>>>
>>>> HBase, if I understand correctly, offers an open-source version of Google
>>>> Bigtable.
>>>> To me the most striking difference between HBase and Bigtable is in
>>>> narrowing searches; the example below shows what I mean by narrowing:
>>>>
>>>> If in Google you search for the word
>>>>
>>>> hbase:
>>>>
>>>> (i.e using:
>>>> http://www.google.com/search?q=hbase
>>>> )
>>>> you get a fast answer
>>>> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
>>>> seconds))
>>>>
>>>> Now if you search all pages coming from the hadoop.apache.org host name
>>>> (or base URL), that is with the query:
>>>>
>>>> hbase +site:hadoop.apache.org
>>>>
>>>> (i.e using the URL:
>>>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>>>> )
>>>> you get a pretty fast answer too:
>>>> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org*
>>>> for *hbase*. (*0.12* seconds) )
>>>>
>>>> It seems to me that the second search uses a secondary index on a column
>>>> named 'site' to scan the 'hbase' based keys. Obviously Google found a
>>>> good way to implement this (good = fast and scalable).
>>>> Is this Google secondary indexing documented somewhere? Is it implemented
>>>> using something like IHBase, something more like THBase, or something
>>>> else?
>>>
>>> What Ryan said.
>>>
>>>> Also, why does IHBase stay in the 'contrib' tree? Is that because the
>>>> code is not at the same level as the main HBase code (not as tested,
>>>> not as robust, etc...)?
>>>
>>> IHBase is still very young (it was first released along with 0.20.3).
>>> As you can see from this email thread it's not as robust as it should
>>> be... :)
>>>
>>>>
>>>> Thanks
>>>> TuX
>>>>
>>>>
>>>
>
>
