hbase-user mailing list archives

From "Murali Krishna. P" <muralikpb...@yahoo.com>
Subject Re: HBase secondary index performance
Date Sun, 05 Sep 2010 05:17:54 GMT
        Thanks for the detailed explanation. I like the idea of the timestamp
check; it will be good enough for us, and I can add a periodic MR cleaner.
However, I need some help understanding the 30K rows/s that was claimed. With
the IndexedTable approach I got only 1200 rows/s (60 rows/s x 20 index columns).
I understand that IndexedTable does additional reads, but the 25x
improvement that you got is very impressive. Can you please help me
understand this gain? (My hardware is 8GB/7.2rpm/2core-2GHz)

Murali Krishna

From: Andrey Stepachev <octo47@gmail.com>
To: user@hbase.apache.org
Sent: Sun, 5 September, 2010 3:53:26 AM
Subject: Re: HBase secondary index performance

2010/9/3 Murali Krishna. P <muralikpbhat@yahoo.com>:

>        * custom indexing is good, but our data keeps changing every day. So, 
> indextable is the best option for us

With custom indexing you can use timestamps to check that an index
record is still valid (or even simply recheck the existence of the value).
You also need regular index cleanup (an MR job or some custom application).

To index a row identified by 'key' and holding 'value', we can create an
index table whose row key is [value:key], and insert an index row every
time we insert a value. We get 30k rows/s/node this way.
To find all rows with a given 'value', we scan [value:0000, value:9999] and
collect the keys, which point to the rows containing that value.
So: scan the index, do random gets on the rows, recheck that each index
entry is still valid (check the value or the timestamp; the index timestamp
should be >= the value timestamp), and return only the valid rows. (Maybe we
can even delete on the fly when we get a negative result, to clean up stale
data automatically.)
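
The scheme above can be sketched roughly as follows. This is only an
in-memory illustration, not HBase client code: a TreeMap stands in for the
sorted index table, and the class and method names (SecondaryIndexSketch,
put, findKeys) are made up for the example. It assumes values never contain
':' and it checks both the value and the timestamp, although either check
alone may be enough in practice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a main table plus a hand-made index table.
class SecondaryIndexSketch {
    // Main-table cell: the value and the timestamp of its last write.
    static final class Cell {
        final String value;
        final long ts;
        Cell(String value, long ts) { this.value = value; this.ts = ts; }
    }

    final Map<String, Cell> mainTable = new HashMap<>();
    // Index row key is "value:key"; the stored long is the index timestamp.
    // TreeMap keeps keys sorted, like row keys in an HBase table.
    final TreeMap<String, Long> indexTable = new TreeMap<>();

    // Write the row and its index row. Older index rows are not cleaned up here.
    void put(String key, String value, long ts) {
        mainTable.put(key, new Cell(value, ts));
        indexTable.put(value + ":" + key, ts);
    }

    // Scan the index range for 'value', recheck each hit against the main
    // table, return only valid keys, and delete stale index rows on the fly.
    List<String> findKeys(String value) {
        String start = value + ":";
        String stop = value + ";"; // ';' is the character right after ':'
        List<String> valid = new ArrayList<>();
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : indexTable.subMap(start, stop).entrySet()) {
            String key = e.getKey().substring(start.length());
            Cell cell = mainTable.get(key); // the "random get" on the main table
            boolean ok = cell != null
                    && cell.value.equals(value)
                    && e.getValue() >= cell.ts; // index ts >= value ts
            if (ok) {
                valid.add(key);
            } else {
                stale.add(e.getKey());
            }
        }
        for (String indexKey : stale) {
            indexTable.remove(indexKey);
        }
        return valid;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch s = new SecondaryIndexSketch();
        s.put("row1", "red", 1);
        s.put("row2", "red", 2);
        s.put("row1", "blue", 3); // leaves "red:row1" behind as a stale index row
        System.out.println(s.findKeys("red")); // prints [row2]
    }
}
```

Writes stay cheap because an update only appends a new index row; the cost
of stale entries is paid at read time and by the periodic cleanup.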

>        * Just added one more regionserver and it did not help. Actually it went 
> to 60/s for some strange reason(with one client). The requests in the hbase ui
> is not uniform across 2 region servers. One server is doing around 2000 and 
> other 500. Probably once the region gets split and when we have lots of data,
> writes will improve ? (Now it is just writing to one region for the main 

It looks like all the data goes to one region server. Try to make the writes
more random (maybe make the key a random UUID, or use some other key
randomization technique).
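
One possible randomization technique is to prefix each key with a small
bucket number derived from a hash of the key, so that consecutive keys land
in different key ranges. A minimal sketch, assuming at most 100 buckets and
a "NN-key" layout; the class name SaltedKeySketch, the method saltedKey and
the key format are illustrative only:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: spreads keys over 'buckets' key ranges.
class SaltedKeySketch {
    // Returns "NN-key", where NN is a bucket derived from the MD5 of the key.
    // The mapping is deterministic, so a reader can recompute the prefix.
    static String saltedKey(String key, int buckets) {
        if (buckets < 1 || buckets > 100) {
            throw new IllegalArgumentException("buckets must be in 1..100");
        }
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            int bucket = (digest[0] & 0xff) % buckets; // first hash byte as unsigned
            return String.format("%02d-%s", bucket, key);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String key : new String[] {"user0001", "user0002", "user0003"}) {
            System.out.println(saltedKey(key, 16));
        }
    }
}
```

Unlike a random UUID, a hash-derived prefix keeps point lookups by the
original key possible. The trade-off is that a range scan over the original
keys has to be run once per bucket, and the table probably needs regions
split along the bucket prefixes before the writes actually spread out.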

>        * Is there some way to do bulk load the indexedtable? Earlier I have
> used the bulk loader tool (mapreduce job which creates the regions offline)
> but not whether it works with indexed table.

Not sure, but you can look at the source code and try to emulate the
indexing operations in your own code after a regular bulk load.

>  Thanks,
> Murali Krishna

