hbase-user mailing list archives

From "Murali Krishna. P" <muralikpb...@yahoo.com>
Subject Re: HBase secondary index performance
Date Mon, 06 Sep 2010 05:02:05 GMT
Hi,
   My row size is around 300 bytes, with 20 columns in total. I tried the custom
indexing with the WAL write disabled. I currently have only 2 tables: the main
table and a single index table shared by all 20 indexes. The key of the index
table is columnValue+columnName+rowKey.
I am getting around 500 inserts/second now (i.e., a total of ~10K puts per
second). This is probably comparable with your numbers, given the data size.
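
For illustration, the columnValue+columnName+rowKey layout could be sketched as below. This is a minimal in-memory sketch in Python, not HBase client code. The NUL separator, the helper names (index_key, prefix_stop, lookup), and the sorted list standing in for an ordered index table are all assumptions made for the example. The lookup also applies the recheck suggested earlier in this thread: an index entry counts only if the main row still holds the indexed value.

```python
import bisect

# Assumed separator; it must not occur inside values or column names.
SEP = b"\x00"


def index_key(column_value: bytes, column_name: bytes, row_key: bytes) -> bytes:
    """Compose an index-table key as columnValue + columnName + rowKey."""
    return column_value + SEP + column_name + SEP + row_key


def prefix_stop(prefix: bytes) -> bytes:
    """Return the smallest key that sorts after every key starting with prefix."""
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        raise ValueError("prefix has no upper bound")
    return stripped[:-1] + bytes([stripped[-1] + 1])


def lookup(index_keys, main_table, column_name, column_value):
    """Find row keys via the index, dropping stale index entries.

    index_keys: sorted list of keys (stands in for a range scan on the index).
    main_table: dict of row_key -> dict of column_name -> value.
    """
    prefix = column_value + SEP + column_name + SEP
    lo = bisect.bisect_left(index_keys, prefix)
    hi = bisect.bisect_left(index_keys, prefix_stop(prefix))
    rows = []
    for key in index_keys[lo:hi]:
        row_key = key[len(prefix):]
        # Recheck against the main row: keep the entry only if still valid.
        if main_table.get(row_key, {}).get(column_name) == column_value:
            rows.append(row_key)
    return rows


main = {b"r1": {b"city": b"paris"}, b"r2": {b"city": b"rome"}}
index = sorted([
    index_key(b"paris", b"city", b"r1"),
    index_key(b"paris", b"city", b"r2"),  # stale: r2 now holds "rome"
    index_key(b"rome", b"city", b"r2"),
])
print(lookup(index, main, b"city", b"paris"))
```

With this layout every main-table row produces one index entry per indexed column, which is where the 20 index puts per inserted row come from.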
  I have some questions about the HBase write implementation.
* Is this the maximum that we can achieve with any number of region servers? Why
does adding region servers not improve the write performance? Is it because,
when the data doesn't yet exist in the table, it always writes to one region?

* Probably writing to an existing, well-distributed table might give better
performance, since the writes will be spread across machines? In that case, if
we have multiple tables (one per index), will it be better during the initial
write itself (since each table has a different region)?
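
As background to the two questions above: a newly created HBase table starts with a single region unless it is pre-split, and each region covers a contiguous range of row keys. The sketch below is a toy simulation in Python, not HBase code. The split points, the helper names (region_for, salted), and the one-byte SHA-256 prefix are assumptions chosen for illustration. It shows how sequential keys pile up in one key range while prefixed keys spread over several ranges.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical split points dividing the key space into 4 regions.
SPLITS = [b"\x40", b"\x80", b"\xc0"]


def region_for(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(SPLITS, row_key)


def salted(row_key: bytes) -> bytes:
    """Prefix the key with one hash byte so that similar keys scatter."""
    return hashlib.sha256(row_key).digest()[:1] + row_key


keys = [b"row%08d" % i for i in range(10000)]

plain = Counter(region_for(k) for k in keys)
spread = Counter(region_for(salted(k)) for k in keys)

print("sequential keys:", dict(plain))
print("salted keys:    ", dict(spread))
```

The trade-off is that a prefixed key no longer sorts by its original value, so a range scan over the original key order has to be issued once per prefix.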

 Thanks,
Murali Krishna




________________________________
From: Andrey Stepachev <octo47@gmail.com>
To: user@hbase.apache.org
Sent: Sun, 5 September, 2010 11:54:45 PM
Subject: Re: HBase secondary index performance

2010/9/5 Murali Krishna. P <muralikpbhat@yahoo.com>:
> Hi,
>        Thanks for the detailed explanation. I liked the idea of the timestamp
> check; this will be good enough for us, and I can add a periodic MR cleaner.
> However, I need some help in understanding the 30K number that was claimed.

The real insert rate will depend on the size of the row, the size of the write
buffer, etc. With a simple row holding one long per row, I got 30k
requests/second (as shown in HBase).
With JSON-serialised objects of 100-700 bytes each, plus validation, I can
insert 2-6k objects per second.

> With the IndexedTable approach, I got only 1200 rows/s (60 rows/s X 20 index
> columns).
> I understood that there are additional reads that IndexedTable does, but the
> 25X improvement that you got is very impressive. Can you please help me to
> understand this gain? (My hardware is 8GB/7.2K rpm/2core-2GHz)

Did you try inserting data into a non-indexed region (with the indexed-tables
extension disabled)?
What numbers did you get?

>
>  Thanks,
> Murali Krishna
>
>
>
>
> ________________________________
> From: Andrey Stepachev <octo47@gmail.com>
> To: user@hbase.apache.org
> Sent: Sun, 5 September, 2010 3:53:26 AM
> Subject: Re: HBase secondary index performance
>
> 2010/9/3 Murali Krishna. P <muralikpbhat@yahoo.com>:
>
>>        * Custom indexing is good, but our data keeps changing every day. So
>> IndexedTable is probably the best option for us.
>
> In the case of custom indexing you can use timestamps to check that an index
> record is still valid
> (or even simply recheck the existence of the value).
> You also need regular index cleanup (an MR job or some custom application).
>
> To index a row identified by 'key' and having 'value', we can create an
> index table whose key is [value:key], and insert a row into it every time
> we insert our values. We get 30k rows/s/node.
> When we want to find all rows with 'value', we scan [value:0000, value:9999]
> and find all keys that point to rows containing the value.
> We scan the index, do random gets of the rows, and recheck that each index
> entry is still valid (check the value or the timestamp; the index timestamp
> should be >= the value timestamp), and return only valid values (maybe we
> can even delete on the fly when we get a negative result, to automatically
> clean up stale data).
>
>
>>        * Just added one more regionserver and it did not help. Actually it
>> went back to 60/s for some strange reason (with one client). The requests in
>> the HBase UI are not uniform across the 2 region servers. One server is doing
>> around 2000 and the other 500. Probably once the region gets split and we
>> have lots of data, writes will improve? (Now it is just writing to one
>> region for the main table)
>
> Looks like all data goes to one region server. Try to make the writes more
> random (maybe make the key a random UUID, or use some other key randomization
> technique).
>
>>        * Is there some way to bulk load the IndexedTable? Earlier I have
>> used the bulk loader tool (a MapReduce job which creates the regions
>> offline) but I am not sure whether it works with an indexed table.
>
> Not sure, but you can look at the source code and try to emulate the indexing
> operations in your code after regular bulk loading.
>
>>
>>
>>  Thanks,
>> Murali Krishna
>>
>>
>
> Andrey.
>
