hbase-user mailing list archives

From TuX RaceR <tuxrace...@gmail.com>
Subject Re: random access and hotspots
Date Thu, 11 Mar 2010 13:37:00 GMT
Thanks Alex for your answer.

I am not yet at a stage where I can measure performance (I am still 
at the db design stage, initial population), but my understanding was 
that randomizing the keys was a way of avoiding key hotspots.
To simplify, let's assume that I have documents attached to users that I 
need to search by date.
I have two tables: one, "Random", optimized for random access, and one, 
"Indexes", optimized for sequential access with scanners.

'Random' stores document details:
Random:
doc_1-> Title:"some title1",Text:"some longer text1",user:1,CreateDate:2010-01-01
doc_2-> Title:"some title2",Text:"some longer text2",user:1,CreateDate:2010-01-02
....

'Indexes' stores document indexes (for instance here is an index on date 
and date+user):
date_20100101:id:1
date_20100102:id:2
...
date_user1_20100101:id:1
date_user1_20100102:id:2
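
A minimal sketch of how index row keys like the ones above could be built, assuming the date part is written as yyyyMMdd so that string order matches date order. The class and method names (IndexKeys, byDate, byUserAndDate) are hypothetical and not part of any HBase API:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class IndexKeys {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, so keys sort
    // chronologically when compared as plain strings.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    // Key for the index on date only (hypothetical layout).
    static String byDate(LocalDate created) {
        return "date_" + DAY.format(created);
    }

    // Key for the index on user + date (hypothetical layout).
    static String byUserAndDate(long userId, LocalDate created) {
        return "date_user" + userId + "_" + DAY.format(created);
    }

    public static void main(String[] args) {
        System.out.println(byDate(LocalDate.of(2010, 1, 1)));
        System.out.println(byUserAndDate(1, LocalDate.of(2010, 1, 2)));
    }
}
```

With this layout, a scanner that starts at the prefix "date_user1_" walks that user's entries in date order.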


As a user typically adds many documents in a short period of time, the 
documents returned by the scanner are usually also stored in the same 
order in the Random table (when its keys are not randomized).
So, once I get the IDs of the documents from the scanner query, I need 
to fork concurrent threads/processes to get the document details: that 
(from what I understand) would create a key hotspot in the 'Random' table.
Is my reasoning above correct? My feeling is that a typical HBase 
application alternates between scanner and random access patterns.
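
For illustration, here is a rough sketch of what "randomizing" a key of the 'Random' table might look like: a short MD5-derived prefix is put in front of the document ID, so that consecutive IDs no longer sort next to each other. The 8-character prefix, the "_" separator and the names HashedRowKey and rowKey are assumptions made for this example only. Whether this is needed at all depends on the workload, as Alex's reply below points out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashedRowKey {
    // Returns the 16 raw MD5 bytes of the document ID.
    static byte[] md5(String docId) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(docId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Lower-case hex rendering of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Hypothetical key layout: 8 hex chars of the hash, then the original ID,
    // so the ID can still be read back from the key.
    static String rowKey(String docId) {
        return hex(md5(docId)).substring(0, 8) + "_" + docId;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("doc_1"));
        System.out.println(rowKey("doc_2"));
    }
}
```

The raw bytes returned by md5() could also be used directly as the prefix instead of the hex string, which halves its length.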

Another question I have, until I can test this, is how many random 
lookups HBase will withstand. The scanner will present links to the 
documents (paging implementation), so I am not sure what a realistic 
number of documents per page could be: 10, 20 or 100? As (at least) one 
new socket (is that true?) is created for each random access request, I 
am afraid such a design could bring the HBase layer down (until maybe 
http://issues.apache.org/jira/browse/HBASE-1845 is fixed).
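
One way to keep the number of simultaneous lookups per page under control, whatever the page size turns out to be, is a fixed-size thread pool. The sketch below uses only the JDK; the `lookup` function is a hypothetical stand-in for a single random read and is not an HBase call, and the names PageFetcher and fetchPage are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class PageFetcher {
    // Fetches the details for one page of IDs with at most maxParallel
    // lookups running at the same time. Results keep the order of ids.
    static <T> List<T> fetchPage(List<String> ids, int maxParallel,
                                 Function<String, T> lookup) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<T>> pending = new ArrayList<>();
            for (String id : ids) {
                Callable<T> task = () -> lookup.apply(id);
                pending.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>(ids.size());
            for (Future<T> f : pending) {
                results.add(f.get()); // blocks until that lookup is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For a page of 20 documents and maxParallel = 4, at most four lookups are in flight at any moment; the right values would have to come from measurement.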

Thanks
TuX



Alex Baranov wrote:
> Hello Tux,
>
> Accessing a table in a "random access" manner is not the reason for
> randomizing keys. You will likely need to randomize your keys only for
> better performance when importing an existing large dataset into HBase.
> Otherwise, if you don't have an insertion rate higher than 20K records/sec, I
> wouldn't suggest you think about this issue. It would be great if you
> could tell us more about your use case.
>
> MD5, SHA-1 or Jenkins Hash (in org.apache.hadoop.hbase.util.JenkinsHash) are
> all mechanisms you might consider.
>
> Alex Baranau
>
> sematext.com
> http://en.wordpress.com/tag/hadoop-ecosystem-digest/
>
> On Thu, Mar 11, 2010 at 12:07 PM, TuX RaceR <tuxracer69@gmail.com> wrote:
>
>   
>> Hello List,
>>
>> I'll be accessing a table mainly in random access and I am looking for an
>> efficient way of randomizing the keys.
>> I thought about an MD5 hash of the ID of the record, but as MD5 returns a
>> string of chars [0-9A-F] I was wondering if there was a better method to
>> use.
>>
>> Thanks
>> TuX
>>
>>     
>
>   

