hbase-user mailing list archives

From "Pamecha, Abhishek" <apame...@x.com>
Subject RE: HBase table row key design question.
Date Tue, 02 Oct 2012 23:36:09 GMT
For 1: I wouldn't worry about that problem until it really happens. Just my opinion. If you
really want to solve it, you will need to generate a unique ID per row-key 'put' outside of
HBase [say, some hash of server IP + timestamp, etc.] and append it to the end of your row key.
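A minimal sketch of that suffix idea in plain Python, with no HBase client involved. The
names unique_suffix and make_row_key, the process-local counter, and the 8-character
truncation are all assumptions made for illustration; the counter stands in for the "etc"
so that two puts from the same server in the same millisecond still hash differently.

```python
import hashlib
import itertools

# Hypothetical process-local counter: separates puts that share a host
# and a clock tick. Not part of the original proposal.
_sequence = itertools.count()


def unique_suffix(host_ip: str, timestamp_ms: int, sequence: int) -> str:
    """Return a short hex digest of server IP + timestamp + sequence number."""
    material = f"{host_ip}|{timestamp_ms}|{sequence}".encode("utf-8")
    # Truncating to 8 hex characters keeps the key short; this length is an assumption.
    return hashlib.sha1(material).hexdigest()[:8]


def make_row_key(first: str, last: str, dob_mmddyyyy: str,
                 host_ip: str, timestamp_ms: int) -> str:
    """Build {first6}_{last6}_{MMDDYYYY}_{timestamp}_{suffix}."""
    suffix = unique_suffix(host_ip, timestamp_ms, next(_sequence))
    return "_".join([first[:6], last[:6], dob_mmddyyyy,
                     str(timestamp_ms), suffix])


# Two users with the same name, birthday, and timestamp get different suffixes.
key_a = make_row_key("Jonathan", "Smith", "01021980", "10.0.0.5", 1349221089000)
key_b = make_row_key("Jonathan", "Smith", "01021980", "10.0.0.5", 1349221089000)
```

Because the suffix sits at the end, a prefix scan on name and birthday still finds both rows.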

For 2: You can investigate Bloom filters, which can help you filter out invalid rows faster.
Also, there are ways to organize names based on phonetics. You could perhaps build a secondary
table in the background with phonetic keys as row keys.
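One way to picture the phonetic secondary table is with American Soundex, sketched below.
Soundex is only one possible phonetic scheme and is my choice for illustration, as is the
secondary key layout {last-name code}_{first-name code}_{primary row key}; the function
names are hypothetical. Putting the last-name code first would also let a scan start from
a last name alone.

```python
# Letter-to-digit table for American Soundex.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if name has no ASCII letters."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = [letters[0].upper()]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code  # vowels (and y) reset prev, so a repeated code is kept
    return "".join(result)[:4].ljust(4, "0")


def phonetic_index_key(first: str, last: str, primary_row_key: str) -> str:
    """Row key for the secondary table; the primary key keeps entries distinct."""
    return "_".join([soundex(last), soundex(first), primary_row_key])


# "Smith" and "Smyth" share a code, so both land in the same key range.
same_bucket = soundex("Smith") == soundex("Smyth")
```

A "sounds-like" lookup would then scan the secondary table by code prefix and follow the
stored primary row keys back to the user table.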


-----Original Message-----
From: Jason Huang [mailto:jason.huang@icare.com] 
Sent: Tuesday, October 02, 2012 2:38 PM
To: user@hbase.apache.org
Subject: Re: HBase table row key design question.

Thanks Mohammad.

The issue with phone numbers is that they tend to change over time, and we think name and DOB
are more reliable. SSN is more unique, but the issue is that we can't force the user to provide
it. Basically, we have limited information that can be used.



On Tue, Oct 2, 2012 at 3:30 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
> Hello Sir,
>      Although we should always try to keep the rowkey as short as 
> possible, a short key that doesn't help with faster data access is 
> also of no use. So it depends entirely on the particular use case. 
> However, in your case, how about using "phone number" as the rowkey?
> Since it is always unique, you will always get the correct result with 
> much shorter rowkey. It's just that in this case you will have to ask 
> for the user's phone number instead of name and DOB.
> Regards,
>     Mohammad Tariq
> On Tue, Oct 2, 2012 at 7:58 PM, Jason Huang <jason.huang@icare.com> wrote:
>> Hello,
>> I am designing an HBase table for users and hope to get some 
>> suggestions for my row key design. Thanks...
>> This user table will have columns that include user information such 
>> as names, birthday, gender, address, phone number, etc. The first 
>> time a user comes to us we will ask for all this information and we 
>> should generate a new row in the table with a unique row key. The next 
>> time the same user comes in we will ask for his/her name and 
>> birthday, and our application should quickly get the row(s) in the 
>> table that match the name and birthday provided.
>> Here is what I am thinking of as the row key:
>> {first 6 characters of user's first name}_{first 6 characters of user's 
>> last name}_{birthday in MMDDYYYY}_{timestamp when user comes in for 
>> the first time}
>> However, I see a few issues with this row key:
>> (1) Although it is not very likely, there is a small chance that two 
>> users with the same name and birthday come in on the same day, and 
>> that the two requests to generate a new user arrive at the same time 
>> (the timestamps are defined in the HTable API and happen to have the 
>> same value before the put method is called). This means the row key 
>> design above won't guarantee a unique row key. Any suggestions on how 
>> to modify it to ensure a unique ID?
>> (2) Sometimes we will only have part of the user's first name and/or 
>> last name. In that case, we will need to perform a scan and return 
>> multiple matches to the client. To avoid scanning the whole table, if 
>> we have the user's first name, we can set the start/stop row 
>> accordingly. But if we only have the user's last name, we can't set up 
>> a good start/stop row. Even worse, if the user provides a 
>> "sounds-like" first or last name, our scan won't be able to return 
>> good possible matches. Has anyone used names as part of the row key 
>> and encountered this type of issue?
>> (3) The row key seems to be long (30+ chars). Will this affect our 
>> read/write performance? Maybe it will increase the storage a bit (say 
>> we have 3 million rows per month)? In other words, does the length of 
>> the row key matter a lot?
>> thanks!
>> Jason
