hbase-user mailing list archives

From Jason Huang <jason.hu...@icare.com>
Subject HBase table row key design question.
Date Tue, 02 Oct 2012 14:28:03 GMT

I am designing an HBase table for users and hope to get some
suggestions on my row key design. Thanks.

This user table will have columns for user information such as
name, birthday, gender, address, phone number, etc. The first time a
user comes to us, we will ask for all of this information and
generate a new row in the table with a unique row key. The next time
the same user comes in, we will ask for his/her name and birthday,
and our application should quickly get the row(s) in the table that
match the name and birthday provided.

Here is what I am thinking of using as the row key:

{first 6 characters of user's first name}_{first 6 characters of
user's last name}_{birthday in MMDDYYYY}_{timestamp when user comes
in for the first time}
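
To make the layout concrete, here is a minimal sketch of how such a
key could be assembled. The function names, the lowercasing of names,
and the use of epoch milliseconds for the timestamp are assumptions
for illustration and are not part of the design described above.

```python
from datetime import date, datetime, timezone


def name_part(name: str, width: int = 6) -> str:
    """First `width` characters of a name (lowercasing is an assumption)."""
    return name.strip().lower()[:width]


def make_row_key(first: str, last: str, birthday: date, first_seen: datetime) -> str:
    """Join the four components with '_' in the order given in the design."""
    ts_ms = int(first_seen.timestamp() * 1000)  # epoch milliseconds (assumed format)
    return "_".join([
        name_part(first),
        name_part(last),
        birthday.strftime("%m%d%Y"),  # MMDDYYYY
        str(ts_ms),
    ])


key = make_row_key(
    "Jonathan", "Smith",
    date(1980, 3, 15),
    datetime(2012, 10, 2, 14, 28, 3, tzinfo=timezone.utc),
)
print(key)  # jonath_smith_03151980_1349188083000
```

Names shorter than six characters produce a shorter key in this
sketch, so the key length is not fixed.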

However, I have a few questions about this row key:

(1) Although it is not very likely, there is a small chance that two
users with the same name and birthday come in on the same day, and
that the two requests to create a new user arrive at the same time
(the timestamps are set through the HTable API and happen to have
the same value before the put method is called). This means the row
key design above won't guarantee a unique row key. Any suggestions on
how to modify it to ensure a unique ID?
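
One possible way to reduce the collision risk, sketched here only as
an illustration, is to append a short random suffix to the key. The
helper name and the 8-character suffix length are hypothetical
choices. A random suffix makes collisions very unlikely but does not
strictly rule them out; a hard guarantee would need an atomic
"write only if the row does not exist" check on the server side.

```python
import uuid


def make_unique_row_key(base_key: str, suffix_len: int = 8) -> str:
    """Append a random hex suffix so simultaneous requests get different keys.

    `suffix_len` is an assumed value: 8 hex characters give 32 random bits.
    """
    suffix = uuid.uuid4().hex[:suffix_len]
    return f"{base_key}_{suffix}"


base = "jonath_smith_03151980_1349188083000"
key_a = make_unique_row_key(base)
key_b = make_unique_row_key(base)
print(key_a)
print(key_b)  # same base, different suffix (with very high probability)
```

Because the suffix comes last, both keys still sort next to each other
under the same name and birthday prefix.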

(2) Sometimes we will only have part of the user's first and/or last
name. In that case, we will need to perform a scan and return
multiple matches to the client. To avoid scanning the whole table, if
we have the user's first name, we can set the start/stop rows
accordingly. But if we only have the user's last name, we can't set
up a good start/stop row, because the last name is not at the
beginning of the key. Even worse, if the user provides a
"sounds-like" first or last name, our scan won't be able to return
good possible matches. Has anyone used names as part of the row key
and encountered this type of issue?
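
The start/stop row idea for a known first-name prefix can be sketched
as follows. This is plain Python that only mimics the byte-wise
ordering of row keys; the helper name and the sample keys are made up
for illustration, and the stop row is treated as exclusive.

```python
from typing import Tuple


def prefix_scan_range(prefix: bytes) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering every key that begins with prefix."""
    stop = bytearray(prefix)
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return prefix, b""  # empty stop row: scan to the end of the table
    stop[-1] += 1  # smallest byte string greater than every key with this prefix
    return prefix, bytes(stop)


# Hypothetical keys, sorted byte-wise the way a table would keep them.
keys = sorted([
    b"jason_huang_01011980_1349188083000",
    b"jonath_smith_03151980_1349188083000",
    b"jon_snow_07041985_1349188084000",
    b"joseph_smith_03151980_1349188085000",
])

start, stop = prefix_scan_range(b"jon")
matches = [k for k in keys if start <= k < stop]
print(start, stop)  # b'jon' b'joo'
print(matches)      # the two keys whose first-name part starts with "jon"
```

A search for the last name "smith" has no such range in this layout:
the matching keys are scattered across different first-name prefixes,
so every key would have to be examined.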

(3) The row key seems long (30+ characters). Will this affect our
read/write performance? It will probably increase storage somewhat
(say we have 3 million rows per month). In other words, does the
length of the row key matter a lot?
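
A back-of-the-envelope estimate may help frame the storage part of
this question. HBase stores the row key with every cell, so the key
length is multiplied by the number of columns per row. The 8 cells
per row, the 35-byte key, and the 16-byte comparison key below are
assumptions for illustration; only the 3 million rows per month comes
from the question. Compression and HDFS replication are ignored.

```python
def row_key_bytes_per_month(rows: int, cells_per_row: int, key_len: int) -> int:
    """Raw bytes spent on row keys alone, since each cell repeats the key."""
    return rows * cells_per_row * key_len


ROWS_PER_MONTH = 3_000_000  # from the question
CELLS_PER_ROW = 8           # assumed number of columns per user

long_key = row_key_bytes_per_month(ROWS_PER_MONTH, CELLS_PER_ROW, 35)
short_key = row_key_bytes_per_month(ROWS_PER_MONTH, CELLS_PER_ROW, 16)

print(long_key / 1e6, "MB per month with a 35-byte key")   # 840.0
print(short_key / 1e6, "MB per month with a 16-byte key")  # 384.0
```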


