hbase-user mailing list archives

From Mike Axiak <m...@axiak.net>
Subject Re: Using separator/delimiter in HBase rowkey?
Date Mon, 08 Jul 2013 15:14:51 GMT
Hello Jason,

Have you considered the following rowkey?

  murmur_128(userId) + timestamp + userId ?

This handles both of your cases: (1) murmur 128 is much faster than
MD5, so it adds very little overhead, and (2) the userId at the end of
the key ensures that murmur collisions cannot cause issues. This key
also handles incrementing userIds well, because adjacent userIds will
likely land in separate regions.
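
A rough sketch of how such a key could be assembled and parsed, in
case it helps. Everything named here (HashedRowKey, hash128, rowKey,
userIdOf, timestampOf) is hypothetical and not part of HBase. The JDK
ships no murmur hash, so MD5 stands in below purely because it also
produces 16 bytes; a real murmur3 128-bit implementation would be
swapped in. The timestamp is assumed to be an 8-byte big-endian long.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; names and layout are illustrative, not from HBase.
final class HashedRowKey {
    static final int HASH_LEN = 16;                       // 128-bit hash prefix
    static final int PREFIX_LEN = HASH_LEN + Long.BYTES;  // hash + 8-byte timestamp

    // Stand-in for murmur_128: MD5 is used only because the JDK has no
    // murmur hash and MD5 also yields 16 bytes. Replace with murmur3 128.
    static byte[] hash128(String userId) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(userId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Layout: [16-byte hash][8-byte big-endian timestamp][userId bytes]
    static byte[] rowKey(String userId, long timestamp) {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(PREFIX_LEN + id.length)
                .put(hash128(userId)).putLong(timestamp).put(id).array();
    }

    // The prefix has a fixed length, so no separator is needed to parse.
    static String userIdOf(byte[] rowKey) {
        return new String(rowKey, PREFIX_LEN, rowKey.length - PREFIX_LEN,
                StandardCharsets.UTF_8);
    }

    static long timestampOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey, HASH_LEN, Long.BYTES).getLong();
    }
}
```

With this layout, a scan for one user would start at hash128(userId)
and stop just past that 16-byte prefix. If two userIds ever hashed to
the same prefix, the trailing userId would still tell their rows apart.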


On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang <jason.huang@icare.com> wrote:
> Hello,
> I am trying to get some advice on pros/cons of using separator/delimiter as
> part of HBase row key.
> Currently one of our user activity tables has a rowkey design of
> "UserID^TimeStamp" with a separator of "^". (UserID is a string that won't
> include '^').
> This is designed for the two common use cases in our system:
> (1) If we come from a context where the UserID is known, we can do a scan
> easily for all the user activities with a startRowKey and stopRowKey.
> (2) If we come from an external networked table where the row key of this
> user activity table is stored and can be retrieved as activityRowKey, then
> we can use the following code to parse out the UserID and do the same scan
> as in (1):
>     String activityRowKeyStr = Bytes.toString(activityRowKey);
>     // The UserID is everything before the first '^' in "UserID^TimeStamp".
>     String userId =
> activityRowKeyStr.substring(0, activityRowKeyStr.indexOf("^"));
> Then I can set startRowKey and stopRowKey for the scan based on userId.
> Here we get the benefit of having the UserID as part of the row key with the
> separator (compared to another solution that stores the UserID as one of
> the columns in the user activity table).
> The reason I picked a separator after UserID is that sometimes we may not get
> a fixed-length string for the UserID value. At one point I actually thought
> of using MD5 to hash the UserID and make it fixed length; however, the
> possibility of collision and the possible overhead of applying the hash
> function made me pick the separator "^".
> My questions:
> (1) My argument is that using a separator is better than using an MD5
> hash value. Does that seem reasonable? Could you comment on other pros
> and cons that I might have missed (as the basis for my argument)?
> (2) On using a separator/delimiter, besides the requirement that this
> separator/delimiter shouldn't appear elsewhere in the rowkey, are there any
> other requirements? Are there any special separators/delimiters that are
> better/worse than the average ones?
> thanks!
> Jason
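
For the separator-based design in the quoted message, the parsing and
the scan bounds for use case (1) could be sketched as follows. The
class and method names are hypothetical. The sketch assumes keys of
the form "UserID^TimeStamp" encoded as UTF-8, and relies on HBase
ordering rows by raw bytes and treating the stop row as exclusive.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for "UserID^TimeStamp" keys; not part of HBase.
final class SeparatorRowKey {
    static final char SEP = '^';  // 0x5E; must never appear in a UserID

    // The UserID is everything before the first separator.
    static String userIdOf(String rowKey) {
        int i = rowKey.indexOf(SEP);
        if (i < 0) {
            throw new IllegalArgumentException("no separator in row key");
        }
        return rowKey.substring(0, i);
    }

    // Inclusive start: the smallest key that begins with "userId^".
    static byte[] startRow(String userId) {
        return (userId + SEP).getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop: '_' (0x5F) is the byte right after '^' (0x5E), so
    // "userId_" sorts after every key that begins with "userId^".
    static byte[] stopRow(String userId) {
        return (userId + (char) (SEP + 1)).getBytes(StandardCharsets.UTF_8);
    }
}
```

Because the bounds include the separator, a scan for "ab" does not
pick up rows for "ab1" or "abz": "ab1^..." sorts before "ab^" and
"abz^..." sorts after "ab_".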
