hbase-user mailing list archives

From Ian Varley <ivar...@salesforce.com>
Subject Re: key design
Date Mon, 21 May 2012 15:12:25 GMT

Why separate tables per log type? Why not a single table with the key:

<log type><date>

That's roughly the approach used by OpenTSDB (with "metric id" instead of "log type", but
same idea). OpenTSDB goes further by "bucketing" values into rows using a base timestamp in
the row key and offset timestamps in the column qualifiers, for more efficiency.
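Roughly, the bucketing idea looks like this in Python; the names, byte widths, and bucket size here are illustrative sketches, not OpenTSDB's actual on-disk format:

```python
import struct

BUCKET_SECONDS = 3600  # one row per metric per hour (illustrative choice)

def make_row_key(metric_id: int, ts: int) -> bytes:
    """Row key = 3-byte metric id + 4-byte base timestamp, aligned down to the bucket."""
    base_ts = ts - (ts % BUCKET_SECONDS)
    return struct.pack(">I", metric_id)[1:] + struct.pack(">I", base_ts)

def make_qualifier(ts: int) -> bytes:
    """Column qualifier = 2-byte offset of this sample within its bucket."""
    return struct.pack(">H", ts % BUCKET_SECONDS)
```

All samples within the same hour share one row key and differ only in the qualifier, so you get far fewer (and wider) rows.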

If you start the key with log type, you can do partial scans for a specific date, but only
within a single log type; to scan across all log types, you'd need to do multiple scans (one
per log type). If you have a fixed and relatively small number of log types (fewer than 20,
say), this could still be the best approach; but if scanning by time across all log types is
a very frequent operation and you have a lot of log types, you might want to reconsider.
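To make the "one scan per log type" fan-out concrete, here's a small sketch (the key layout and date encoding are assumptions for illustration):

```python
def scan_ranges(log_types, day: str):
    """Yield one (start_key, stop_key) pair per log type, each covering one day.

    Appending 0xff as the stop key is a crude way to bound the prefix;
    real code would compute the proper successor of the prefix.
    """
    for lt in sorted(log_types):
        start = f"{lt}{day}".encode()
        yield (start, start + b"\xff")

# One day's logs across two types requires two scans:
ranges = list(scan_ranges(["access", "error"], "20120521"))
```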

The case for using a hash at the start of the key is really just to avoid region server "hot
spotting" (where, even though you have lots of machines, all your insert traffic goes to
one of them, because all inserts are happening "now" and only one region server contains
the range that "now" falls in). Salting or hashing a timestamp-based key spreads that out so
the load is evenly distributed; but it prevents you from doing linear scans over the time
dimension. That's why OpenTSDB (and similar models) start the key with another value that
"spreads" the data over all servers.
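A minimal sketch of that salting trade-off (the bucket count and hash choice are my assumptions, not a specific product's format):

```python
import hashlib

NUM_BUCKETS = 8  # roughly your number of region servers (assumed)

def salted_key(log_type: str, date: str) -> bytes:
    """Prefix the key with a deterministic salt derived from the key body.

    The salt spreads writes across NUM_BUCKETS key ranges; the cost is
    that a time-range scan now needs NUM_BUCKETS scans, one per prefix.
    """
    body = f"{log_type}{date}".encode()
    salt = int.from_bytes(hashlib.md5(body).digest()[:2], "big") % NUM_BUCKETS
    return bytes([salt]) + body
```

Because the salt is a hash of the key body rather than a random value, reads can recompute it, which is why hash-derived salts are usually preferable to random ones.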


On May 21, 2012, at 7:56 AM, mete wrote:

> Hello folks,
> I am trying to come up with a nice key design for storing logs in the
> company. I am planning to index them and store the row key in the index for
> random reads.
> I need to balance the writes equally between the region servers, and I could
> not understand how OpenTSDB does that by prefixing the metric id. (I related
> the metric id to the log type.) In my log storage case a log line just has a
> type and a date, and the rest of it is not really very useful information.
> So I think I can create a table for every distinct log type, and I need
> a random salt to route to a different region server, similar to this:
> <salt>-<date>
> But with this approach I believe I will lose the ability to do effective
> partial scans for a specific date (if for some reason I need that). What do
> you think? And for the salt approach, do you use randomly generated salts or
> hashes that actually mean something (like the hash of the date)?
> I am using random UUIDs at the moment but I am trying to find a better
> approach; any feedback is welcome.
> Cheers,
> Mete
