hbase-user mailing list archives

From David Arthur <mum...@gmail.com>
Subject Re: Is it necessary to set MD5 on rowkey?
Date Wed, 19 Dec 2012 23:04:03 GMT
I wasn't really intending to describe a schema for a "webtable", but 
rather to come up with a contrived example of a compound key where you 
want to avoid hotspotting.

The point, which has been reiterated a few times, is that there is not 
one solution for all row key requirements. Hashing/salting is just 
another tool in the tool belt.
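
For readers unfamiliar with the term, here is a minimal sketch of what a salted compound key could look like. It is an illustration only, not the scheme from any particular schema in this thread: the function name, the bucket count, and the "|" separator are all hypothetical choices.

```python
import hashlib

# Hypothetical bucket count; in practice it would be sized to the cluster.
NUM_BUCKETS = 16


def salted_row_key(domain: str, path: str, buckets: int = NUM_BUCKETS) -> bytes:
    """Build a compound key of domain + path with a hash-derived salt prefix.

    The salt is computed from the key itself, so the same (domain, path)
    always maps to the same bucket and point lookups stay possible.
    """
    natural = f"{domain}{path}".encode("utf-8")
    # MD5 is used here only to spread keys evenly, not for security.
    digest = hashlib.md5(natural).digest()
    bucket = digest[0] % buckets
    # Two-digit, zero-padded prefix so buckets sort lexicographically.
    return f"{bucket:02d}|".encode("ascii") + natural


if __name__ == "__main__":
    print(salted_row_key("example.com", "/index.html"))
```

The trade-off is the one discussed in this thread: writes are spread across buckets, but a scan over one domain now has to be issued once per bucket.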

On 12/19/12 5:28 PM, Andrew Purtell wrote:
> I generally agree. I built a webtable design once. We dropped the scheme
> and reversed the domain to support "suffix glob" type queries over a group
> of related hosts. There is then a natural hotspot at "com" but salting
> would only have dispersed queries that should go to one row (or a group of
> adjacent rows) over multiple regionservers, actually hurting query
> efficiency. Instead we set the region split threshold low in the beginning,
> under the assumption that the resulting splits in the keyspace from the
> initial stream of URLs would approximate the overall distribution, then
> turned up the split threshold when entering production steady state.
> On Wed, Dec 19, 2012 at 2:15 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>> On Wed, Dec 19, 2012 at 1:26 PM, David Arthur <mumrah@gmail.com> wrote:
>>> Let's say you want to decompose a url into domain and path to include in
>>> your row key.
>>> You could of course just use the url as the key, but you will see
>>> hotspotting since most will start with "http".
>> Doesn't the original Bigtable paper [0] design around this problem by
>> dropping the protocol and only storing the domain? *goes to check* Yes, it
>> does.
>> Personally, I've never encountered an HBase schema design problem where
>> salting really nailed it. It's an okay place to start with initial designs,
>> especially if you don't know your data well. I'm a big fan of using the
>> natural variance in the data itself to solve this problem. OpenTSDB does
>> this quite well, IMHO. Plus, it's kind of a game or data puzzle -- how to
>> use the data's nature to your advantage :)
>> Just my 2¢
>> -n
>> [0]:
>> http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
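
The alternative described in the quoted replies (drop the scheme and reverse the domain, so related hosts sort next to each other) can be sketched as follows. This is an assumed illustration, not the code from either design mentioned above; the function name and the choice to append the path directly are hypothetical.

```python
from urllib.parse import urlsplit


def reversed_domain_key(url: str) -> str:
    """Drop the scheme and reverse the host labels of a URL.

    "http://www.example.com/a" becomes "com.example.www/a", so every host
    under example.com shares the prefix "com.example." and sorts together.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    return reversed_host + path


if __name__ == "__main__":
    urls = [
        "http://www.example.org/",
        "https://maps.google.com",
        "http://www.google.com/search",
    ]
    for key in sorted(reversed_domain_key(u) for u in urls):
        print(key)
```

With keys in this form, a query over a group of related hosts becomes a single prefix scan over adjacent rows, which is why salting would work against it.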
