hbase-user mailing list archives

From Alex Baranau <alex.barano...@gmail.com>
Subject Re: Is it necessary to set MD5 on rowkey?
Date Wed, 19 Dec 2012 23:07:45 GMT
> Plus, it's kind of a game or data puzzle -- how to
> use the data's nature to your advantage :)

Well said :), +1!

Re the previous questions about how/when salting with round-robin may differ
from taking the first byte of a hash or similar (and when it is preferable):
0) you don't care about single gets (true in a huge number of cases), and one
or more of the following holds:
1) (very arguable) you want a very, very even distribution
2) (very arguable) you don't want to spend time calculating a hash of very
long keys
3) (might be the only "real" reason) you want more control over which buckets
(and hence Regions, RSs) receive more write load, in general or at any
particular point in time (see also [1] for one example of when you may
want that)
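
A minimal sketch of the difference, not tied to any HBase API. The names
NUM_BUCKETS, hashed_key, salted_key and candidate_keys are made up for
illustration, and MD5 is used only because it is the hash named in the
subject line. The hash prefix can be recomputed from the key, so a single
get still works; the round-robin salt cannot, so a reader has to try every
bucket.

```python
import hashlib
from itertools import count
from typing import List

NUM_BUCKETS = 8  # hypothetical number of salt buckets


def hashed_key(key: bytes) -> bytes:
    """Prefix derived from the key itself, so a reader can recompute it."""
    bucket = hashlib.md5(key).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + key


_next_write = count()  # writer-side counter driving the round-robin


def salted_key(key: bytes) -> bytes:
    """Round-robin salt: the prefix depends on write order, not on the key."""
    bucket = next(_next_write) % NUM_BUCKETS
    return bytes([bucket]) + key


def candidate_keys(key: bytes) -> List[bytes]:
    """Keys a reader must try to find a round-robin-salted row."""
    return [bytes([b]) + key for b in range(NUM_BUCKETS)]
```

Replacing the counter in salted_key with any other bucket-selection policy
is what gives the writer the control mentioned in 3).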

There may be more. That's off the top of my head.

To prevent/stop a "holy war": yes, in the majority of cases you will use
a hash-based solution. At least in the beginning and in the simplest cases.

Alex Baranau
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -

[1] http://search-hadoop.com/m/TjkXd11qhLS

On Wed, Dec 19, 2012 at 6:04 PM, David Arthur <mumrah@gmail.com> wrote:

> I wasn't really intending to describe a schema for a "webtable", but
> rather coming up with a contrived example of a compound key where you want
> to avoid hotspotting.
> The point, which has been reiterated a few times, is that there is not one
> solution for all row key requirements. Hashing/salting is just another tool
> in the tool belt.
> On 12/19/12 5:28 PM, Andrew Purtell wrote:
>> I generally agree. I built a webtable design once. We dropped the scheme
>> and reversed the domain to support "suffix glob" type queries over a group
>> of related hosts. There is then a natural hotspot at "com" but salting
>> would only have dispersed queries that should go to one row (or a group of
>> adjacent rows) over multiple regionservers, actually hurting query
>> efficiency. Instead we set the region split threshold low in the beginning,
>> under the assumption that the resulting splits in the keyspace from the
>> initial stream of URLs would approximate the overall distribution, then
>> turned up the split threshold when entering production steady state.
>> On Wed, Dec 19, 2012 at 2:15 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>>> On Wed, Dec 19, 2012 at 1:26 PM, David Arthur <mumrah@gmail.com> wrote:
>>>> Let's say you want to decompose a url into domain and path to include in
>>>> your row key.
>>>> You could of course just use the url as the key, but you will see
>>>> hotspotting since most will start with "http".
>>> Doesn't the original Bigtable paper [0] design around this problem by
>>> dropping the protocol and only storing the domain? *goes to check* Yes,
>>> it does.
>>> Personally, I've never encountered an HBase schema design problem where
>>> salting really nailed it. It's an okay place to start with initial designs,
>>> especially if you don't know your data well. I'm a big fan of using the
>>> natural variance in the data itself to solve this problem. OpenTSDB does
>>> this quite well, IMHO. Plus, it's kind of a game or data puzzle -- how to
>>> use the data's nature to your advantage :)
>>> Just my 2¢
>>> -n
>>> [0]: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
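
The key layout discussed in the quoted thread (drop the scheme, reverse the
domain) can be sketched roughly as follows. This is an illustration only:
webtable_row_key is a hypothetical helper, and the exact key format used in
the design described above is not given in the thread.

```python
from urllib.parse import urlsplit


def webtable_row_key(url: str) -> str:
    """Build a row key with the scheme dropped and the host labels reversed."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # "www.example.com" -> "com.example.www", so related hosts sort together
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return reversed_host + path
```

With keys like these, all hosts under one domain share a prefix such as
"com.example.", so a prefix scan covers the whole group of related hosts.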
