hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Questions on FuzzyRowFilter
Date Sun, 18 May 2014 17:27:10 GMT

I know and that’s the biggest problem. 
Salts by definition are random seeds. 

Now I have two new phrases. 

1) We want to remain on a sodium free diet. 
2) Learn to kick the bucket. 

When you have data that is coming in on a time series, is the data mutable or not? 

A better approach would be to design a second type of storage for serial data, with its own
rules for how regions are split and managed. 
Or just not use HBase to store the underlying data in the first place and just store the index…
(Yes, I thought about this too.)


On May 16, 2014, at 7:50 PM, James Taylor <jtaylor@salesforce.com> wrote:

> Hi Mike,
> I agree with you - the way you've outlined is exactly the way Phoenix has
> implemented it. It's a bit of a problem with terminology, though. We call
> it salting: http://phoenix.incubator.apache.org/salted.html. We hash the
> key, mod the hash with the SALT_BUCKET value you provide, and prepend the
> row key with this single byte value. Maybe you can coin a good term for
> this technique?
> FWIW, you don't lose the ability to do a range scan when you salt (or
> hash-the-key and mod by the number of "buckets"), but you do need to run a
> scan for each possible value of your salt byte (0 - SALT_BUCKET-1). Then
> the client does a merge sort among these scans. It performs well.
> Thanks,
> James
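The scheme described above (hash the key, mod by the bucket count, prepend one byte, then scan each bucket and merge) can be sketched roughly as follows. This is a minimal illustration, not Phoenix's implementation: `zlib.crc32` is a stand-in hash, the bucket count of 4 is arbitrary, the function names are hypothetical, and a sorted Python list plays the role of the HBase table.

```python
import bisect
import heapq
import zlib

SALT_BUCKETS = 4  # hypothetical bucket count for this sketch


def salt_byte(key: bytes, buckets: int = SALT_BUCKETS) -> bytes:
    """Derive the one-byte prefix from the key itself, so it can be recomputed."""
    # crc32 is a stand-in hash for this sketch, not the hash Phoenix uses.
    return bytes([zlib.crc32(key) % buckets])


def salted(key: bytes, buckets: int = SALT_BUCKETS) -> bytes:
    """Row key as stored: bucket byte followed by the original key."""
    return salt_byte(key, buckets) + key


def range_scan(table, start: bytes, stop: bytes, buckets: int = SALT_BUCKETS):
    """Scan [start, stop) once per bucket, then merge the sorted results.

    `table` is a sorted list of salted row keys standing in for an HBase table.
    """
    per_bucket = []
    for b in range(buckets):
        prefix = bytes([b])
        lo = bisect.bisect_left(table, prefix + start)
        hi = bisect.bisect_left(table, prefix + stop)
        # Strip the bucket byte so rows compare on the original key.
        per_bucket.append([row[1:] for row in table[lo:hi]])
    # Each per-bucket list is already sorted, so a k-way merge restores order.
    return list(heapq.merge(*per_bucket))


days = [b"201405%02d" % d for d in range(1, 11)]
table = sorted(salted(k) for k in days)
result = range_scan(table, b"20140503", b"20140506")
```

The cost of a range scan grows with the number of buckets, since every bucket has to be scanned, which is why a small bucket count matters here.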
> On Fri, May 9, 2014 at 11:57 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>> 3+ Years on and a bad idea is being propagated again.
>> Now repeat after me… DO NOT USE A SALT.
>> Having a low sodium diet, especially for HBase, is really good for your
>> health and sanity.
>> The salt is going to be orthogonal to the row key (Key).
>> There is no relationship to the specific Key.
>> Using a salt means you now have the ability to randomly spread the
>> distribution of data to avoid HOT SPOTTING.
>> However you lose the ability to seek for a specific row.
>> The hash, whether you use SHA-1 or MD5, is going to yield the same result
>> each and every time you provide the key.
>> But wait, the generated SHA-1 hash is 160 bits long. We don’t need that!
>> Absolutely true if you just want to randomize the key to avoid hot
>> spotting. There’s this concept called truncating the hash to the desired
>> length.
>> So to Adrien’s point, you can truncate it to a single byte which would be
>> sufficient….
>> Now when you want to seek for a specific row, you can find it.
>> The downside to any solution is that you lose the ability to do a range
>> scan.
>> <rant>
>> This simple fact was pointed out several years ago, yet for some
>> reason, the use of a salt persists.
>> I’ve actually made that part of the HBase course I wrote and use it in my
>> presentation(s) on HBase.
>> It amazes me that the committers and regulars who post here still don’t
>> grok the fact that if you’re going to ‘SALT’ a row, you might as well not
>> use HBase and stick with Hive.
>> I remember Ed C’s rant about how preferential treatment on Hive patches
>> was given to vendors’ committers… that preferential treatment seems to also
>> be extended to speakers at conferences. It wouldn’t be a problem if those said
>> speakers actually knew the topic… ;-)
>> Propagation of bad ideas means that you’re leaving a lot of performance on
>> the table and it can kill or cripple projects.
>> </rant>
>> Sorry for the rant…
>> -Mike
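To make the contrast in the message above concrete, here is a small sketch comparing a random one-byte salt with a one-byte prefix truncated from the key's hash. Everything in it is an assumption for illustration: the function names are hypothetical, MD5 is just one of the two hashes mentioned, and a Python dict stands in for the table.

```python
import hashlib
import os


def random_salted(key: bytes) -> bytes:
    """Prefix one random byte: spreads writes, but the prefix is unknowable later."""
    return os.urandom(1) + key


def hash_prefixed(key: bytes) -> bytes:
    """Prefix the first byte of the key's MD5 digest: same key, same prefix."""
    return hashlib.md5(key).digest()[:1] + key


def get_random_salted(table: dict, key: bytes):
    """Point lookup under a random salt: try every possible prefix byte."""
    probes = 0
    for b in range(256):
        probes += 1
        row = bytes([b]) + key
        if row in table:
            return table[row], probes
    return None, probes


def get_hash_prefixed(table: dict, key: bytes):
    """Point lookup under a hash prefix: recompute the prefix, one probe."""
    return table.get(hash_prefixed(key)), 1


key = b"20140501-login42"
salted_table = {random_salted(key): "event"}
hashed_table = {hash_prefixed(key): "event"}

value_a, probes_a = get_random_salted(salted_table, key)
value_b, probes_b = get_hash_prefixed(hashed_table, key)
```

With the hash prefix the reader can rebuild the full row key from the original key alone; with the random byte it may need up to 256 probes for a single row.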
>> On May 3, 2014, at 4:39 PM, Software Dev <static.void.dev@gmail.com> wrote:
>>> Ok so there is no way around the FuzzyRowFilter checking every single
>>> row in the table, correct? If so, what is a valid use case for that
>>> filter?
>>> Ok so salt to a low enough prefix that makes scanning reasonable. Our
>>> client for accessing these tables is a Rails (not JRuby) application
>>> so we are stuck with either the Thrift or Rails client. Can either of
>>> these perform multiple gets/scans?
>>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <adrien.mogenet@gmail.com> wrote:
>>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data can be
>>>> split enough among all the possible regions, but you won't be able to
>>>> easily benefit from distributed scans to gather what you want.
>>>> Let's say you want to split (time+login) with a salted key and you expect to
>>>> be able to retrieve events from 20140429 pretty fast. Then I would split
>>>> input data among 10 "spans", spread over 10 regions and 10 RS (i.e.
>>>> `$random % 10'). To retrieve ordered data, I would parallelize Scans over the 10
>>>> span groups (<00>-20140429, <01>-20140429...) and merge-sort
>>>> until I've got all the expected results.
>>>> So in terms of performance this looks "a little bit" faster than your 2^32
>>>> randomization.
>>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <static.void.dev@gmail.com> wrote:
>>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
>>>>> time series data (20140501, 20140502...).  We can prefix all of the
>>>>> keys with 4 random bytes and then just skip these during scanning. Is
>>>>> that correct? This *seems* like it will work, but I'm questioning the
>>>>> performance of this even if it does work.
>>>>> Also, is this available via the rest client, shell and/or thrift client?
>>>>> Also, is there a FuzzyColumn equivalent of this feature?
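For the question above, the matching rule behind a fuzzy row key can be sketched in a few lines. This is only an emulation of the mask semantics as documented for HBase's FuzzyRowFilter (mask byte 0 means the position must match, 1 means any byte is accepted); the function name and sample rows are hypothetical, and no HBase client is involved.

```python
def fuzzy_match(row: bytes, pattern: bytes, mask: bytes) -> bool:
    """True if `row` agrees with `pattern` at every position where mask is 0."""
    if len(row) < len(pattern):
        return False
    return all(m == 1 or r == p for r, p, m in zip(row, pattern, mask))


# Four don't-care bytes (the random prefix) followed by a fixed date.
pattern = b"\x00\x00\x00\x00" + b"20140501"
mask = bytes([1, 1, 1, 1] + [0] * 8)

rows = [
    b"\x9a\x03\xf1\x27" + b"20140501",
    b"\x10\x22\x33\x44" + b"20140502",
    b"\xff\xee\xdd\xcc" + b"20140501",
]
matches = [row for row in rows if fuzzy_match(row, pattern, mask)]
```

The rows for a single date end up scattered across the whole key space, which is the performance concern raised in the question.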
>>>> --
>>>> Adrien Mogenet
>>>> http://www.borntosegfault.com
