hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: Questions on FuzzyRowFilter
Date Sun, 18 May 2014 18:16:07 GMT
@Mike,

The biggest problem is you're not listening. Please actually read my
response (and you'll understand that what we're calling "salting" is not a
random seed).

Phoenix already has secondary indexes in two flavors: one optimized for
write-once data and one more general for fully mutable data. Soon we'll
have a third for local indexing.

James


On Sun, May 18, 2014 at 10:27 AM, Michael Segel
<michael_segel@hotmail.com> wrote:

> @James,
>
> I know and that’s the biggest problem.
> Salts by definition are random seeds.
>
> Now I have two new phrases.
>
> 1) We want to remain on a sodium free diet.
> 2) Learn to kick the bucket.
>
> When you have data that is coming in on a time series, is the data mutable
> or not?
>
> A better approach would be to redesign a second type of storage to handle
> serial data and how the regions are split and managed.
> Or just not use HBase to store the underlying data in the first place and
> just store the index… ;-)
> (Yes, I thought about this too.)
>
> -Mike
>
> On May 16, 2014, at 7:50 PM, James Taylor <jtaylor@salesforce.com> wrote:
>
> > Hi Mike,
> > I agree with you - the way you've outlined is exactly the way Phoenix has
> > implemented it. It's a bit of a problem with terminology, though. We call
> > it salting: http://phoenix.incubator.apache.org/salted.html. We hash the
> > key, mod the hash with the SALT_BUCKETS value you provide, and prepend
> > this single-byte value to the row key. Maybe you can coin a good term for
> > this technique?
> >
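A minimal sketch of the hash, mod, and prepend steps described above. The function name `salted_key`, the parameter `salt_buckets`, and the choice of MD5 are illustrative assumptions; the thread does not say which hash Phoenix uses.

```python
import hashlib


def salted_key(row_key: bytes, salt_buckets: int) -> bytes:
    """Prefix row_key with one byte derived from a hash of the key itself."""
    if not 1 <= salt_buckets <= 256:
        raise ValueError("salt_buckets must fit in a single byte")
    # MD5 is an assumption for illustration; any deterministic hash would do.
    digest = hashlib.md5(row_key).digest()
    bucket = int.from_bytes(digest[:4], "big") % salt_buckets
    return bytes([bucket]) + row_key


key = salted_key(b"20140501|login42", 8)
```

Because the prefix depends only on the key, a writer and a later reader compute the same byte, which is what separates this from a random salt.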
> > FWIW, you don't lose the ability to do a range scan when you salt (or
> > hash-the-key and mod by the number of "buckets"), but you do need to run a
> > scan for each possible value of your salt byte (0 through SALT_BUCKETS-1).
> > Then the client does a merge sort among these scans. It performs well.
> >
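The scan-per-bucket plus client-side merge can be sketched as follows. This is a toy model under stated assumptions: the "table" is a sorted Python list standing in for HBase, and `SALT_BUCKETS`, `bucket_of`, `scan`, and `salted_range_scan` are hypothetical names, not Phoenix or HBase APIs.

```python
import hashlib
import heapq
from bisect import bisect_left

SALT_BUCKETS = 4  # hypothetical bucket count


def bucket_of(row_key: bytes) -> int:
    # Assumed hash; only determinism matters for this sketch.
    return hashlib.md5(row_key).digest()[0] % SALT_BUCKETS


def salted(row_key: bytes) -> bytes:
    return bytes([bucket_of(row_key)]) + row_key


def scan(table, start: bytes, stop: bytes):
    """Stand-in for a range scan over sorted keys: start inclusive, stop exclusive."""
    return table[bisect_left(table, start):bisect_left(table, stop)]


def salted_range_scan(table, start: bytes, stop: bytes):
    """Run one scan per salt byte, strip the salt, and merge the sorted results."""
    per_bucket = [
        [k[1:] for k in scan(table, bytes([b]) + start, bytes([b]) + stop)]
        for b in range(SALT_BUCKETS)
    ]
    return list(heapq.merge(*per_bucket))


table = sorted(salted(f"201405{d:02d}".encode()) for d in range(1, 31))
result = salted_range_scan(table, b"20140510", b"20140515")
```

Each per-bucket result is already sorted by the unsalted key, so the merge only has to interleave the streams rather than re-sort them.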
> > Thanks,
> > James
> >
> >
> > On Fri, May 9, 2014 at 11:57 PM, Michael Segel <michael_segel@hotmail.com> wrote:
> >
> >> 3+ Years on and a bad idea is being propagated again.
> >>
> >> Now repeat after me… DO NOT USE A SALT.
> >>
> >> Having a low sodium diet, especially for HBase is really good for your
> >> health and sanity.
> >>
> >> The salt is going to be orthogonal to the row key (Key).
> >> There is no relationship to the specific Key.
> >>
> >> Using a salt means you now have the ability to randomly spread the
> >> distribution of data to avoid HOT SPOTTING.
> >> However, you lose the ability to seek for a specific row.
> >>
> >> YOU HASH THE KEY.
> >>
> >> The hash, whether you use SHA-1 or MD5, is going to yield the same result
> >> each and every time you provide the key.
> >>
> >> But wait, the generated hash is 160 bits long. We don’t need that!
> >> Absolutely true if you just want to randomize the key to avoid hot
> >> spotting. There’s this concept called truncating the hash to the desired
> >> length.
> >> So to Adrien’s point, you can truncate it to a single byte, which would be
> >> sufficient….
> >> Now when you want to seek for a specific row, you can find it.
> >>
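The difference for single-row lookups can be sketched with two in-memory dictionaries standing in for tables. All names here are hypothetical, and SHA-1 truncated to one byte is just one of the options mentioned above.

```python
import hashlib
import os

hashed_table = {}
salted_table = {}


def hashed_prefix(row_key: bytes) -> bytes:
    # Truncate the SHA-1 digest to one byte; the same key always yields the same prefix.
    return hashlib.sha1(row_key).digest()[:1]


def put_hashed(row_key: bytes, value: str) -> None:
    hashed_table[hashed_prefix(row_key) + row_key] = value


def put_salted(row_key: bytes, value: str) -> None:
    # A true random salt has no relationship to the key.
    salted_table[os.urandom(1) + row_key] = value


def get_hashed(row_key: bytes):
    # One lookup: the prefix is recomputed from the key.
    return hashed_table.get(hashed_prefix(row_key) + row_key)


def get_salted(row_key: bytes):
    # The prefix is unknown, so every possible salt byte has to be tried.
    for b in range(256):
        value = salted_table.get(bytes([b]) + row_key)
        if value is not None:
            return value
    return None


put_hashed(b"user123", "row-a")
put_salted(b"user123", "row-b")
```

With the hashed prefix a get costs one probe; with the random prefix it can cost up to 256 in this model.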
> >> The downside to any solution is that you lose the ability to do a range
> >> scan.
> >> BUT BY USING A HASH AND NOT A SALT, YOU DON’T LOSE THE ABILITY TO FETCH A
> >> SINGLE ROW VIA A get() CALL.
> >>
> >> <rant>
> >> This simple fact was pointed out several years ago, yet for some
> >> reason, the use of a salt persists.
> >> I’ve actually made that part of the HBase course I wrote and use it in my
> >> presentation(s) on HBase.
> >>
> >> It amazes me that the committers and regulars who post here still don’t
> >> grok the fact that if you’re going to ‘SALT’ a row, you might as well not
> >> use HBase and stick with Hive.
> >> I remember Ed C’s rant about how preferential treatment on Hive patches
> >> was given to vendors’ committers… that preferential treatment seems to also
> >> be extended to speakers at conferences. It wouldn’t be a problem if said
> >> speakers actually knew the topic… ;-)
> >>
> >> Propagation of bad ideas means that you’re leaving a lot of performance on
> >> the table and it can kill or cripple projects.
> >>
> >> </rant>
> >>
> >> Sorry for the rant…
> >>
> >> -Mike
> >>
> >>
> >>
> >>
> >> On May 3, 2014, at 4:39 PM, Software Dev <static.void.dev@gmail.com>
> >> wrote:
> >>
> >>> Ok so there is no way around the FuzzyRowFilter checking every single
> >>> row in the table correct? If so, what is a valid use case for that
> >>> filter?
> >>>
> >>> Ok so salt to a low enough prefix that makes scanning reasonable. Our
> >>> client for accessing these tables is a Rails (not JRuby) application
> >>> so we are stuck with either the Thrift or Rails client. Can either of
> >>> these perform multiple gets/scans?
> >>>
> >>>
> >>>
> >>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <adrien.mogenet@gmail.com> wrote:
> >>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data can be
> >>>> split enough among all the possible regions, but you won't be able to
> >>>> easily benefit from distributed scans to gather what you want.
> >>>>
> >>>> Let's say you want to split (time+login) with a salted key and you expect
> >>>> to be able to retrieve events from 20140429 pretty fast. Then I would split
> >>>> input data among 10 "spans", spread over 10 regions and 10 RS (i.e.,
> >>>> `$random % 10'). To retrieve ordered data, I would parallelize Scans over
> >>>> the 10 span groups (<00>-20140429, <01>-20140429...) and merge-sort
> >>>> everything until I've got all the expected results.
> >>>>
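The per-span scan boundaries for one day can be sketched as below. The `NN-YYYYMMDD` ASCII key layout mirrors the `<00>-20140429` notation above, but the exact encoding and the helper name `span_scan_ranges` are assumptions.

```python
def span_scan_ranges(day: str, spans: int = 10):
    """Return one (start_row, stop_row) pair per span for keys beginning 'NN-<day>'."""
    ranges = []
    for span in range(spans):
        start = f"{span:02d}-{day}".encode()
        # Exclusive stop row: bump the last byte so every key with this prefix sorts
        # before it. Safe here because the last byte is an ASCII digit, never 0xFF.
        stop = start[:-1] + bytes([start[-1] + 1])
        ranges.append((start, stop))
    return ranges


ranges = span_scan_ranges("20140429")
```

Each pair can be handed to an independent scan, so the ten scans can run in parallel before their results are merge-sorted.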
> >>>> So in terms of performance this looks "a little bit" faster than your 2^32
> >>>> randomization.
> >>>>
> >>>>
> >>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <static.void.dev@gmail.com> wrote:
> >>>>
> >>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
> >>>>> time series data (20140501, 20140502...).  We can prefix all of the
> >>>>> keys with 4 random bytes and then just skip these during scanning. Is
> >>>>> that correct? This *seems* like it will work, but I'm questioning the
> >>>>> performance of this even if it does work.
> >>>>>
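The idea of skipping the 4 random bytes while matching the date can be sketched as a masked comparison. This is a simplified stand-in, not the HBase filter itself: it assumes a mask convention in which 0 marks a position that must match and 1 marks a position that may hold any value, and `fuzzy_match` is a hypothetical helper.

```python
def fuzzy_match(row: bytes, pattern: bytes, mask: bytes) -> bool:
    """Mask byte 0: row must equal pattern at that position; mask byte 1: any value."""
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    if len(row) < len(pattern):
        return False
    return all(m == 1 or r == p for r, p, m in zip(row, pattern, mask))


# Four "don't care" bytes followed by the fixed date 20140501.
pattern = bytes(4) + b"20140501"
mask = bytes([1, 1, 1, 1]) + bytes(8)

rows = [
    bytes([0x9A, 0x10, 0x03, 0x7F]) + b"20140501",
    bytes([0x01, 0x02, 0x03, 0x04]) + b"20140502",
    bytes([0xFF, 0xEE, 0xDD, 0xCC]) + b"20140501",
]
matches = [r for r in rows if fuzzy_match(r, pattern, mask)]
```

Matching rows can sit under any of the 2^32 prefixes, which is why the performance concern raised above is a reasonable one.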
> >>>>> Also, is this available via the REST client, shell and/or Thrift client?
> >>>>>
> >>>>> Also, is there a FuzzyColumn equivalent of this feature?
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Adrien Mogenet
> >>>> http://www.borntosegfault.com
> >>>
> >>
> >>
>
>
