hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: row filter - binary comparator at certain range
Date Tue, 22 Oct 2013 03:54:06 GMT
One thing I neglected to mention is that the table is pre-split at the
"prepending-row-key-with-single-hashed-byte" boundaries, so the expectation
is that you'd allocate enough buckets that you don't end up needing to
split the regions. But if you under-allocate (i.e. choose too small a
SALT_BUCKETS value), then I see your point.

Thanks,
James
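
[Editor's sketch, not part of the original message.] The scheme described
above (one prefix byte computed as hash(row key) mod SALT_BUCKETS, with the
table pre-split at the prefix-byte boundaries) can be illustrated roughly as
below. The CRC32 hash and the function names are assumptions made for
illustration only; the thread does not say which hash Phoenix uses.

```python
import zlib
from typing import List


def salt_byte(row_key: bytes, buckets: int) -> int:
    """Stable one-byte prefix: hash the row key, then mod by the bucket count.

    CRC32 is a stand-in hash chosen for this sketch, not Phoenix's own.
    """
    if not 1 <= buckets <= 256:
        raise ValueError("bucket count must fit in a single byte")
    return zlib.crc32(row_key) % buckets


def salted_key(row_key: bytes, buckets: int) -> bytes:
    """Prepend the prefix byte to the original row key."""
    return bytes([salt_byte(row_key, buckets)]) + row_key


def split_points(buckets: int) -> List[bytes]:
    """Region boundaries for a table pre-split at each prefix-byte value.

    The first region starts at the empty key, so N buckets need N - 1 splits.
    """
    return [bytes([b]) for b in range(1, buckets)]
```

With 8 buckets this yields 7 split points and therefore 8 initial regions.
Because the prefix depends only on the row key, the same key always maps to
the same bucket.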


On Mon, Oct 21, 2013 at 5:58 PM, Michael Segel <michael_segel@hotmail.com> wrote:

> James,
>
> It's evenly distributed, however... because it's a timestamp, it's a
> 'tail-end Charlie' addition.
> So when you split a region, the top half is never added to, so you end up
> with all regions half filled except for the last region in each 'modded'
> value.
>
> I wouldn't say it's a bad thing if you plan for it.
>
> On Oct 21, 2013, at 5:07 PM, James Taylor <jtaylor@salesforce.com> wrote:
>
> > We don't truncate the hash, we mod it. Why would you expect that data
> > wouldn't be evenly distributed? We've not seen this to be the case.
> >
> >
> >
> > On Mon, Oct 21, 2013 at 1:48 PM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> >
> >> What do you call hashing the row key?
> >> Or hashing the row key and then appending the row key to the hash?
> >> Or hashing the row key, truncating the hash value to some subset and then
> >> appending the row key to the value?
> >>
> >> The problem is that there is specific meaning to the term salt. Re-using
> >> it here will cause confusion because you're implying something you don't
> >> mean to imply.
> >>
> >> You could say prepend a truncated hash of the key, however… is prepend a
> >> real word? ;-) (I am sorry, I am not a grammar nazi, nor an English major.)
> >>
> >> So even outside of Phoenix, the concept is the same.
> >> Even with a truncated hash, you will find that over time, all but the tail
> >> N regions will only be half full.
> >> This could be both good and bad.
> >>
> >> (Where N is your number of allowable hash values, 8 or 16.)
> >>
> >> You've solved potentially one problem… but still have other issues that
> >> you need to address.
> >> I guess the simple answer is to double the region sizes and not care that
> >> most of your regions will be 1/2 the max size… but that is the size you
> >> really want, and 8-16 regions will be up to twice as big.
> >>
> >>
> >>
> >> On Oct 21, 2013, at 3:26 PM, James Taylor <jtaylor@salesforce.com> wrote:
> >>
> >>> What do you think it should be called, because
> >>> "prepending-row-key-with-single-hashed-byte" doesn't have a very good ring
> >>> to it. :-)
> >>>
> >>> Agree that getting the row key design right is crucial.
> >>>
> >>> The range of "prepending-row-key-with-single-hashed-byte" is declarative
> >>> when you create your table in Phoenix, so you typically declare an upper
> >>> bound based on your cluster size (not 255, but maybe 8 or 16). We've run
> >>> the numbers and it's typically faster, but as with most things, not always.
> >>>
> >>> HTH,
> >>> James
> >>>
> >>>
> >>> On Mon, Oct 21, 2013 at 1:05 PM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> >>>
> >>>> Then it's not a SALT. And please don't use the term 'salt' because it has
> >>>> a specific meaning outside of what you want it to mean. Just like saying
> >>>> HBase has ACID because you write the entire row as an atomic element. But
> >>>> I digress….
> >>>>
> >>>> Ok so to your point…
> >>>>
> >>>> 1 byte == 255 possible values.
> >>>>
> >>>> So which will be faster:
> >>>>
> >>>> creating a list of the 1-byte truncated hash of each possible timestamp in
> >>>> your range, or doing 255 separate range scans with the start and stop range
> >>>> key set?
> >>>>
> >>>> That will give you the results you want, however… I'd go back and have
> >>>> them possibly rethink the row key if they can … assuming this is the base
> >>>> access pattern.
> >>>>
> >>>> HTH
> >>>>
> >>>> -Mike
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Oct 21, 2013, at 11:37 AM, James Taylor <jtaylor@salesforce.com> wrote:
> >>>>
> >>>>> Phoenix restricts salting to a single byte.
> >>>>> Salting perhaps is misnamed, as the salt byte is a stable hash based on
> >>>>> the row key.
> >>>>> Phoenix's skip scan supports sub-key ranges.
> >>>>> We've found salting in general to be faster (though there are cases where
> >>>>> it's not), as it ensures better parallelization.
> >>>>>
> >>>>> Regards,
> >>>>> James
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, Oct 21, 2013 at 9:14 AM, Vladimir Rodionov <vrodionov@carrieriq.com> wrote:
> >>>>>
> >>>>>> FuzzyRowFilter does not work on sub-key ranges.
> >>>>>> Salting is bad for any scan operation, unfortunately. When salt prefix
> >>>>>> cardinality is small (1-2 bytes),
> >>>>>> one can try something similar to FuzzyRowFilter but with additional
> >>>>>> sub-key range support.
> >>>>>> If salt prefix cardinality is high (> 2 bytes) - do a full scan with your
> >>>>>> own Filter (for timestamp ranges).
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Vladimir Rodionov
> >>>>>> Principal Platform Engineer
> >>>>>> Carrier IQ, www.carrieriq.com
> >>>>>> e-mail: vrodionov@carrieriq.com
> >>>>>>
> >>>>>> ________________________________________
> >>>>>> From: Premal Shah [premal.j.shah@gmail.com]
> >>>>>> Sent: Sunday, October 20, 2013 10:42 PM
> >>>>>> To: user
> >>>>>> Subject: Re: row filter - binary comparator at certain range
> >>>>>>
> >>>>>> Have you looked at FuzzyRowFilter? Seems to me that it might satisfy your
> >>>>>> use-case.
> >>>>>>
> >>>>>>
> >>>>>> http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Oct 20, 2013 at 9:31 PM, Tony Duan <duanjianmin@126.com> wrote:
> >>>>>>
> >>>>>>> Alex Vasilenko <aa.vasilenko@...> writes:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Lars,
> >>>>>>>>
> >>>>>>>> But how will it behave when I have a salt at the beginning of the key to
> >>>>>>>> properly shard the table across regions? Imagine a row key of format
> >>>>>>>> salt:timestamp and rows go like this:
> >>>>>>>> ...
> >>>>>>>> 1:15
> >>>>>>>> 1:16
> >>>>>>>> 1:17
> >>>>>>>> 1:23
> >>>>>>>> 2:3
> >>>>>>>> 2:5
> >>>>>>>> 2:12
> >>>>>>>> 2:15
> >>>>>>>> 2:19
> >>>>>>>> 2:25
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>> And I want to find all rows that have the second part (timestamp) in range
> >>>>>>>> 15-25. What startKey and endKey should be used?
> >>>>>>>>
> >>>>>>>> Alexandr Vasilenko
> >>>>>>>> Web Developer
> >>>>>>>> Skype:menterr
> >>>>>>>> mob: +38097-611-45-99
> >>>>>>>>
> >>>>>>>> 2012/2/9 lars hofhansl <lhofhansl@...>
> >>>>>>> Hi,
> >>>>>>> Alexandr Vasilenko
> >>>>>>> Have you ever resolved this issue? I am also facing this issue.
> >>>>>>> I also want to implement this functionality.
> >>>>>>> Imagine a row key of format
> >>>>>>> salt:timestamp and rows go like this:
> >>>>>>> ...
> >>>>>>> 1:15
> >>>>>>> 1:16
> >>>>>>> 1:17
> >>>>>>> 1:23
> >>>>>>> 2:3
> >>>>>>> 2:5
> >>>>>>> 2:12
> >>>>>>> 2:15
> >>>>>>> 2:19
> >>>>>>> 2:25
> >>>>>>> ...
> >>>>>>>
> >>>>>>> And I want to find all rows that have the second part (timestamp) in range
> >>>>>>> 15-25.
> >>>>>>>
> >>>>>>> Could you please tell me how you resolved this?
> >>>>>>> Thanks in advance.
> >>>>>>>
> >>>>>>>
> >>>>>>> Tony duan
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Regards,
> >>>>>> Premal Shah.
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
> The opinions expressed here are mine, while they may reflect a cognitive
> thought, that is purely accidental.
> Use at your own risk.
> Michael Segel
> michael_segel (AT) hotmail.com
>
>
>
>
>
>
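
[Editor's sketch, not part of the original thread.] The question at the root
of the quoted thread (rows keyed salt:timestamp, find timestamps 15-25) has a
simple fallback mentioned in the replies: issue one range scan per salt
value. A rough illustration follows, using an in-memory sorted list in place
of a table. The key layout (one salt byte followed by an 8-byte big-endian
timestamp) and all function names are assumptions for this sketch. The fixed
width matters: byte order only matches numeric order when every timestamp is
encoded at the same width.

```python
from bisect import bisect_left
from typing import List, Tuple


def make_key(salt: int, ts: int, width: int = 8) -> bytes:
    """Salt byte followed by a fixed-width big-endian timestamp."""
    return bytes([salt]) + ts.to_bytes(width, "big")


def bucket_ranges(buckets: int, start_ts: int, stop_ts: int) -> List[Tuple[bytes, bytes]]:
    """One (start_key, stop_key) pair per salt value; stop_key is exclusive."""
    return [(make_key(b, start_ts), make_key(b, stop_ts)) for b in range(buckets)]


def rows_in_range(sorted_keys: List[bytes], buckets: int,
                  start_ts: int, stop_ts: int) -> List[bytes]:
    """Run one range scan per bucket over the sorted keys and concatenate."""
    hits: List[bytes] = []
    for start_key, stop_key in bucket_ranges(buckets, start_ts, stop_ts):
        lo = bisect_left(sorted_keys, start_key)
        hi = bisect_left(sorted_keys, stop_key)
        hits.extend(sorted_keys[lo:hi])
    return hits


# The rows from the thread: salt 1 and salt 2, with three buckets (0, 1, 2).
rows = [(1, 15), (1, 16), (1, 17), (1, 23),
        (2, 3), (2, 5), (2, 12), (2, 15), (2, 19), (2, 25)]
keys = sorted(make_key(s, t) for s, t in rows)

# An inclusive range 15-25 becomes the half-open range [15, 26).
matches = rows_in_range(keys, buckets=3, start_ts=15, stop_ts=26)
decoded = [(k[0], int.from_bytes(k[1:], "big")) for k in matches]
```

The number of scans equals the number of salt values, which is why the
thread recommends keeping the bucket count small (8 or 16) rather than using
the full range of a byte.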
