Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (nike.apache.org: domain of alex.baranov.v@gmail.com
 designates 209.85.214.169 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <BLU0-SMTP448FB95D6A60BDFFCF6484E8FBB0@phx.gbl>
References: 
 <CAA7+SiDSCy-r-MWY_VmEPEeCBqHGkdXzw1REexgNFp_Fpc1QrQ@mail.gmail.com>
	<BLU0-SMTP448FB95D6A60BDFFCF6484E8FBB0@phx.gbl>
Date: Sat, 18 Aug 2012 15:13:59 -0400
Message-ID: 
 <CAA7+SiDhy+vhpV1EZpQWysPoqp2MfdH1AU5kegaDfLLU+Avh6A@mail.gmail.com>
Subject: Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
From: Alex Baranau <alex.baranov.v@gmail.com>
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=e89a8f2352ad35da8104c78f1444

--e89a8f2352ad35da8104c78f1444
Content-Type: text/plain; charset=ISO-8859-1

@Michael,

This is not a simple partial key scan. Take this example of rows:

aaaaa_100001_20120801
aaaaa_100001_20120802
aaaaa_100001_20120803
aaaaa_100001_20120804
aaaaa_100001_20120805
aaaaa_100002_20120801
aaaaa_100002_20120802
aaaaa_100002_20120803
aaaaa_100002_20120804
aaaaa_100002_20120805

where aaaaa is the userId, 10000x is the actionId and 201208xx is a
timestamp. If the query is to select actions in the range 20120803-20120805
(in this case, the last 3 days), then when the scan encounters the row:

aaaaa_100001_20120801

it "knows" it can fast-forward the scan to "aaaaa_100001_20120803" and
skip the records in between (in practice, this may mean skipping a LOT of
records).
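
To make the skipping concrete, here is a rough, self-contained Java sketch.
It is not the actual filter code: the class FastForwardSketch, its scan()
method and the TreeSet standing in for the sorted table are all made up for
illustration, and it assumes fixed-length keys that end in an 8-digit date.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not HBase code: simulates hint-based skipping
// over sorted row keys of the form userId_actionId_yyyymmdd.
final class FastForwardSketch {
    // Assumption: the date is always the last 8 characters of the key.
    static final int DATE_LEN = 8;

    /** Returns the matching keys; seeks[0] is incremented once per jump. */
    static List<String> scan(TreeSet<String> rows, String fromDate,
                             String toDate, int[] seeks) {
        List<String> result = new ArrayList<>();
        String current = rows.isEmpty() ? null : rows.first();
        while (current != null) {
            int split = current.length() - DATE_LEN;
            String prefix = current.substring(0, split);
            String date = current.substring(split);
            if (date.compareTo(fromDate) < 0) {
                // Before the range: jump to the first possible match
                // for this userId_actionId prefix.
                seeks[0]++;
                current = rows.ceiling(prefix + fromDate);
            } else if (date.compareTo(toDate) > 0) {
                // Past the range: skip the rest of this prefix.
                seeks[0]++;
                current = rows.ceiling(prefix + '\uffff');
            } else {
                // Inside the range: keep the row and step to the next one.
                result.add(current);
                current = rows.higher(current);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<String> rows = new TreeSet<>();
        for (String action : new String[] {"100001", "100002"}) {
            for (int day = 1; day <= 5; day++) {
                rows.add("aaaaa_" + action + "_2012080" + day);
            }
        }
        int[] seeks = new int[1];
        List<String> hits = scan(rows, "20120803", "20120805", seeks);
        System.out.println(hits.size() + " rows matched, "
                + seeks[0] + " fast-forward seeks");
    }
}
```

On the ten distinct rows above this touches eight keys instead of ten; with
years of history per userId_actionId the saving would be far larger.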


@Anil,

> Sample Query: I want to get all the event which happened in last month.

1. What other queries do you do? Just trying to understand why this row key
format was chosen.

2. Can you set the timestamp on Puts to be the same as the timestamp
"assigned" to your record by the app logic? If you can, then the first thing
to try is performing the scan with scan.setTimeRange(startTs, stopTs).
Depending on how you write the data, this may help a lot with the speed of
reading by ts, because whole HFiles can then be skipped based on the ts
range they cover. I don't know enough about your data to judge, but:
  * if you have relatively few users, most of whom have a long history of
interaction with your system (i.e. there are a lot of records for a
specific "userX_actionY"), and
  * if you write data with monotonically increasing timestamps, and
  * if your regions are not too big,
then this might help you, as it increases the chance that some of the
HFiles contain only data that doesn't fall into the time interval you
select by. Otherwise, if data items with different timestamps are spread
evenly across the HFiles, the chance that any HFile is skipped from
reading is very small. I believe Lars George has illustrated this in one of
his presentations, but I couldn't find it quickly.
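
To illustrate why the write pattern matters, here is a toy model in plain
Java. It is not HBase internals: TimeRangeSkipSketch, FileInfo and
filesToRead() are hypothetical names, and the only assumption is that each
file records the min and max timestamp of its cells and that the requested
range is [startTs, stopTs), as with scan.setTimeRange.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not HBase code: a file can be skipped when its
// [minTs, maxTs] does not overlap the requested [startTs, stopTs).
final class TimeRangeSkipSketch {
    static final class FileInfo {
        final String name;
        final long minTs;
        final long maxTs;

        FileInfo(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** Returns the names of the files that still have to be read. */
    static List<String> filesToRead(List<FileInfo> files,
                                    long startTs, long stopTs) {
        List<String> toRead = new ArrayList<>();
        for (FileInfo f : files) {
            boolean overlaps = f.maxTs >= startTs && f.minTs < stopTs;
            if (overlaps) {
                toRead.add(f.name);
            }
        }
        return toRead;
    }

    public static void main(String[] args) {
        // Written with monotonically increasing ts: files cover disjoint spans.
        List<FileInfo> ordered = Arrays.asList(
                new FileInfo("f1", 0, 99),
                new FileInfo("f2", 100, 199),
                new FileInfo("f3", 200, 299));
        // Timestamps spread across all files: every file spans almost everything.
        List<FileInfo> mixed = Arrays.asList(
                new FileInfo("g1", 0, 299),
                new FileInfo("g2", 5, 290),
                new FileInfo("g3", 10, 295));
        System.out.println("ordered: " + filesToRead(ordered, 200, 250));
        System.out.println("mixed:   " + filesToRead(mixed, 200, 250));
    }
}
```

In the "ordered" case only one of the three files has to be read; in the
"mixed" case none can be skipped, which is the situation described above.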

> something like FuzzyRowFilter with range

Yes, something like this looks like it would be very valuable. It would be
interesting to implement too. Let's see if I can find the time for that in
my work plan. If you want to try it yourself, go for it! Let me know if you
need help in that case ;)
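
In the meantime, the one-rule-per-actionId workaround from my earlier
message (quoted below) could be generated along these lines. This is only a
sketch in plain Java with no HBase dependency: FuzzyRulesSketch, Rule,
rulesForRange() and matchesAny() are hypothetical names. It assumes the mask
convention 1 = "any byte" (the "?" positions) and 0 = "byte must match",
and that row keys have exactly the rule's length; please check both against
the filter in the HBASE-6509 patch before relying on it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: builds one fuzzy rule per actionId in a range and
// checks a row key against them. Not the FuzzyRowFilter implementation.
final class FuzzyRulesSketch {
    static final class Rule {
        final byte[] key;   // template bytes; ignored where mask is 1
        final byte[] mask;  // assumption: 1 = any byte, 0 = must match

        Rule(byte[] key, byte[] mask) {
            this.key = key;
            this.mask = mask;
        }
    }

    /** Rules like ??????00200 ... ??????00350 (both ends inclusive). */
    static List<Rule> rulesForRange(int userIdLen, int from, int to) {
        List<Rule> rules = new ArrayList<>();
        for (int id = from; id <= to; id++) {
            byte[] action = String.format(Locale.ROOT, "%05d", id)
                    .getBytes(StandardCharsets.US_ASCII);
            byte[] key = new byte[userIdLen + action.length];
            byte[] mask = new byte[key.length];
            for (int i = 0; i < userIdLen; i++) {
                mask[i] = 1; // userId part is fuzzy
            }
            System.arraycopy(action, 0, key, userIdLen, action.length);
            rules.add(new Rule(key, mask));
        }
        return rules;
    }

    /** O(m*n) check: m rules, each compared over at most n bytes. */
    static boolean matchesAny(byte[] row, List<Rule> rules) {
        for (Rule r : rules) {
            if (row.length != r.key.length) {
                continue; // this sketch only handles fixed-length keys
            }
            boolean ok = true;
            for (int i = 0; i < row.length && ok; i++) {
                ok = r.mask[i] == 1 || row[i] == r.key[i];
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = rulesForRange(6, 200, 350);
        byte[] row = "12345600275".getBytes(StandardCharsets.US_ASCII);
        System.out.println(rules.size() + " rules, match: "
                + matchesAny(row, rules));
    }
}
```

Note that with both ends inclusive, 00200 to 00350 yields 151 rules rather
than a round 150.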

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Sat, Aug 18, 2012 at 6:56 AM, Michael Segel <michael_segel@hotmail.com> wrote:

> What row keys are you skipping?
>
> Using your example...
> You have a start row of 00000000200, and an end key of
> xFFxFFxFFxFFxFFxFF00350.
> Note that you could also write that end key as xFF(1..6) 01 since it looks
> like you're trying to match the 00 in positions 7 and 8 of your numeric
> string.
>
> Assuming that when you say ? you mean that you expect to have a character
> in that spot and that your row key is exactly 11 characters in length.
>
> While you may not return all the rows in that range, you do have to still
> check the row key, unless I am missing something.
>
> So what am I missing?
>
> On Aug 17, 2012, at 3:42 PM, Alex Baranau <alex.baranov.v@gmail.com>
> wrote:
>
> > There was a question [1] in a
> > https://issues.apache.org/jira/browse/HBASE-6509 JIRA comment; it makes
> > more sense to answer it here.
> >
> > With the current FuzzyRowFilter I believe the only way to approach the
> > problem is to add 150 fuzzy rules to the filter: ??????00200,
> > ??????00201, ..., ??????00350.
> >
> > As for performance of this approach I can say the following:
> > * there are two "checks" happening for each processed row key (i.e. those
> > row keys we don't skip)
> > * first one performs simple check if the given row key satisfies the
> > fuzzy rule and also determines if there's next row key to advance to (if
> > this one doesn't satisfy). The check takes up at max O(n), where n is the
> > length of fuzzy rule. I.e. this is done in one simple loop, which can be
> > broken before all bytes are checked. For m rules this will be O(m*n).
> > * second piece calculates the next row key to provide it as a hint for
> > fast-forwarding. We again check all rules and finding the smallest hint.
> > Operation is also done in one loop, i.e. O(m*n) here as well.
> >
> > With 150 fuzzy rules of length 11, the applying filter is equivalent to
> > the loop with simple checks thru 150*11*2 ~ 3000 elements. This might
> > look a lot, but can work quite fast. So I'd just try it.
> >
> > As for extension which will be more efficient, it makes sense to consider
> > implementing it. Let me think more about it and get back with the JIRA
> > Issue to you :). But I'd suggest you trying existing FuzzyRowFilter
> > first. The output (performance) would give us some food for thinking, or
> > may be even turns out to be acceptable for you (hopefully).
> >
> >> Can i run this kind of filter on HBase0.92 without doing any significant
> > update to the cluster
> >
> > Until the next release, you'll have to use the FuzzyRowFilter as any
> > other custom filter. Just grab the patch from HBASE-6509 and copy the
> > filter. No need to patch & rebuild HBase.
> >
> > Alex Baranau
> > ------
> > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > Solr
> >
> > [1]
> >
> > Anil Gupta added a comment - 18/Aug/12 04:37
> > Hi Alex,
> > I have a question related to this filter. I have a similar filtering
> > requirement which will be an extension to FuzzyFilterRow.
> > Suppose, i have the following structure of rowkeys: userid_actionid,
> > where userid is of 6 digit and then actionid is 5 digit. I would like to
> > get all the rows with actionid between 00200 to 00350. With current
> > FuzzyRowFilter i can search for all the rows a particular actionid.
> > Instead of searching for a particular actionid i would like to search for
> > a range of actionid. Does this use case sounds like an extension to
> > current FuzzyRowFilter? Can i run this kind of filter on HBase0.92
> > without doing any significant update to the cluster. If i develop this
> > kind of filter then what is needed to run it on all the RS's?
> > Thanks,
> > Anil
>
>

--e89a8f2352ad35da8104c78f1444--