MIME-Version: 1.0
Date: Sat, 18 Aug 2012 15:13:59 -0400
Subject: Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
From: Alex Baranau
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=e89a8f2352ad35da8104c78f1444

--e89a8f2352ad35da8104c78f1444
Content-Type: text/plain; charset=ISO-8859-1

@Michael,

This is not a simple partial key scan. Take this example of rows:

aaaaa_100001_20120801
aaaaa_100001_20120802
aaaaa_100001_20120803
aaaaa_100001_20120804
aaaaa_100001_20120805
aaaaa_100002_20120801
aaaaa_100002_20120802
aaaaa_100002_20120803
aaaaa_100002_20120804
aaaaa_100002_20120805

where aaaaa is the userId, 10000x is the actionId and 201208xx is a
timestamp. If the query is to select actions in the range
20120803-20120805 (in this case, the last 3 days), then when the scan
encounters the row

aaaaa_100001_20120801

it "knows" it can fast-forward to "aaaaa_100001_20120803" and skip the
records in between (in practice, this may mean skipping a LOT of
records).

@Anil,

> Sample Query: I want to get all the event which happened in last month.

1. What other queries do you run? Just trying to understand why this row
key format was chosen.

2. Can you set the timestamp on Puts to the same timestamp that your app
logic "assigns" to the record? If you can, then the first thing to try
is a scan with scan.setTimeRange(startTs, stopTs). Depending on how you
write the data, this may help a lot with read speed when selecting by
ts, because that way whole HFiles may be skipped from reading based on
ts.

I don't know enough about your data to judge, but:
* in case you don't have a lot of users, and most of them have a long
history of interaction with your system (i.e.
there are a lot of records for a specific "userX_actionY"), and
* if you write data with monotonically increasing timestamps, and
* if your regions are not too big,
then this might help you, as it will increase the chance that some of
the HFiles will contain only data that falls outside the time interval
you select by. Otherwise, if written data items with different
timestamps are very well spread across the HFiles, the chance that some
HFiles are skipped from reading is very small. I believe Lars George has
illustrated this in one of his presentations, but I couldn't find it
quickly.

> something like FuzzyRowFilter with range

Yes, something like this looks like it would be very valuable. It would
be interesting to implement too. Let's see if I can find the time for
that in my work plan. If you want to try it yourself, go for it! Let me
know if you need help in that case ;)

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Sat, Aug 18, 2012 at 6:56 AM, Michael Segel wrote:

> What row keys are you skipping?
>
> Using your example...
> You have a start row of 00000000200, and an end key of
> xFFxFFxFFxFFxFFxFF00350.
> Note that you could also write that end key as xFF(1..6) 01 since it
> looks like you're trying to match the 00 in positions 7 and 8 of your
> numeric string.
>
> Assuming that when you say ? you mean that you expect to have a
> character in that spot and that your row key is exactly 11 characters
> in length.
>
> While you may not return all the rows in that range, you do have to
> still check the row key, unless I am missing something.
>
> So what am I missing?
>
> On Aug 17, 2012, at 3:42 PM, Alex Baranau wrote:
>
> > There was a question [1] in
> > https://issues.apache.org/jira/browse/HBASE-6509 JIRA comment, it
> > makes more sense to answer it here.
> >
> > With the current FuzzyRowFilter I believe the only way to approach
> > the problem is to add 150 fuzzy rules to the filter: ??????00200,
> > ??????00201, ..., ??????00350.
> >
> > As for performance of this approach I can say the following:
> > * there are two "checks" happening for each processed row key (i.e.
> > those row keys we don't skip)
> > * first one performs simple check if the given row key satisfies the
> > fuzzy rule and also determines if there's next row key to advance to
> > (if this one doesn't satisfy). The check takes up at max O(n), where
> > n is the length of fuzzy rule. I.e. this is done in one simple loop,
> > which can be broken before all bytes are checked. For m rules this
> > will be O(m*n).
> > * second piece calculates the next row key to provide it as a hint
> > for fast-forwarding. We again check all rules and finding the
> > smallest hint. Operation is also done in one loop, i.e. O(m*n) here
> > as well.
> >
> > With 150 fuzzy rules of length 11, the applying filter is equivalent
> > to the loop with simple checks thru 150*11*2 ~ 3000 elements. This
> > might look a lot, but can work quite fast. So I'd just try it.
> >
> > As for extension which will be more efficient, it makes sense to
> > consider implementing it. Let me think more about it and get back
> > with the JIRA Issue to you :). But I'd suggest you trying existing
> > FuzzyRowFilter first. The output (performance) would give us some
> > food for thinking, or may be even turns out to be acceptable for you
> > (hopefully).
> >
> >> Can i run this kind of filter on HBase0.92 without doing any
> >> significant update to the cluster
> >
> > Until the next release, you'll have to use the FuzzyRowFilter as any
> > other custom filter. Just grab the patch from HBASE-6509 and copy the
> > filter. No need to patch & rebuild HBase.
> >
> > Alex Baranau
> > ------
> > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > [1]
> >
> > Anil Gupta added a comment - 18/Aug/12 04:37
> > Hi Alex,
> > I have a question related to this filter. I have a similar filtering
> > requirement which will be an extension to FuzzyRowFilter.
> > Suppose, i have the following structure of rowkeys: userid_actionid,
> > where userid is of 6 digit and then actionid is 5 digit. I would like
> > to get all the rows with actionid between 00200 to 00350. With
> > current FuzzyRowFilter i can search for all the rows a particular
> > actionid. Instead of searching for a particular actionid i would like
> > to search for a range of actionid.
> > Does this use case sounds like an extension to current
> > FuzzyRowFilter? Can i run this kind of filter on HBase0.92 without
> > doing any significant update to the cluster. If i develop this kind
> > of filter then what is needed to run it on all the RS's?
> > Thanks,
> > Anil

--e89a8f2352ad35da8104c78f1444--
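
The approach discussed in this thread for Anil's actionid range (one
fuzzy rule per actionid, each checked in a single loop over the key
bytes) could be sketched roughly as follows. This is plain Java with no
HBase dependency: the class FuzzyRuleSketch and its methods are
hypothetical names made up for illustration, and treating the '?' byte
as the wildcard is a simplification assumed by this sketch, not a
description of how FuzzyRowFilter itself encodes its rules. Note that
the inclusive range 00200..00350 yields 151 rules, close to the "150"
quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of HBase.
class FuzzyRuleSketch {

    // Builds one rule per actionid in [from, to], e.g. "??????00200".
    // Assumes the key layout from the thread: 6-char userid + 5-digit actionid.
    static List<String> buildRules(int from, int to) {
        List<String> rules = new ArrayList<>();
        for (int actionId = from; actionId <= to; actionId++) {
            rules.add(String.format("??????%05d", actionId));
        }
        return rules;
    }

    // Checks one rule against a row key in a single loop: at most O(n)
    // comparisons for a rule of length n, stopping at the first mismatch.
    // Assumption of this sketch: a '?' byte in the rule matches any byte.
    static boolean satisfies(byte[] rowKey, byte[] rule) {
        if (rowKey.length != rule.length) {
            return false;
        }
        for (int i = 0; i < rule.length; i++) {
            if (rule[i] != '?' && rule[i] != rowKey[i]) {
                return false;
            }
        }
        return true;
    }

    // Tries every rule in turn: at most O(m*n) comparisons for m rules.
    static boolean satisfiesAny(String rowKey, List<String> rules) {
        byte[] keyBytes = rowKey.getBytes(StandardCharsets.US_ASCII);
        for (String rule : rules) {
            if (satisfies(keyBytes, rule.getBytes(StandardCharsets.US_ASCII))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> rules = buildRules(200, 350);
        System.out.println("rules: " + rules.size());
        System.out.println("12345600275 -> " + satisfiesAny("12345600275", rules));
        System.out.println("12345600400 -> " + satisfiesAny("12345600400", rules));
    }
}
```

The worst case for a single row key here is m*n byte comparisons
(151*11), which corresponds to the first of the two checks described
above. The second check, computing the next-row-key hint used for
fast-forwarding, is not attempted in this sketch.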