From: anil gupta <anilgupta84@gmail.com>
Date: Wed, 22 Aug 2012 11:42:57 -0700
Subject: Re: How to query by rowKey-infix
To: user@hbase.apache.org, Christian Schäfer

Hi Christian,

I had similar requirements to yours. So far I have used timestamps for filtering the data, and I would say the performance is satisfactory. Here are the results of timestamp-based filtering: the table has 34 million records (average row size is 1.21 KB), and in 136 seconds I get the entire result of a query that matched 225 rows.

I am running an HBase 0.92, 8-node cluster on the VMware hypervisor. Each node has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my set-up hosts 2 slave instances (2 VMs running DataNode, NodeManager, RegionServer). I have only allocated 1200 MB for the RegionServers. I haven't modified the block size of HDFS or HBase.
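The store-file skipping Anil relies on can be sketched outside HBase: each store file carries the min/max timestamp of its cells, so a scan restricted to a time range only needs to open files whose ranges overlap it. A minimal, self-contained sketch of that pruning logic (the `StoreFile` class and the numbers are illustrative stand-ins, not HBase's actual internals):

```java
import java.util.ArrayList;
import java.util.List;

public class TimeRangePruning {
    // Illustrative stand-in for the timestamp metadata a store file keeps.
    static class StoreFile {
        final String name;
        final long minTs, maxTs;
        StoreFile(String name, long minTs, long maxTs) {
            this.name = name; this.minTs = minTs; this.maxTs = maxTs;
        }
    }

    // Keep only the files whose [minTs, maxTs] overlaps the scan's [start, end).
    static List<StoreFile> filesToRead(List<StoreFile> files, long start, long end) {
        List<StoreFile> needed = new ArrayList<>();
        for (StoreFile f : files) {
            if (f.maxTs >= start && f.minTs < end) {
                needed.add(f);
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        List<StoreFile> files = List.of(
            new StoreFile("hfile-1", 1000L, 2000L),
            new StoreFile("hfile-2", 2000L, 3000L),
            new StoreFile("hfile-3", 3000L, 4000L));
        // A scan restricted to [2500, 3500) never opens hfile-1.
        System.out.println(filesToRead(files, 2500L, 3500L).size());
    }
}
```

This only works when Put timestamps carry the event time from the row key, which is exactly the trick Alex suggests later in the thread.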
Considering the below-par hardware configuration of the cluster, I feel the performance is OK, and IMO it'll be better than a substring comparator on column values, since with a substring comparator filter you are essentially doing a FULL TABLE scan, whereas a timerange-based scan can *skip store files*.

On a side note, Alex created a JIRA for enhancing the current FuzzyRowFilter to also do range-based filtering. Here is the link: https://issues.apache.org/jira/browse/HBASE-6618 . You are more than welcome to chime in.

HTH,
Anil Gupta

On Thu, Aug 9, 2012 at 1:55 PM, Christian Schäfer wrote:

> Nice. Thanks, Alex, for sharing your experiences with that custom filter
> implementation.
>
> Currently I'm still using a key filter with a substring comparator.
> As soon as I have a good amount of test data, I will measure the
> performance of that naive substring filter in comparison to your fuzzy
> row filter.
>
> regards,
> Christian
>
>
> ________________________________
> From: Alex Baranau
> To: user@hbase.apache.org; Christian Schäfer
> Sent: 22:18 Thursday, 9 August 2012
> Subject: Re: How to query by rowKey-infix
>
> jfyi: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will
> add documentation to the HBase book very soon [1]
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1] https://issues.apache.org/jira/browse/HBASE-6526
>
> On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau wrote:
>
> > Good!
> >
> > Submitted an initial patch of the fuzzy row key filter at
> > https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the
> > filter class and include it in your code and use it in your setup as any
> > other custom filter (no need to patch HBase).
> >
> > Please let me know if you try it out (or post your comments at
> > HBASE-6509).
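The core idea behind the fuzzy row key filter discussed here is a byte-wise mask over the row key: a 0 mask byte means that position must match the pattern exactly, a non-zero byte marks a wildcard position. A self-contained sketch of just that matching rule (not the actual HBase implementation; key layout and names are illustrative):

```java
public class FuzzyMatchSketch {
    // mask[i] == 0: row[i] must equal pattern[i]; mask[i] != 0: any byte accepted.
    static boolean fuzzyMatch(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Key layout userId(4)_date(10): wildcard the userId, fix the date.
        byte[] pattern = "????_2012-08-01".getBytes();
        byte[] mask = new byte[pattern.length];
        for (int i = 0; i < 4; i++) mask[i] = 1; // first 4 bytes: any userId
        System.out.println(fuzzyMatch("anna_2012-08-01_s1".getBytes(), pattern, mask));
        System.out.println(fuzzyMatch("anna_2012-08-02_s1".getBytes(), pattern, mask));
    }
}
```

This is exactly the "wildcard in the middle of a left-anchored key" shape the thread's userId-date-sessionId design needs; the real filter additionally fast-forwards the scanner past non-matching ranges.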
> >
> > Alex Baranau
> > ------
> > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer wrote:
> >
> >> Hi Alex,
> >>
> >> thanks a lot for the hint about setting the timestamp of the put.
> >> I didn't know that this was possible, but it solves the problem
> >> (the first test was successful).
> >> So I'm really glad that I don't need to apply a filter to extract the
> >> time and so on for every row.
> >>
> >> Nevertheless, I would like to see your custom filter implementation.
> >> It would be nice if you could provide it, helping me to get a bit into it.
> >>
> >> And yes, that helped :)
> >>
> >> regards
> >> Chris
> >>
> >>
> >> ________________________________
> >> From: Alex Baranau
> >> To: user@hbase.apache.org; Christian Schäfer
> >> Sent: 0:57 Friday, 3 August 2012
> >>
> >> Subject: Re: How to query by rowKey-infix
> >>
> >> Hi Christian!
> >>
> >> If we put off secondary indexes and assume you are going with "heavy
> >> scans", you can try the two following things to make it much faster,
> >> if that is appropriate to your situation.
> >>
> >> 1.
> >>
> >>> Is there a more elegant way to collect rows within time range X?
> >>> (Unfortunately, the date attribute is not equal to the timestamp that
> >>> is stored by hbase automatically.)
> >>
> >> Can you set the timestamp of the Puts to the one you have in the row
> >> key, instead of relying on the one that HBase sets automatically (the
> >> current ts)? If you can, this will improve reading speed a lot, by
> >> setting a time range on the scanner. It depends on how you are writing
> >> your data, of course, but I assume that you mostly write data in a
> >> "time-increasing" manner.
> >>
> >> 2.
> >>
> >> If your userId has fixed length, or you can change it so that it has
> >> fixed length, then you can actually use something like a "wildcard"
> >> in the row key.
> >> There's a way in a Filter implementation to fast-forward to the record
> >> with a specific row key and thereby skip many records. This might be
> >> used as follows:
> >> * suppose your userId is 5 characters in length
> >> * suppose you are scanning for records with time between 2012-08-01
> >>   and 2012-08-08
> >> * when you are scanning records and you encounter e.g. the key
> >>   "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you
> >>   can tell the scanner from your filter to fast-forward to the key
> >>   "aaaab_2012-08-01". You know that all remaining records of user
> >>   "aaaaa" don't fall into the interval you need (as the time for its
> >>   records will be >= 2012-08-09).
> >>
> >> As of now, I believe you will have to implement a custom filter to do
> >> that. Pointer:
> >> org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> >> I believe I implemented a similar thing some time ago. If this idea
> >> works for you, I could look for the implementation and share it, if
> >> that helps. Or maybe even simply add it to the HBase codebase.
> >>
> >> Hope this helps,
> >>
> >> Alex Baranau
> >> ------
> >> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> >> - Solr
> >>
> >>
> >> On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer wrote:
> >>
> >>> Excuse my double posting.
> >>> Here is the complete mail:
> >>>
> >>> OK,
> >>>
> >>> at first I will try the scans.
> >>>
> >>> If that's too slow I will have to upgrade hbase (currently
> >>> 0.90.4-cdh3u2) to be able to use coprocessors.
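The fast-forward hint Alex describes boils down to computing the next key to seek to: increment the fixed-length userId by one and append the range's start date. A self-contained sketch of just that hint computation (a plain function, not the `Filter` subclass you would wrap it in; it assumes lowercase a-z userIds, purely for illustration):

```java
public class SeekHintSketch {
    // Given a key "userId_date_sessionId" with a fixed-length userId and a
    // scan range starting at startDate, compute the key to fast-forward to:
    // the lexicographically next userId, followed by the range start.
    static String nextHintKey(String currentKey, int userIdLen, String startDate) {
        char[] userId = currentKey.substring(0, userIdLen).toCharArray();
        // Increment the userId like a base-26 number (carry on 'z' overflow).
        for (int i = userIdLen - 1; i >= 0; i--) {
            if (userId[i] != 'z') { userId[i]++; break; }
            userId[i] = 'a'; // carry into the next position to the left
        }
        return new String(userId) + "_" + startDate;
    }

    public static void main(String[] args) {
        // All remaining "aaaaa" rows lie past the range end, so skip ahead.
        System.out.println(nextHintKey("aaaaa_2012-08-09_3jh345j345kjh", 5, "2012-08-01"));
    }
}
```

In the real filter, this key would be returned from the seek-hint callback after signalling `ReturnCode.SEEK_NEXT_USING_HINT`, letting the scanner jump over the rest of the current user's out-of-range rows.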
> >>>
> >>> Currently I'm stuck at the scans because the query requires two steps
> >>> (therefore maybe some kind of filter chaining is required).
> >>>
> >>> The key: userId-dateInMillis-sessionId
> >>>
> >>> At first I need to extract dateInMillis with a regex or substring
> >>> (using special delimiters for the date).
> >>>
> >>> Second, the extracted value must be parsed to Long and passed to a
> >>> RowFilter comparator like this:
> >>>
> >>> scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> >>> BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
> >>>
> >>> How to chain that?
> >>> Do I have to write a custom filter?
> >>> (I would like to avoid that due to deployment.)
> >>>
> >>> regards
> >>> Chris
> >>>
> >>>
> >>> ----- Original Message -----
> >>> From: Michael Segel
> >>> To: user@hbase.apache.org
> >>> CC:
> >>> Sent: 13:52 Wednesday, 1 August 2012
> >>> Subject: Re: How to query by rowKey-infix
> >>>
> >>> Actually, with coprocessors you can create a secondary index in short
> >>> order. Then your cost is going to be 2 fetches. Trying to do a partial
> >>> table scan will be more expensive.
> >>>
> >>> On Jul 31, 2012, at 12:41 PM, Matt Corgan wrote:
> >>>
> >>>> When deciding between a table scan vs a secondary index, you should
> >>>> try to estimate what percent of the underlying data blocks will be
> >>>> used in the query. By default, each block is 64KB.
> >>>>
> >>>> If each user's data is small and you are fitting multiple users per
> >>>> block, then you're going to need all the blocks, so a tablescan is
> >>>> better because it's simpler. If each user has 1MB+ of data, then you
> >>>> will want to pick out the individual blocks relevant to each date.
> >>>> The secondary index will help you go directly to those sparse blocks,
> >>>> but with a cost in complexity, consistency, and extra denormalized
> >>>> data that knocks primary data out of your block cache.
> >>>>
> >>>> If latency is not a concern, I would start with the table scan.
> >>>> If that's too slow, you add the secondary index, and if you still
> >>>> need it faster you do the primary key lookups in parallel, as Jerry
> >>>> mentions.
> >>>>
> >>>> Matt
> >>>>
> >>>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam wrote:
> >>>>
> >>>>> Hi Chris:
> >>>>>
> >>>>> I'm thinking about building a secondary index for primary key
> >>>>> lookup, then querying using the primary keys in parallel.
> >>>>>
> >>>>> I'm interested to see if there are other options too.
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> Jerry
> >>>>>
> >>>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <
> >>>>> syrious3000@yahoo.de> wrote:
> >>>>>
> >>>>>> Hello there,
> >>>>>>
> >>>>>> I designed a row key for queries that need best performance
> >>>>>> (~100 ms), which looks like this:
> >>>>>>
> >>>>>> userId-date-sessionId
> >>>>>>
> >>>>>> These queries (scans) are always based on a userId and sometimes
> >>>>>> additionally on a date, too.
> >>>>>> That's no problem with the key above.
> >>>>>>
> >>>>>> However, another kind of query shall be based on a given time range
> >>>>>> where the leftmost userId is not given or known.
> >>>>>> In this case I need to get all rows covering the given time range
> >>>>>> with their date to create a daily report.
> >>>>>>
> >>>>>> As I can't set wildcards at the beginning of a left-based index for
> >>>>>> the scan, I only see the possibility of scanning the index of the
> >>>>>> whole table to collect the rowKeys that are inside the time range
> >>>>>> I'm interested in.
> >>>>>>
> >>>>>> Is there a more elegant way to collect rows within time range X?
> >>>>>> (Unfortunately, the date attribute is not equal to the timestamp
> >>>>>> that is stored by hbase automatically.)
> >>>>>>
> >>>>>> Could/should one maybe leverage some kind of row key caching to
> >>>>>> accelerate the collection process?
> >>>>>> Is that covered by the block cache?
> >>>>>>
> >>>>>> Thanks in advance for any advice.
> >>>>>>
> >>>>>> regards
> >>>>>> Chris
> >>>>>>
> >>>>>
> >>>
> >>
> >>
> >> --
> >> Alex Baranau
> >> ------
> >> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> >> - Solr
> >
> >
> > --
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> > - Solr
>

--
Thanks & Regards,
Anil Gupta
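Matt's table-scan-vs-secondary-index rule of thumb from earlier in the thread reduces to simple arithmetic on the 64 KB default block size. A crude, self-contained reading of that heuristic (the threshold and row sizes are illustrative, not a tuning recommendation):

```java
public class BlockEstimateSketch {
    static final long BLOCK_SIZE = 64 * 1024; // HBase default data block size

    // Matt's heuristic, crudely: if several rows fit in one block, a
    // selective query still touches most blocks anyway, so a plain table
    // scan is the simpler choice; if a single user's data spans many
    // blocks, an index that jumps straight to them pays off.
    static String recommend(long avgBytesPerUser) {
        double usersPerBlock = (double) BLOCK_SIZE / avgBytesPerUser;
        return usersPerBlock >= 1.0 ? "table scan" : "secondary index";
    }

    public static void main(String[] args) {
        System.out.println(recommend(1240L));    // ~1.21 KB rows, as in Anil's table
        System.out.println(recommend(2097152L)); // 2 MB per user, Matt's "1MB+" case
    }
}
```

The real decision also weighs the cache pollution and consistency costs Matt lists, which this one-liner deliberately ignores.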