Date: Thu, 2 Aug 2012 18:57:44 -0400
Subject: Re: How to query by rowKey-infix
From: Alex Baranau
To: user@hbase.apache.org, Christian Schäfer

Hi Christian!

If we set secondary indexes aside and assume you are going with "heavy
scans", there are two things you can try to make them much faster, provided
they fit your situation, of course.

1.

> Is there a more elegant way to collect rows within time range X?
> (Unfortunately, the date attribute is not equal to the timestamp that is
> stored by hbase automatically.)

Can you set the timestamp of the Puts to the one you have in the row key,
instead of relying on the one that HBase sets automatically (the current ts)?
If you can, setting a time range on the scanner will improve reading speed a
lot. How much depends on how you are writing your data, of course, but I
assume that you mostly write data in a "time-increasing" manner.

2. If your userId has a fixed length, or you can change it so that it has a
fixed length, then you can actually use something like a "wildcard" in the
row key. A Filter implementation can tell the scanner to fast-forward to the
record with a specific row key, and by doing so skip many records.
This might be used as follows:
* suppose your userId is 5 characters in length
* suppose you are scanning for records with time between 2012-08-01 and
  2012-08-08
* when you are scanning records and you encounter e.g. the key
  "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell
  the scanner from your filter to fast-forward to the key "aaaab_2012-08-01",
  because you know that all remaining records of user "aaaaa" fall outside
  the interval you need (the time for those records will be >= 2012-08-09).

As of now, I believe you will have to implement a custom filter to do that.
Pointer:
org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT

I believe I implemented a similar thing some time ago. If this idea works for
you, I could look for the implementation and share it if it helps. Or maybe
even simply add it to the HBase codebase.

Hope this helps,

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer wrote:

>
> Excuse my double posting.
> Here is the complete mail:
>
> OK,
>
> at first I will try the scans.
>
> If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2)
> to be able to use coprocessors.
>
> Currently I'm stuck at the scans because it requires two steps (therefore
> maybe some kind of filter chaining is required)
>
> The key: userId-dateInMillis-sessionId
>
> At first I need to extract dateInMillis with regex or substring (using
> special delimiters for date)
>
> Second, the extracted value must be parsed to Long and set to a RowFilter
> Comparator like this:
>
> scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
>
> How to chain that?
> Do I have to write a custom filter?
> (Would like to avoid that due to deployment)
>
> regards
> Chris
>
> ----- Original Message -----
> From: Michael Segel
> To: user@hbase.apache.org
> CC:
> Sent: 13:52 Wednesday, 1 August 2012
> Subject: Re: How to query by rowKey-infix
>
> Actually, with coprocessors you can create a secondary index in short order.
> Then your cost is going to be 2 fetches. Trying to do a partial table scan
> will be more expensive.
>
> On Jul 31, 2012, at 12:41 PM, Matt Corgan wrote:
>
> > When deciding between a table scan vs secondary index, you should try to
> > estimate what percent of the underlying data blocks will be used in the
> > query. By default, each block is 64KB.
> >
> > If each user's data is small and you are fitting multiple users per block,
> > then you're going to need all the blocks, so a tablescan is better because
> > it's simpler. If each user has 1MB+ data then you will want to pick out
> > the individual blocks relevant to each date. The secondary index will help
> > you go directly to those sparse blocks, but with a cost in complexity,
> > consistency, and extra denormalized data that knocks primary data out of
> > your block cache.
> >
> > If latency is not a concern, I would start with the table scan. If that's
> > too slow you add the secondary index, and if you still need it faster you
> > do the primary key lookups in parallel as Jerry mentions.
> >
> > Matt
> >
> > On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam wrote:
> >
> >> Hi Chris:
> >>
> >> I'm thinking about building a secondary index for primary key lookup, then
> >> query using the primary keys in parallel.
> >>
> >> I'm interested to see if there is other option too.
> >>
> >> Best Regards,
> >>
> >> Jerry
> >>
> >> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de> wrote:
> >>
> >>> Hello there,
> >>>
> >>> I designed a row key for queries that need best performance (~100 ms)
> >>> which looks like this:
> >>>
> >>> userId-date-sessionId
> >>>
> >>> These queries(scans) are always based on a userId and sometimes
> >>> additionally on a date, too.
> >>> That's no problem with the key above.
> >>>
> >>> However, another kind of queries shall be based on a given time range
> >>> whereas the outermost left userId is not given or known.
> >>> In this case I need to get all rows covering the given time range with
> >>> their date to create a daily reporting.
> >>>
> >>> As I can't set wildcards at the beginning of a left-based index for the
> >>> scan,
> >>> I only see the possibility to scan the index of the whole table to collect
> >>> the
> >>> rowKeys that are inside the timerange I'm interested in.
> >>>
> >>> Is there a more elegant way to collect rows within time range X?
> >>> (Unfortunately, the date attribute is not equal to the timestamp that is
> >>> stored by hbase automatically.)
> >>>
> >>> Could/should one maybe leverage some kind of row key caching to accelerate
> >>> the collection process?
> >>> Is that covered by the block cache?
> >>>
> >>> Thanks in advance for any advice.
> >>>
> >>> regards
> >>> Chris
> >>>
> >>
>
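
The fast-forward decision from point 2 of the reply at the top of this thread can be sketched in plain Java, without any HBase dependency. This is only an illustration under stated assumptions, not the implementation mentioned in the reply: the key layout userId_date_sessionId with a 5-character ASCII userId and a yyyy-MM-dd date is taken from the example, and the class and method names (SeekHintSketch, seekHint, nextUserId) are hypothetical. In a real custom filter, a non-null result would presumably be the key handed back to the scanner together with ReturnCode.SEEK_NEXT_USING_HINT.

```java
// Hypothetical sketch: decides where a scanner should jump next for keys of
// the assumed form <5-char userId>_<yyyy-MM-dd>_<sessionId>.
class SeekHintSketch {
    static final int USER_ID_LENGTH = 5;   // assumption taken from the example
    static final int DATE_LENGTH = 10;     // length of "yyyy-MM-dd"
    static final char SEPARATOR = '_';

    // Smallest fixed-length ASCII id that sorts after the given one.
    // The result need not belong to an existing user; a seek lands on the
    // first stored key that is >= the hint.
    static String nextUserId(String userId) {
        char[] chars = userId.toCharArray();
        for (int i = chars.length - 1; i >= 0; i--) {
            if (chars[i] < 0x7F) {
                chars[i]++;
                return new String(chars);
            }
            chars[i] = 0; // carry into the position to the left
        }
        throw new IllegalArgumentException("no user id sorts after " + userId);
    }

    // Returns null when the row lies inside [startDate, endDate] (keep it),
    // otherwise the row key the scanner should fast-forward to.
    // ISO dates compare correctly as plain strings.
    static String seekHint(String rowKey, String startDate, String endDate) {
        String userId = rowKey.substring(0, USER_ID_LENGTH);
        int dateStart = USER_ID_LENGTH + 1;
        String date = rowKey.substring(dateStart, dateStart + DATE_LENGTH);
        if (date.compareTo(startDate) < 0) {
            // Too early: jump to this user's first row inside the range.
            return userId + SEPARATOR + startDate;
        }
        if (date.compareTo(endDate) > 0) {
            // Too late: every remaining row of this user is out of range,
            // so jump to the start of the range for the next possible user.
            return nextUserId(userId) + SEPARATOR + startDate;
        }
        return null;
    }

    public static void main(String[] args) {
        String start = "2012-08-01";
        String end = "2012-08-08";
        // prints "aaaab_2012-08-01"
        System.out.println(seekHint("aaaaa_2012-08-09_3jh345j345kjh", start, end));
        // prints "aaaab_2012-08-01"
        System.out.println(seekHint("aaaab_2012-07-15_x", start, end));
        // prints "null" (row is in range)
        System.out.println(seekHint("aaaab_2012-08-03_x", start, end));
    }
}
```

Working on strings keeps the sketch short; an actual filter would operate on the raw row-key bytes, and the increment-with-carry in nextUserId would then work on bytes instead of ASCII characters.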