Subject: Re: How to query by rowKey-infix
From: Christian Schäfer
To: user@hbase.apache.org
Date: Fri, 3 Aug 2012 10:34:59 +0100 (BST)

Hi Matt,

sure, I have this in mind as a last option (at least on a limited subset of the data).

Due to our estimate of some billions of rows a week, the selective filtering needs to happen on the server side.

But I agree that one could do the fine-grained filtering on the client side on a handy data subset, to avoid making the HBase schema and the (coprocessor-based) indexing too complicated.

regards
Chris


----- Original Message -----
From: Matt Corgan
To: user@hbase.apache.org
CC:
Sent: 3:29, Friday, 3 August 2012
Subject: Re: How to query by rowKey-infix

Yeah - just thought I'd point it out, since people often have small tables in their cluster alongside the big ones, and when generating reports you sometimes don't care whether it finishes in 10 minutes or an hour.


On Thu, Aug 2, 2012 at 6:15 PM, Alex Baranau wrote:

> I think this is exactly what Christian is trying to (and should be trying to) avoid ;).
>
> I can't imagine a use case where you need to filter something and can do it with a server-side filter (at least), and yet would want to do it on the client side... Doing filtering on the client side when you can do it on the server side just feels wrong, especially given that there's a lot of data in HBase (otherwise, why would you use it).
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> On Thu, Aug 2, 2012 at 7:09 PM, Matt Corgan wrote:
>
> > Also, Christian, don't forget that you can read all the rows back to the client and do the filtering there using whatever logic you like.  HBase Filters can be thought of as an optimization (predicate push-down) over client-side filtering.  Pulling all the rows over the network will be slower, but I don't think we know enough about your data or speed requirements to rule it out.
> >
> >
> > On Thu, Aug 2, 2012 at 3:57 PM, Alex Baranau wrote:
> >
> > > Hi Christian!
> > >
> > > If we put off secondary indexes and assume you are going with "heavy scans", you can try the following two things to make it much faster - if they are appropriate to your situation, of course.
> > >
> > > 1.
> > >
> > > > Is there a more elegant way to collect rows within time range X?
> > > > (Unfortunately, the date attribute is not equal to the timestamp that is stored by HBase automatically.)
> > >
> > > Can you set the timestamp of the Puts to the one you have in the row key, instead of relying on the one HBase assigns automatically (the current time)? If you can, this will improve reading speed a lot by setting a time range on the scanner - depending on how you write your data, of course, but I assume you mostly write it in a "time-increasing" manner.
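A minimal sketch of this write-time idea, assuming the HBase Java client API of that generation (HTable, Put.add with an explicit timestamp, Scan.setTimeRange); the class name, column family "d" and qualifier "payload" are made-up names for illustration:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimestampedEvents {

        // Write the cell with the application date as its timestamp instead of
        // letting the region server stamp it with the current time.
        static void writeEvent(HTable table, String userId, long dateInMillis,
                               String sessionId, byte[] value) throws IOException {
            Put put = new Put(Bytes.toBytes(userId + "_" + dateInMillis + "_" + sessionId));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), dateInMillis, value);
            table.put(put);
        }

        // A scan bounded by the same time range lets HBase skip store files whose
        // timestamp range lies entirely outside [startMillis, endMillis).
        static Scan timeRangeScan(long startMillis, long endMillis) throws IOException {
            Scan scan = new Scan();
            scan.setTimeRange(startMillis, endMillis);
            return scan;
        }
    }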
> > > 2.
> > >
> > > If your userId has a fixed length, or you can change it so that it has a fixed length, then you can actually use something like a "wildcard" in the row key. There's a way for a Filter implementation to fast-forward to the record with a specific row key and, by doing this, skip many records. This might be used as follows:
> > > * suppose your userId is 5 characters in length
> > > * suppose you are scanning for records with time between 2012-08-01 and 2012-08-08
> > > * when you are scanning records and you encounter e.g. key "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell the scanner from your filter to fast-forward to key "aaaab_2012-08-01", because you know that all remaining records of user "aaaaa" don't fall into the interval you need (the time of its records will be >= 2012-08-09).
> > >
> > > As of now, I believe you will have to implement a custom filter to do that.
> > > Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> > > I believe I implemented a similar thing some time ago. If this idea works for you, I could look for the implementation and share it if it helps, or maybe even simply add it to the HBase codebase.
> > >
> > > Hope this helps,
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
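A rough sketch of such a fast-forwarding filter against the filter API of that era (FilterBase, KeyValue, Writable serialization). The class name, the fixed userId length of 5, and the "yyyy-MM-dd" date format are assumptions taken from Alex's example above, and the handling of keys before the range start is simplified:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.filter.FilterBase;
    import org.apache.hadoop.hbase.util.Bytes;

    // Skips the rest of a user's rows once the date infix has moved past the end
    // of the wanted range, by seeking straight to the next userId at the start of
    // the range. Assumes row keys of the form userId (fixed 5 chars) + "_" +
    // "yyyy-MM-dd" + "_" + sessionId.
    public class DateRangeSkipFilter extends FilterBase {

        private static final int USER_ID_LENGTH = 5;

        private byte[] startDate;   // e.g. Bytes.toBytes("2012-08-01")
        private byte[] endDate;     // e.g. Bytes.toBytes("2012-08-08")
        private byte[] nextRowHint;

        public DateRangeSkipFilter() {
            // required for Writable deserialization on the region server
        }

        public DateRangeSkipFilter(byte[] startDate, byte[] endDate) {
            this.startDate = startDate;
            this.endDate = endDate;
        }

        @Override
        public ReturnCode filterKeyValue(KeyValue kv) {
            byte[] row = kv.getRow();
            byte[] date = new byte[endDate.length];
            System.arraycopy(row, USER_ID_LENGTH + 1, date, 0, date.length);

            if (Bytes.compareTo(date, startDate) >= 0 && Bytes.compareTo(date, endDate) <= 0) {
                return ReturnCode.INCLUDE;              // date is inside the range
            }
            if (Bytes.compareTo(date, endDate) > 0) {
                // All further rows of this user are even later: jump to the next
                // userId at the start of the range ("aaaaa..." -> "aaaab_2012-08-01").
                byte[] nextUser = new byte[USER_ID_LENGTH];
                System.arraycopy(row, 0, nextUser, 0, USER_ID_LENGTH);
                nextUser[USER_ID_LENGTH - 1]++;         // naive increment, sketch only
                nextRowHint = Bytes.add(nextUser, Bytes.toBytes("_"), startDate);
                return ReturnCode.SEEK_NEXT_USING_HINT;
            }
            return ReturnCode.SKIP;                     // date before the range: skip this cell
        }

        @Override
        public KeyValue getNextKeyHint(KeyValue currentKV) {
            return KeyValue.createFirstOnRow(nextRowHint);
        }

        // The filter runs on the region servers, so it must be serializable
        // (Writable in this HBase generation) and its jar deployed there.
        @Override
        public void write(DataOutput out) throws IOException {
            Bytes.writeByteArray(out, startDate);
            Bytes.writeByteArray(out, endDate);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            startDate = Bytes.readByteArray(in);
            endDate = Bytes.readByteArray(in);
        }
    }

Note that the jar containing such a filter has to be deployed to every region server, which is exactly the deployment overhead Christian mentions wanting to avoid further down in the thread.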
> > >
> > > On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <syrious3000@yahoo.de> wrote:
> > >
> > > > Excuse my double posting.
> > > > Here is the complete mail:
> > > >
> > > > OK,
> > > >
> > > > at first I will try the scans.
> > > >
> > > > If that's too slow, I will have to upgrade HBase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
> > > >
> > > > Currently I'm stuck at the scans because they require two steps (therefore maybe some kind of filter chaining is required).
> > > >
> > > > The key:  userId-dateInMillis-sessionId
> > > >
> > > > First I need to extract dateInMillis with a regex or substring (using special delimiters for the date).
> > > >
> > > > Second, the extracted value must be parsed to a Long and set on a RowFilter comparator like this:
> > > >
> > > > scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes((Long) dateInMillis))));
> > > >
> > > > How to chain that?
> > > > Do I have to write a custom filter?
> > > > (I would like to avoid that because of deployment.)
> > > >
> > > > regards
> > > > Chris
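For the literal "how to chain that" part: several filters are combined with a FilterList. A sketch, assuming the same era of the client API; note, though, that a RowFilter with a BinaryComparator compares the entire row key (which starts with userId), so on its own this cannot bound an infix date - which is why the thread keeps coming back to a custom seek filter or a secondary index:

    import java.util.Arrays;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.Filter;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.RowFilter;

    public class ChainedFilters {

        // ANDs a lower and an upper bound on the row key. This only works as
        // intended when the compared portion is a key prefix; with userId first,
        // an infix date needs a different comparator or a custom filter instead.
        static Scan boundedScan(byte[] lowerKey, byte[] upperKey) {
            Filter atLeast = new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(lowerKey));
            Filter atMost  = new RowFilter(CompareOp.LESS_OR_EQUAL, new BinaryComparator(upperKey));
            Filter chained = new FilterList(FilterList.Operator.MUST_PASS_ALL,
                    Arrays.asList(atLeast, atMost));
            Scan scan = new Scan();
            scan.setFilter(chained);
            return scan;
        }
    }

If the date portion were stored as a formatted string, a RowFilter with a RegexStringComparator could match the infix instead; either way the server still walks every row, so only the fast-forwarding filter or the time-range trick above actually reduces the work.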
> > > >
> > > > ----- Original Message -----
> > > > From: Michael Segel
> > > > To: user@hbase.apache.org
> > > > CC:
> > > > Sent: 13:52, Wednesday, 1 August 2012
> > > > Subject: Re: How to query by rowKey-infix
> > > >
> > > > Actually, with coprocessors you can create a secondary index in short order. Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more expensive.
> > > >
> > > > On Jul 31, 2012, at 12:41 PM, Matt Corgan wrote:
> > > >
> > > > > When deciding between a table scan and a secondary index, you should try to estimate what percentage of the underlying data blocks will be used in the query.  By default, each block is 64KB.
> > > > >
> > > > > If each user's data is small and you are fitting multiple users per block, then you're going to need all the blocks anyway, so a table scan is better because it's simpler.  If each user has 1MB+ of data, then you will want to pick out the individual blocks relevant to each date.  The secondary index will help you go directly to those sparse blocks, but at a cost in complexity, consistency, and extra denormalized data that knocks primary data out of your block cache.
> > > > >
> > > > > If latency is not a concern, I would start with the table scan.  If that's too slow, you add the secondary index, and if you still need it faster you do the primary key lookups in parallel, as Jerry mentions.
> > > > >
> > > > > Matt
> > > > >
> > > > > On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam wrote:
> > > > >
> > > > >> Hi Chris:
> > > > >>
> > > > >> I'm thinking about building a secondary index for primary key lookup, then querying using the primary keys in parallel.
> > > > >>
> > > > >> I'm interested to see if there are other options too.
> > > > >>
> > > > >> Best Regards,
> > > > >>
> > > > >> Jerry
> > > > >>
> > > > >> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de> wrote:
> > > > >>
> > > > >>> Hello there,
> > > > >>>
> > > > >>> I designed a row key for queries that need the best performance (~100 ms), which looks like this:
> > > > >>>
> > > > >>> userId-date-sessionId
> > > > >>>
> > > > >>> These queries (scans) are always based on a userId and sometimes additionally on a date, too. That's no problem with the key above.
> > > > >>>
> > > > >>> However, another kind of query shall be based on a given time range, where the leftmost userId is not given or known. In this case I need to get all rows covering the given time range with their date, to create a daily report.
> > > > >>>
> > > > >>> As I can't put wildcards at the beginning of a left-based index for the scan, I only see the possibility of scanning the whole table's index to collect the row keys that are inside the time range I'm interested in.
> > > > >>>
> > > > >>> Is there a more elegant way to collect rows within time range X? (Unfortunately, the date attribute is not equal to the timestamp that is stored by HBase automatically.)
> > > > >>>
> > > > >>> Could/should one maybe leverage some kind of row key caching to accelerate the collection process? Is that covered by the block cache?
> > > > >>>
> > > > >>> Thanks in advance for any advice.
> > > > >>>
> > > > >>> regards
> > > > >>> Chris
> > > > >>>
> > > > >>
> > > >
> > >
> > > --
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > >
> >
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
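For the query pattern the original key design does support (all rows of one user, optionally narrowed to one day), the scan is a plain prefix scan with a start and stop row. A sketch under the same API assumptions; the "_" delimiter, the formatted date, and the stop-key construction are simplifications for illustration:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UserScans {

        // All rows of one user for one day, relying on the userId-date-sessionId
        // key order. Assumes the date is a formatted string such as "2012-08-03".
        static Scan userDayScan(String userId, String day) {
            byte[] startRow = Bytes.toBytes(userId + "_" + day + "_");
            // Stop row: first key after the prefix. Appending 0xFF is a shortcut
            // that works for ASCII sessionIds; production code should compute the
            // "prefix plus one" key properly.
            byte[] stopRow = Bytes.add(startRow, new byte[] { (byte) 0xFF });
            return new Scan(startRow, stopRow);
        }
    }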