Subject: Re: How to query by rowKey-infix
From: Christian Schäfer
To: user@hbase.apache.org
Date: Fri, 3 Aug 2012 10:34:59 +0100 (BST)

Hi Matt,

sure, I have this in mind as a last option (at least on a limited subset of the data).

Due to our estimate of some billions of rows a week, the selective filtering needs to happen on the server side.

But I agree that one could do the fine-grained filtering on the client side on a handy data subset, to avoid making the HBase schema and the (coprocessor-based) indexing too complicated.

regards
Chris


----- Original Message -----
From: Matt Corgan
To: user@hbase.apache.org
CC:
Sent: 3:29, Friday, 3 August 2012
Subject: Re: How to query by rowKey-infix

Yeah - just thought I'd point it out, since people often have small tables in their cluster alongside the big ones, and when generating reports you sometimes don't care whether it finishes in 10 minutes or an hour.


On Thu, Aug 2, 2012 at 6:15 PM, Alex Baranau wrote:

> I think this is exactly what Christian is trying to (and should be trying to) avoid ;).
>
> I can't imagine a use case where you need to filter something and can do it with a server-side filter (at least), and yet would want to do it on the client side... Doing filtering on the client side when you can do it on the server side just feels wrong, especially given that there's a lot of data in HBase (otherwise, why would you use it).
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> On Thu, Aug 2, 2012 at 7:09 PM, Matt Corgan wrote:
>
> > Also, Christian, don't forget that you can read all the rows back to the client and do the filtering there using whatever logic you like.  HBase Filters can be thought of as an optimization (predicate push-down) over client-side filtering.  Pulling all the rows over the network will be slower, but I don't think we know enough about your data or speed requirements to rule it out.
> >
> >
> > On Thu, Aug 2, 2012 at 3:57 PM, Alex Baranau wrote:
> >
> > > Hi Christian!
> > >
> > > If we put off secondary indexes and assume you are going with "heavy scans", you can try the following two things to make it much faster - if they are appropriate to your situation, of course.
> > >
> > > 1.
> > >
> > > > Is there a more elegant way to collect rows within time range X?
> > > > (Unfortunately, the date attribute is not equal to the timestamp that is stored by HBase automatically.)
> > >
> > > Can you set the timestamp of the Puts to the one you have in the row key, instead of relying on the one HBase assigns automatically (the current time)? If you can, this will improve reading speed a lot by setting a time range on the scanner - depending on how you write your data, of course, but I assume you mostly write it in a "time-increasing" manner.
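A minimal sketch of this write-time idea, assuming the HBase Java client API of that generation (HTable, Put.add with an explicit timestamp, Scan.setTimeRange); the class name, column family "d" and qualifier "payload" are made-up names for illustration:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimestampedEvents {

        // Write the cell with the application date as its timestamp instead of
        // letting the region server stamp it with the current time.
        static void writeEvent(HTable table, String userId, long dateInMillis,
                               String sessionId, byte[] value) throws IOException {
            Put put = new Put(Bytes.toBytes(userId + "_" + dateInMillis + "_" + sessionId));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), dateInMillis, value);
            table.put(put);
        }

        // A scan bounded by the same time range lets HBase skip store files whose
        // timestamp range lies entirely outside [startMillis, endMillis).
        static Scan timeRangeScan(long startMillis, long endMillis) throws IOException {
            Scan scan = new Scan();
            scan.setTimeRange(startMillis, endMillis);
            return scan;
        }
    }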
> > > 2.
> > >
> > > If your userId has a fixed length, or you can change it so that it has a fixed length, then you can actually use something like a "wildcard" in the row key. There's a way for a Filter implementation to fast-forward to the record with a specific row key and, by doing this, skip many records. This might be used as follows:
> > > * suppose your userId is 5 characters in length
> > > * suppose you are scanning for records with time between 2012-08-01 and 2012-08-08
> > > * when you are scanning records and you encounter e.g. key "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell the scanner from your filter to fast-forward to key "aaaab_2012-08-01", because you know that all remaining records of user "aaaaa" don't fall into the interval you need (the time of its records will be >= 2012-08-09).
> > >
> > > As of now, I believe you will have to implement a custom filter to do that.
> > > Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> > > I believe I implemented a similar thing some time ago. If this idea works for you, I could look for the implementation and share it if it helps, or maybe even simply add it to the HBase codebase.
> > >
> > > Hope this helps,
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
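A rough sketch of such a fast-forwarding filter against the filter API of that era (FilterBase, KeyValue, Writable serialization). The class name, the fixed userId length of 5, and the "yyyy-MM-dd" date format are assumptions taken from Alex's example above, and the handling of keys before the range start is simplified:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.filter.FilterBase;
    import org.apache.hadoop.hbase.util.Bytes;

    // Skips the rest of a user's rows once the date infix has moved past the end
    // of the wanted range, by seeking straight to the next userId at the start of
    // the range. Assumes row keys of the form userId (fixed 5 chars) + "_" +
    // "yyyy-MM-dd" + "_" + sessionId.
    public class DateRangeSkipFilter extends FilterBase {

        private static final int USER_ID_LENGTH = 5;

        private byte[] startDate;   // e.g. Bytes.toBytes("2012-08-01")
        private byte[] endDate;     // e.g. Bytes.toBytes("2012-08-08")
        private byte[] nextRowHint;

        public DateRangeSkipFilter() {
            // required for Writable deserialization on the region server
        }

        public DateRangeSkipFilter(byte[] startDate, byte[] endDate) {
            this.startDate = startDate;
            this.endDate = endDate;
        }

        @Override
        public ReturnCode filterKeyValue(KeyValue kv) {
            byte[] row = kv.getRow();
            byte[] date = new byte[endDate.length];
            System.arraycopy(row, USER_ID_LENGTH + 1, date, 0, date.length);

            if (Bytes.compareTo(date, startDate) >= 0 && Bytes.compareTo(date, endDate) <= 0) {
                return ReturnCode.INCLUDE;              // date is inside the range
            }
            if (Bytes.compareTo(date, endDate) > 0) {
                // All further rows of this user are even later: jump to the next
                // userId at the start of the range ("aaaaa..." -> "aaaab_2012-08-01").
                byte[] nextUser = new byte[USER_ID_LENGTH];
                System.arraycopy(row, 0, nextUser, 0, USER_ID_LENGTH);
                nextUser[USER_ID_LENGTH - 1]++;         // naive increment, sketch only
                nextRowHint = Bytes.add(nextUser, Bytes.toBytes("_"), startDate);
                return ReturnCode.SEEK_NEXT_USING_HINT;
            }
            return ReturnCode.SKIP;                     // date before the range: skip this cell
        }

        @Override
        public KeyValue getNextKeyHint(KeyValue currentKV) {
            return KeyValue.createFirstOnRow(nextRowHint);
        }

        // The filter runs on the region servers, so it must be serializable
        // (Writable in this HBase generation) and its jar deployed there.
        @Override
        public void write(DataOutput out) throws IOException {
            Bytes.writeByteArray(out, startDate);
            Bytes.writeByteArray(out, endDate);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            startDate = Bytes.readByteArray(in);
            endDate = Bytes.readByteArray(in);
        }
    }

Note that the jar containing such a filter has to be deployed to every region server, which is exactly the deployment overhead Christian mentions wanting to avoid further down in the thread.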
> > >
> > > On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <syrious3000@yahoo.de> wrote:
> > >
> > > > Excuse my double posting.
> > > > Here is the complete mail:
> > > >
> > > > OK,
> > > >
> > > > at first I will try the scans.
> > > >
> > > > If that's too slow, I will have to upgrade HBase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
> > > >
> > > > Currently I'm stuck at the scans because they require two steps (therefore maybe some kind of filter chaining is required).
> > > >
> > > > The key:  userId-dateInMillis-sessionId
> > > >
> > > > First I need to extract dateInMillis with a regex or substring (using special delimiters for the date).
> > > >
> > > > Second, the extracted value must be parsed to a Long and set on a RowFilter comparator like this:
> > > >
> > > > scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes((Long) dateInMillis))));
> > > >
> > > > How to chain that?
> > > > Do I have to write a custom filter?
> > > > (I would like to avoid that because of deployment.)
> > > >
> > > > regards
> > > > Chris
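For the literal "how to chain that" part: several filters are combined with a FilterList. A sketch, assuming the same era of the client API; note, though, that a RowFilter with a BinaryComparator compares the entire row key (which starts with userId), so on its own this cannot bound an infix date - which is why the thread keeps coming back to a custom seek filter or a secondary index:

    import java.util.Arrays;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.Filter;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.RowFilter;

    public class ChainedFilters {

        // ANDs a lower and an upper bound on the row key. This only works as
        // intended when the compared portion is a key prefix; with userId first,
        // an infix date needs a different comparator or a custom filter instead.
        static Scan boundedScan(byte[] lowerKey, byte[] upperKey) {
            Filter atLeast = new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(lowerKey));
            Filter atMost  = new RowFilter(CompareOp.LESS_OR_EQUAL, new BinaryComparator(upperKey));
            Filter chained = new FilterList(FilterList.Operator.MUST_PASS_ALL,
                    Arrays.asList(atLeast, atMost));
            Scan scan = new Scan();
            scan.setFilter(chained);
            return scan;
        }
    }

If the date portion were stored as a formatted string, a RowFilter with a RegexStringComparator could match the infix instead; either way the server still walks every row, so only the fast-forwarding filter or the time-range trick above actually reduces the work.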
> > > >
> > > > ----- Original Message -----
> > > > From: Michael Segel
> > > > To: user@hbase.apache.org
> > > > CC:
> > > > Sent: 13:52, Wednesday, 1 August 2012
> > > > Subject: Re: How to query by rowKey-infix
> > > >
> > > > Actually, with coprocessors you can create a secondary index in short order. Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more expensive.
> > > >
> > > > On Jul 31, 2012, at 12:41 PM, Matt Corgan wrote:
> > > >
> > > > > When deciding between a table scan and a secondary index, you should try to estimate what percentage of the underlying data blocks will be used in the query.  By default, each block is 64KB.
> > > > >
> > > > > If each user's data is small and you are fitting multiple users per block, then you're going to need all the blocks anyway, so a table scan is better because it's simpler.  If each user has 1MB+ of data, then you will want to pick out the individual blocks relevant to each date.  The secondary index will help you go directly to those sparse blocks, but at a cost in complexity, consistency, and extra denormalized data that knocks primary data out of your block cache.
> > > > >
> > > > > If latency is not a concern, I would start with the table scan.  If that's too slow, you add the secondary index, and if you still need it faster you do the primary key lookups in parallel, as Jerry mentions.
> > > > >
> > > > > Matt
> > > > >
> > > > > On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam wrote:
> > > > >
> > > > >> Hi Chris:
> > > > >>
> > > > >> I'm thinking about building a secondary index for primary key lookup, then querying using the primary keys in parallel.
> > > > >>
> > > > >> I'm interested to see if there are other options too.
> > > > >>
> > > > >> Best Regards,
> > > > >>
> > > > >> Jerry
> > > > >>
> > > > >> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de> wrote:
> > > > >>
> > > > >>> Hello there,
> > > > >>>
> > > > >>> I designed a row key for queries that need the best performance (~100 ms), which looks like this:
> > > > >>>
> > > > >>> userId-date-sessionId
> > > > >>>
> > > > >>> These queries (scans) are always based on a userId and sometimes additionally on a date, too. That's no problem with the key above.
> > > > >>>
> > > > >>> However, another kind of query shall be based on a given time range, where the leftmost userId is not given or known. In this case I need to get all rows covering the given time range with their date, to create a daily report.
> > > > >>>
> > > > >>> As I can't put wildcards at the beginning of a left-based index for the scan, I only see the possibility of scanning the whole table's index to collect the row keys that are inside the time range I'm interested in.
> > > > >>>
> > > > >>> Is there a more elegant way to collect rows within time range X? (Unfortunately, the date attribute is not equal to the timestamp that is stored by HBase automatically.)
> > > > >>>
> > > > >>> Could/should one maybe leverage some kind of row key caching to accelerate the collection process? Is that covered by the block cache?
> > > > >>>
> > > > >>> Thanks in advance for any advice.
> > > > >>>
> > > > >>> regards
> > > > >>> Chris
> > > > >>>
> > > > >>
> > > >
> > >
> > > --
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > >
> >
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
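For the query pattern the original key design does support (all rows of one user, optionally narrowed to one day), the scan is a plain prefix scan with a start and stop row. A sketch under the same API assumptions; the "_" delimiter, the formatted date, and the stop-key construction are simplifications for illustration:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UserScans {

        // All rows of one user for one day, relying on the userId-date-sessionId
        // key order. Assumes the date is a formatted string such as "2012-08-03".
        static Scan userDayScan(String userId, String day) {
            byte[] startRow = Bytes.toBytes(userId + "_" + day + "_");
            // Stop row: first key after the prefix. Appending 0xFF is a shortcut
            // that works for ASCII sessionIds; production code should compute the
            // "prefix plus one" key properly.
            byte[] stopRow = Bytes.add(startRow, new byte[] { (byte) 0xFF });
            return new Scan(startRow, stopRow);
        }
    }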