Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
In-Reply-To: <AANLkTimUNEqotLzi4-ezZMCC+_PR7_7iGaEm2Dg1dcxA@mail.gmail.com>
References: <AANLkTimo9SKbhuvSMfLNx+j2=+xuDdmkPmHT99_b+18n@mail.gmail.com>
	<AANLkTi=xS8BvUvmMoPMdo_vLZwt3uDFjGQMBG5dO0a6P@mail.gmail.com>
	<AANLkTinv1Du8J+x4Jha+cRDakJgfjfxqD0K=A+PkrRjS@mail.gmail.com>
	<AANLkTi=TWS1zXQ+Nhp_3tmqOejuPt=4jvaHLopoqpD+M@mail.gmail.com>
	<AANLkTi=yjP-sUP2yFfpUhFKdGVAfa82FQTKEzkNYeHXb@mail.gmail.com>
	<AANLkTikAtaPcO1OL09g-t1mTh+0SqoSX2++1gRYud38=@mail.gmail.com>
	<AANLkTi=XvLUeA9iNcumd4EbjpGzQhGtr7NnEL8PfjrMG@mail.gmail.com>
	<AANLkTimhJRrKW=6sF0NunFPuMOeoUD-P_WUoEtd9ZLfi@mail.gmail.com>
	<AANLkTimUNEqotLzi4-ezZMCC+_PR7_7iGaEm2Dg1dcxA@mail.gmail.com>
Date: Sat, 8 Jan 2011 14:57:57 -0800
Message-ID: <AANLkTikRNP3XAEc_Gv1y3rHbHQVRmJ_a_vvPpRTY_VJ=@mail.gmail.com>
Subject: Re: question about merge-join (or AND operator between columns)
From: Jack Levin <magnito@gmail.com>
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=0022150482b77767da04995daaf5

--0022150482b77767da04995daaf5
Content-Type: text/plain; charset=ISO-8859-1

In the future we plan to have millions of rows, probably spread across
multiple regions. Even if IO is not a problem, running millions of filter
operations does not make much sense.

-Jack

On Sat, Jan 8, 2011 at 2:54 PM, Andrey Stepachev <octo47@gmail.com> wrote:

> OK, understood.
>
> But have you checked whether it is really an issue? I think it is only
> 1 IO here (especially if compression is used). Do you have big rows?
>
>
>
> 2011/1/9 Jack Levin <magnito@gmail.com>
>
> > Sorting is not the issue. The data can be located at the beginning,
> > middle, or end, or any combination thereof. I only gave the worst-case
> > scenario as an example. I understand that filtering will produce the
> > results we want, but at the cost of examining every row and offloading
> > the AND/join logic to the application.
> >
> > -Jack
> >
> > On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev <octo47@gmail.com> wrote:
> >
> > > You can read more details on binary sorting here:
> > >
> > > http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
> > >
> > > 2011/1/8 Jack Levin <magnito@gmail.com>
> > >
> > > > Basic problem described:
> > > >
> > > > A user uploads 1 image and creates some text 10 days ago, then creates
> > > > 1000 text messages between 9 days ago and today:
> > > >
> > > > row key        | fm:type --> value
> > > > 00days:uid     | type:text --> text_id
> > > > .
> > > > .
> > > > 09days:uid     | type:text --> text_id
> > > > 10days:uid     | type:photo --> URL
> > > >                | type:text --> text_id
> > > >
> > > > We want to skip all the way to the 10days:uid row without reading the
> > > > 00days:uid - 09days:uid rows. Ideally we do not want to read all 1000
> > > > entries that have _only_ text. We want to get to the last entry in the
> > > > most efficient way possible.
> > > >
> > > >
> > > > -Jack
> > > >
> > > >
> > > >
> > > >
> > > > On Sat, Jan 8, 2011 at 11:43 AM, Stack <stack@duboce.net> wrote:
> > > > > Strike that.  This is a Scan, so you can't do blooms + filter.  Sorry.
> > > > > Sounds like a coprocessor then.  You'd have your query 'lean' on the
> > > > > column that you know has the fewer items and then, per item, you'd do
> > > > > a get inside the coprocessor against the column of many entries.  The
> > > > > get would go via blooms.
> > > > >
> > > > > St.Ack
> > > > >
> > > > >
> > > > > On Sat, Jan 8, 2011 at 11:39 AM, Stack <stack@duboce.net> wrote:
> > > > >> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <magnito@gmail.com> wrote:
> > > > >>> Yes, we thought about using filters. The issue is, if one family
> > > > >>> column has 1ml values, and a second family column has 10 values at
> > > > >>> the bottom, we would end up scanning and filtering 99990 records and
> > > > >>> throwing them away, which seems inefficient.
> > > > >>
> > > > >> Blooms+filters?
> > > > >> St.Ack
> > > > >>
> > > > >
> > > >
> > >
> >
>
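
To make the trade-off discussed in this thread concrete, here is a minimal
sketch in plain Python. It is not HBase code and does not use any HBase API.
It assumes the rows from the example above are held in memory, and that a
hypothetical per-type list of sorted row keys stands in for "the column with
fewer items". The names scan_and_filter, lean_on_smallest, and build_example
are invented for illustration. The binary-search probe only plays the role
that a bloom-backed get would play inside a coprocessor.

```python
from bisect import bisect_left


def build_example():
    """Rows modelled on the thread's example: text-only rows, then one photo+text row."""
    rows = [(f"{d:02d}days:uid", {"text": f"text_{d}"}) for d in range(10)]
    rows.append(("10days:uid", {"photo": "URL", "text": "text_10"}))
    # Hypothetical index: type -> sorted list of row keys that hold that type.
    index = {}
    for key, cols in rows:
        for col_type in cols:
            index.setdefault(col_type, []).append(key)
    return rows, index


def scan_and_filter(rows, wanted_types):
    """Filter approach: examine every row and keep those holding all wanted types."""
    examined = 0
    hits = []
    for key, cols in rows:
        examined += 1
        if all(t in cols for t in wanted_types):
            hits.append(key)
    return hits, examined


def lean_on_smallest(index, wanted_types):
    """Drive from the type with the fewest row keys and probe the others per key."""
    ordered = sorted(wanted_types, key=lambda t: len(index.get(t, [])))
    driver, others = ordered[0], ordered[1:]
    probes = 0
    hits = []
    for key in index.get(driver, []):
        matched = True
        for t in others:
            probes += 1
            keys = index.get(t, [])
            pos = bisect_left(keys, key)  # point lookup in a sorted key list
            if pos == len(keys) or keys[pos] != key:
                matched = False
                break
        if matched:
            hits.append(key)
    return hits, probes


if __name__ == "__main__":
    rows, index = build_example()
    print(scan_and_filter(rows, {"photo", "text"}))
    print(lean_on_smallest(index, {"photo", "text"}))
```

Both functions return the same matching row keys. The difference is the work
counter: the filter version touches every row, while the second version does
one probe per entry on the small side, which is the cost profile the
coprocessor suggestion is aiming for.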

--0022150482b77767da04995daaf5--