hbase-user mailing list archives

From Jack Levin <magn...@gmail.com>
Subject Re: question about merge-join (or AND operator between columns)
Date Sat, 08 Jan 2011 22:26:11 GMT
Sorting is not the issue: the data can be located at the beginning, middle,
or end, or any combination thereof.  I only gave the worst-case example.
I understand that filtering will produce the results we want, but at the
cost of examining every row and offloading the AND/join logic to the
application.
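
To make the cost difference concrete, here is a minimal in-memory sketch in
Python, not HBase client code. It assumes each column family can be read on
its own (modelled as a separate dict), and it appends a hypothetical sequence
suffix to the row keys so that the 1000 text entries are distinct rows. The
function names are made up for illustration. It compares a filter-style scan
with the "lean on the smaller column, then get" idea from the quoted thread.

```python
from typing import Dict, List, Tuple

# Each column family is modelled as its own map of row key -> value.
# This is an assumption of the sketch, not a description of HBase internals.
Family = Dict[str, str]


def build_example() -> Tuple[Family, Family]:
    """Build the worst case: 1000 text-only rows plus one row with a photo."""
    text: Family = {}
    photo: Family = {}
    # Days 00..09, 100 text rows per day (the ":NNN" suffix is hypothetical).
    for day in range(10):
        for n in range(100):
            text[f"{day:02d}days:uid:{n:03d}"] = f"text_{day}_{n}"
    # The single 10-day-old row that carries both a text and a photo.
    text["10days:uid:000"] = "text_10_0"
    photo["10days:uid:000"] = "http://example.invalid/photo.jpg"
    return text, photo


def filter_scan(text: Family, photo: Family) -> Tuple[List[str], int]:
    """Visit every row in key order and keep rows present in both families."""
    examined = 0
    hits: List[str] = []
    for key in sorted(set(text) | set(photo)):
        examined += 1
        if key in text and key in photo:
            hits.append(key)
    return hits, examined


def lean_on_smaller(text: Family, photo: Family) -> Tuple[List[str], int]:
    """Walk only the smaller family and do one point lookup per entry."""
    small, large = (photo, text) if len(photo) <= len(text) else (text, photo)
    lookups = 0
    hits: List[str] = []
    for key in sorted(small):
        lookups += 1
        if key in large:  # stands in for a Get against the larger family
            hits.append(key)
    return hits, lookups


if __name__ == "__main__":
    text, photo = build_example()
    print(filter_scan(text, photo))      # same hit, 1001 rows examined
    print(lean_on_smaller(text, photo))  # same hit, 1 lookup
```

Both functions return the same matching row. The difference is only in how
many rows are touched, which is the inefficiency I am describing.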

-Jack

On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev <octo47@gmail.com> wrote:

> You can read more details on binary sorting here:
>
> http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
>
> 2011/1/8 Jack Levin <magnito@gmail.com>
>
> > Basic problem described:
> >
> > A user uploads 1 image and creates some text 10 days ago, then creates
> > 1000 text messages between 9 days ago and today:
> >
> >
> > row key          | fm:type --> value
> > 00days:uid       | type:text --> text_id
> > .
> > .
> > 09days:uid       | type:text --> text_id
> > 10days:uid       | type:photo --> URL
> >                  | type:text --> text_id
> >
> >
> > Skip all the way to the 10days:uid row, without reading the
> > 00days:uid - 09days:uid rows.  Ideally we do not want to read all 1000
> > entries that have _only_ text.  We want to get to the last entry in the
> > most efficient way possible.
> >
> >
> > -Jack
> >
> >
> >
> >
> > On Sat, Jan 8, 2011 at 11:43 AM, Stack <stack@duboce.net> wrote:
> > > Strike that.  This is a Scan, so can't do blooms + filter.  Sorry.
> > > Sounds like a coprocessor then.  You'd have your query 'lean' on the
> > > column that you know has the lesser items and then per item, you'd do
> > > a get inside the coprocessor against the column of many entries.  The
> > > get would go via blooms.
> > >
> > > St.Ack
> > >
> > >
> > > On Sat, Jan 8, 2011 at 11:39 AM, Stack <stack@duboce.net> wrote:
> > >> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <magnito@gmail.com> wrote:
> > >>> Yes, we thought about using filters, the issue is, if one family
> > >>> column has 1ml values, and second family column has 10 values at the
> > >>> bottom, we would end up scanning and filtering 99990 records and
> > >>> throwing them away, which seems inefficient.
> > >>
> > >> Blooms+filters?
> > >> St.Ack
> > >>
> > >
> >
>
