hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject RE: Secondary Index versus Full Table Scan
Date Wed, 04 Aug 2010 22:10:53 GMT
Also, seek/reseek hooks in the filters will allow skipping of blocks. For some queries
(those returning a high percentage of the total data) this won't matter, but for sparser
filters that want to jump ahead the savings can be significant.

These are being worked on by an intern here and should have some patches up in a couple weeks.
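
A minimal sketch of what such a seek-hint filter could look like, assuming the hooks
surface the way they later did in the public Filter API (FilterBase, getNextKeyHint,
ReturnCode.SEEK_NEXT_USING_HINT); the class name and row-set logic are illustrative,
and the serialization plumbing a real filter needs is omitted:

import java.util.SortedSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;

// Keeps only the rows in a sorted set of wanted keys, and uses the seek
// hint to jump the scanner (and skip blocks) between them.
public class SortedRowSetFilter extends FilterBase {
  // Must be sorted with Bytes.BYTES_COMPARATOR.
  private final SortedSet<byte[]> wantedRows;
  private byte[] nextRow;

  public SortedRowSetFilter(SortedSet<byte[]> wantedRows) {
    this.wantedRows = wantedRows;
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    byte[] row = kv.getRow();
    if (wantedRows.contains(row)) {
      return ReturnCode.INCLUDE;             // a row we want: keep its cells
    }
    SortedSet<byte[]> remaining = wantedRows.tailSet(row);
    if (remaining.isEmpty()) {
      return ReturnCode.NEXT_ROW;            // nothing left to look for
    }
    nextRow = remaining.first();
    return ReturnCode.SEEK_NEXT_USING_HINT;  // ask the scanner to jump ahead
  }

  @Override
  public KeyValue getNextKeyHint(KeyValue currentKV) {
    // Where the region server should reseek to; blocks in between are skipped.
    return KeyValue.createFirstOnRow(nextRow);
  }
}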

> -----Original Message-----
> From: Todd Lipcon [mailto:todd@cloudera.com]
> Sent: Wednesday, August 04, 2010 2:15 PM
> To: user@hbase.apache.org
> Subject: Re: Secondary Index versus Full Table Scan
> 
> On Wed, Aug 4, 2010 at 1:14 PM, Luke Forehand <
> luke.forehand@networkedinsights.com> wrote:
> 
> > Todd Lipcon <todd@...> writes:
> >
> > > The above is true if you assume you can only do one get at a time.
> > > In fact, you can probably pipeline gets, and there's actually a
> > > patch in the works for multiget support - HBASE-1845. I don't think
> > > it's being actively worked on at the moment, though, so you'll have
> > > to do it somewhat manually. I'd recommend using multithreading in
> > > each mapper so that the keys come off the scan into a small thread
> > > pool executor which performs the gets - this should get you some
> > > parallelism. Otherwise you'll find the mappers are mostly spending
> > > time waiting on the network and not doing work.
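
A rough sketch of the thread-pool-of-gets pattern described just above; the table name,
pool size, and class name are placeholders, and each task opens its own HTable because
HTable instances are not safe to share across threads:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class ParallelGetter {
  private final ExecutorService pool = Executors.newFixedThreadPool(8);

  // Keys come off the index scan in the mapper; each get runs in the pool so
  // the mapper isn't blocked waiting on one round trip at a time.
  public List<Result> fetch(List<byte[]> rowKeys) throws Exception {
    List<Future<Result>> futures = new ArrayList<Future<Result>>();
    for (final byte[] key : rowKeys) {
      futures.add(pool.submit(new Callable<Result>() {
        public Result call() throws Exception {
          // One HTable per task keeps this sketch simple; a real mapper would
          // reuse a per-thread HTable or an HTablePool instead of reopening.
          HTable table = new HTable(HBaseConfiguration.create(), "mytable");
          try {
            return table.get(new Get(key));
          } finally {
            table.close();
          }
        }
      }));
    }
    List<Result> results = new ArrayList<Result>();
    for (Future<Result> f : futures) {
      results.add(f.get());
    }
    return results;
  }
}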
> >
> > Excellent!  We will definitely try multithreading the gets.
> >
> > > It highly depends on the selectivity - if you're able to cut out a
> > > very large percentage of the records using your secondary index,
> > > then you'll be saving time for sure. If not, then you've just
> > > turned your sequential IO (read: fast) into random IO (read: slow).
> > > It's better to do a few random IOs than a lot of sequential, but
> > > better to do a lot of sequential than a lot of random, if that
> > > makes any sense.
> >
> > Yes we are coming to terms with this very quickly :-)  It's easier
> > to find the balance now that we're working with some real data...
> >
> > > One thing that no one has raised yet is whether you're using the
> > > Filter API. If you're not already using Filters to apply a server
> > > side predicate, I'd recommend looking into it. This will allow you
> > > to reduce the amount of network traffic between the mappers and
> > > the region servers, and should improve performance noticeably.
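
A minimal sketch of pushing a predicate to the region servers through the Filter API,
as suggested above; the table, family, qualifier, and value are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    Scan scan = new Scan();
    // The predicate is evaluated on the region server, so rows that don't
    // match never cross the network back to the client or mapper.
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("d"),          // column family
        Bytes.toBytes("status"),     // qualifier
        CompareOp.EQUAL,
        Bytes.toBytes("ACTIVE"));
    filter.setFilterIfMissing(true); // also drop rows missing the column
    scan.setFilter(filter);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process matching rows only
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}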
> >
> > We are using the Filter API but our mappers are local to the region
> > servers so we didn't notice much of an improvement.  Does that make
> > sense?
> >
> Yep, certainly - you avoid a few extra copies between processes by
> using filters, but probably not a huge difference there in that case.
> 
> What's really missing is the pushdown of the filters all the way to the
> storage layer - this is where something like bitmap indexes will likely
> help
> - we'll be able to avoid reading the data off disk when it doesn't
> match,
> and thus query time will be closer to linear with the number of matches
> instead of linear with the amount of data.
> 
> -Todd
> 
> --
> Todd Lipcon
> Software Engineer, Cloudera
