hbase-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Secondary Index versus Full Table Scan
Date Wed, 04 Aug 2010 21:14:53 GMT
On Wed, Aug 4, 2010 at 1:14 PM, Luke Forehand <luke.forehand@networkedinsights.com> wrote:

> Todd Lipcon <todd@...> writes:
>
> > The above is true if you assume you can only do one get at a time. In fact,
> > you can probably pipeline gets, and there's actually a patch in the works
> > for multiget support - HBASE-1845. I don't think it's being actively worked
> > on at the moment, though, so you'll have to do it somewhat manually. I'd
> > recommend using multithreading in each mapper so that the keys come off the
> > scan into a small thread pool executor which performs the gets - this should
> > get you some parallelism. Otherwise you'll find the mappers are mostly
> > spending time waiting on the network and not doing work.
>
> Excellent!  We will definitely try multithreading the gets.
>
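
A minimal sketch of the thread-pool approach described above, in plain Java. The lookup function is a hypothetical stand-in for a blocking single-row get against the table, and the class and method names are illustrative only; none of them are part of HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelGets {
    // Runs one blocking lookup per key on a small fixed-size pool and
    // returns the results in the same order as the input keys.
    // 'lookup' is an assumption: it stands in for a single-row get.
    static <K, V> List<V> fetchAll(List<K> keys, Function<K, V> lookup, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<V>> pending = new ArrayList<>();
            for (K key : keys) {
                Callable<V> task = () -> lookup.apply(key);
                pending.add(pool.submit(task));
            }
            List<V> results = new ArrayList<>(pending.size());
            for (Future<V> future : pending) {
                results.add(future.get()); // blocks until that lookup finishes
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for gets", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a get failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Toy lookup: upper-cases the key instead of talking to a region server.
        List<String> rows = fetchAll(Arrays.asList("a", "b", "c"), String::toUpperCase, 2);
        System.out.println(rows);
    }
}
```

In a mapper, the keys coming off the index scan would be batched and handed to something like fetchAll, so that several gets are in flight while the mapper would otherwise be waiting on the network. The pool size is a tuning knob, not a recommendation.
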
> > It highly depends on the selectivity - if you're able to cut out a very
> > large percentage of the records using your secondary index, then you'll be
> > saving time for sure. If not, then you've just turned your sequential IO
> > (read: fast) into random IO (read: slow). It's better to do a few random IOs
> > than a lot of sequential, but better to do a lot of sequential than a lot of
> > random, if that makes any sense.
>
> Yes we are coming to terms with this very quickly :-)  It's easier to find the
> balance now that we're working with some real data...
>
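
The selectivity trade-off above can be put into rough numbers. The sketch below assumes a simple cost model that does not come from this thread: a full scan pays a fixed cost per row read sequentially, an index-driven read pays a fixed (much larger) cost per matching row fetched by random get, and the cost of scanning the index itself is ignored. The per-row costs in main are made-up placeholders, not measurements.

```java
final class ScanVersusIndex {
    // Estimated time to read every row sequentially.
    static double fullScanMillis(long rows, double seqMillisPerRow) {
        return rows * seqMillisPerRow;
    }

    // Estimated time to fetch only the matching fraction of rows by random get.
    static double indexedMillis(long rows, double selectivity, double randomMillisPerGet) {
        return rows * selectivity * randomMillisPerGet;
    }

    // Fraction of matching rows at which both strategies cost the same.
    // Below this fraction the index wins; above it the full scan wins.
    static double breakEvenSelectivity(double seqMillisPerRow, double randomMillisPerGet) {
        return seqMillisPerRow / randomMillisPerGet;
    }

    public static void main(String[] args) {
        double seq = 0.01;  // placeholder: ms per sequentially scanned row
        double rand = 1.0;  // placeholder: ms per random get
        long rows = 1_000_000L;
        System.out.println("break-even selectivity: " + breakEvenSelectivity(seq, rand));
        System.out.println("full scan ms:           " + fullScanMillis(rows, seq));
        System.out.println("index ms at 0.1% match: " + indexedMillis(rows, 0.001, rand));
        System.out.println("index ms at 50% match:  " + indexedMillis(rows, 0.5, rand));
    }
}
```

Under this model the break-even point is simply the ratio of sequential to random per-row cost, which is why the index only pays off when it cuts out a very large share of the records.
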
> > One thing that no one has raised yet is whether you're using the Filter API.
> > If you're not already using Filters to apply a server side predicate, I'd
> > recommend looking into it. This will allow you to reduce the amount of
> > network traffic between the mappers and the region servers, and should
> > improve performance noticeably.
>
> We are using the Filter API but our mappers are local to the region servers so
> we didn't notice much of an improvement.  Does that make sense?
>
Yep, certainly - you avoid a few extra copies between processes by using
filters, but probably not a huge difference there in that case.

What's really missing is the pushdown of the filters all the way to the
storage layer - this is where something like bitmap indexes will likely help
- we'll be able to avoid reading the data off disk when it doesn't match,
and thus query time will be closer to linear with the number of matches
instead of linear with the amount of data.
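
As a rough illustration of why a bitmap index would make query time track the number of matches, here is a toy in-memory sketch using java.util.BitSet. It is not how HBase stores data and not a proposed design; the class name, the rowsRead counter and the in-memory list standing in for disk are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BitmapIndexSketch {
    private final List<String> rows = new ArrayList<>();          // stand-in for rows on disk
    private final Map<String, BitSet> bitmaps = new HashMap<>();  // one bitmap per indexed value
    int rowsRead = 0;                                              // counts simulated row reads

    // Stores a row and marks its position in the bitmap for its indexed value.
    void add(String row, String indexedValue) {
        int position = rows.size();
        rows.add(row);
        bitmaps.computeIfAbsent(indexedValue, v -> new BitSet()).set(position);
    }

    // Visits only the positions whose bit is set, so the number of row
    // reads equals the number of matches, not the size of the table.
    List<String> query(String indexedValue) {
        List<String> matches = new ArrayList<>();
        BitSet hits = bitmaps.get(indexedValue);
        if (hits == null) {
            return matches;
        }
        for (int i = hits.nextSetBit(0); i >= 0; i = hits.nextSetBit(i + 1)) {
            rowsRead++;
            matches.add(rows.get(i));
        }
        return matches;
    }

    public static void main(String[] args) {
        BitmapIndexSketch table = new BitmapIndexSketch();
        table.add("row1", "red");
        table.add("row2", "blue");
        table.add("row3", "red");
        System.out.println(table.query("red") + " after " + table.rowsRead + " row reads");
    }
}
```
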

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera
