hbase-user mailing list archives

From Kristoffer Sjögren <sto...@gmail.com>
Subject Re: Timerange scan
Date Mon, 02 Mar 2015 21:03:25 GMT
Wow, thanks again for the deep analysis. I may have to reconsider my
initial design then. I've always wanted to understand more about
HBase internals and this may be a good place to start digging.

Cheers,
-Kristoffer


On Mon, Mar 2, 2015 at 6:24 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:

> Sorry Kristoffer, but I believe my previous statement was mistaken. I
> cannot find a location where the timestamp is taken into account at the
> StoreFile level. I thought the above statement about metadata from the
> HFile headers was correct, but I cannot locate the code that takes such
> information into consideration. You can start at
> org.apache.hadoop.hbase.regionserver.StoreScanner and work your way down;
> or from the o.a.h.h.regionserver.StoreFileManager implementation (currently
> there exist two: DefaultStoreFileManager and StripeStoreFileManager) and
> work your way back. The closest thing we (may) have is the
> StripeStoreFileManager implementation, which creates "mini regions" within
> a single region. Even there, the stripes are arranged by row key (i.e.,
> Scan#getStartRow()), not by key-value key.
>
> I think we have no optimizations at the HFile level for the timestamp
> limits of a query. Which means, to answer your original question, absent a
> start row and end row on your scanner, you will be consuming the entire
> table. A long way of explaining, but HBase does not index by cell version
> (it is ordered at the end of a key-value's key, but not indexed), so if you
> want to model time in your schema, it's best to promote it to an indexed
> field -- i.e., make it a component of your row key.
>
> -n
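
A minimal sketch of the "make time a component of your row key" suggestion above, using only the JDK. The key layout (a UTF-8 prefix followed by an 8-byte big-endian timestamp), the class name TimeRowKey, the helpers rowKey and inRange, and the "sensor42" prefix are all hypothetical choices for illustration, not something from this thread. It relies on the fact that row keys are compared as unsigned bytes, so non-negative big-endian timestamps sort in time order and a time window maps onto a start row (inclusive) and stop row (exclusive).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TimeRowKey {
    // Hypothetical layout: [prefix bytes][8-byte big-endian timestamp].
    public static byte[] rowKey(String prefix, long timestampMillis) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(p.length + Long.BYTES)
                .put(p)
                .putLong(timestampMillis) // ByteBuffer is big-endian by default
                .array();
    }

    // True if key falls in [start, stop) under unsigned lexicographic order.
    // Arrays.compareUnsigned requires Java 9 or later.
    public static boolean inRange(byte[] key, byte[] start, byte[] stop) {
        return Arrays.compareUnsigned(start, key) <= 0
                && Arrays.compareUnsigned(key, stop) < 0;
    }

    public static void main(String[] args) {
        // These two keys would play the role of a scan's start and stop rows.
        byte[] start = rowKey("sensor42", 1425250000000L);
        byte[] stop = rowKey("sensor42", 1425330000000L);

        byte[] inside = rowKey("sensor42", 1425300000000L);
        byte[] outside = rowKey("sensor42", 1425400000000L);

        System.out.println(inRange(inside, start, stop));  // prints true
        System.out.println(inRange(outside, start, stop)); // prints false
    }
}
```

With a layout like this, a bounded scan only has to touch rows between the two keys instead of every region in the table. Whether a prefix is needed, and what it should be, depends on the schema.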
>
> On Mon, Mar 2, 2015 at 12:42 AM, Kristoffer Sjögren <stoffe@gmail.com>
> wrote:
>
> > Thanks, great explanation!
> >
> > Forgive my laziness, but do you happen to know what part(s) of the code
> > base to look into for even more details?
> >
> > On Sun, Mar 1, 2015 at 9:38 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org>
> > wrote:
> >
> > > I was going to say something similar. But as soon as you have a major
> > > compaction you end up with a single file with everything in it. So
> > > depending on your key distribution you might still read everything. If you
> > > read just the last few minutes over a huge table, then yes, skip will help.
> > > Else, I'm not sure it will help that much :(
> > >
> > > 2015-02-28 18:25 GMT-05:00 Nick Dimiduk <ndimiduk@gmail.com>:
> > >
> > > > A Scan without start and end rows will be issued to all regions in the
> > > > table -- a full table scan. Within each region, store files will be
> > > > selected to participate in the scan based on the min/max timestamps
> > > > from their headers.
> > > >
> > > > On Saturday, February 28, 2015, Kristoffer Sjögren <stoffe@gmail.com>
> > > > wrote:
> > > >
> > > > > If Scan.setTimeRange is a full table scan then it runs surprisingly fast
> > > > > on tables that host a few hundred million rows :-)
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Feb 28, 2015 at 8:05 PM, Kristoffer Sjögren <stoffe@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jean-Marc
> > > > > >
> > > > > > > I was thinking of Scan.setTimeRange to only get the x latest rows,
> > > > > > > but I would like to avoid a full table scan.
> > > > > > >
> > > > > > > The alternative would be to set the timestamp in the key and use
> > > > > > > start and stop keys. But since HBase is already aware of timestamps
> > > > > > > I thought it might optimize Scan.setTimeRange scans?
> > > > > >
> > > > > > Cheers,
> > > > > > -Kristoffer
> > > > > >
> > > > > > > On Sat, Feb 28, 2015 at 7:45 PM, Jean-Marc Spaggiari
> > > > > > > <jean-marc@spaggiari.org> wrote:
> > > > > >
> > > > > >> Hi Kristoffer,
> > > > > >>
> > > > > > >> What do you mean by "timerange scans"? If you want to scan everything
> > > > > > >> from your table, you will always end up with a full table scan, no?
> > > > > >>
> > > > > >> JM
> > > > > >>
> > > > > >> 2015-02-28 13:41 GMT-05:00 Kristoffer Sjögren <stoffe@gmail.com
> > > > > <javascript:;>>:
> > > > > >>
> > > > > >> > Hi
> > > > > >> >
> > > > > > >> > I want to understand the effectiveness of timerange scans
> > > > > > >> > without setting start and stop keys. Will HBase do a full table
> > > > > > >> > scan or will the scan be optimized in any way?
> > > > > >> >
> > > > > >> > Cheers,
> > > > > >> > -Kristoffer
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
