hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh <jof...@gmail.com>
Subject Re: Efficient time based queries - TIMERANGE or STARTROW/STOPROW?
Date Thu, 13 Apr 2017 13:49:45 GMT
Hi Josh,

Ok thanks for the insights. I will stick to using STARTROW and STOPROW
along with a reverse timestamp in the row key for these queries then.

Thanks!

On Thu, Apr 13, 2017 at 4:36 AM, Josh Elser <josh.elser@gmail.com> wrote:

> If it's helpful to state it in generic terms: specifying a range of
> HBase timestamps is *only* a post-filter (server-side) and *never* a
> primary search criteria.
>
> In other words, searching by the HBase timestamp is a full-table scan
> (exhaustive search). While the timestamp can be nice for certain
> use-cases, I think it best to use it sparingly and only for its
> original purpose (preserving old versions of a cell), and not as a
> "query parameter". I've seen way too many use-cases where people think
> they're being tricky by using timestamps to implement a poor-man's
> search. This works for small amounts of data, but quickly falls to its
> knees with non-trivial data-sets.
>
> On Wed, Apr 12, 2017 at 3:05 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > If you change the rowkey, it would be tricky to satisfy what you stated
> in
> > your first email:
> >
> > bq. retrieve the most recent 10 rows
> >
> > Since the same uuid may have many rows which are no longer sorted in the
> > order amenable to your query.
> >
> > Cheers
> >
> > On Wed, Apr 12, 2017 at 10:43 AM, Josh <jofo90@gmail.com> wrote:
> >
> >> Hi Ted,
> >>
> >> Thanks for the fast reply!
> >> Ok I see - just out of interest, if I changed my row key to be
> >> uuid#timestamp  (instead of uuid#reverse_timestamp) - would the
> timestamp
> >> approach still be equally efficient? I just want to understand whether
> or
> >> not the timestamp approach is relying on the ordering of my row keys.
> >>
> >> Josh
> >>
> >> On Wed, Apr 12, 2017 at 6:39 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>
> >> > Since STARTROW is specified (with uuid) in both of your examples, I
> think
> >> > their efficiency should be tantamount.
> >> >
> >> > Cheers
> >> >
> >> > On Wed, Apr 12, 2017 at 10:33 AM, Josh <jofo90@gmail.com> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I am just getting started with HBase, and have a question about the
> >> > > efficiency of timestamp based scans.
> >> > >
> >> > > My table's row key has structure `uuid#reverse_timestamp` where
> >> > > reverse_timestamp is (java.lang.Long.MAX_VALUE - time in millis when
> >> the
> >> > > row was written). For a given uuid I want to be able to retrieve the
> >> most
> >> > > recent 10 rows in the table where timestamp is greater than x. It's
> >> > > possible that a given uuid may have many thousands of rows (with
> >> > different
> >> > > timestamps).
> >> > >
> >> > > I found there are two ways to run my query:
> >> > > 1. use HBase's built in timestamps and scan a time range:
> >> > > > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-
> f498ddc8adcc',
> >> > > TIMERANGE => [x, current_time], LIMIT => 10}
> >> > >
> >> > > 2. use only my row keys to do the scan, with STARTROW and STOPROW:
> >> > > scan 'mytable', {STARTROW => '647b2194-fbb8-46af-95ba-
> f498ddc8adcc',
> >> > > STOPROW='647b2194-fbb8-46af-95ba-f498ddc8adcc#x', LIMIT => 10}
> >> > >
> >> > > Both of these seem to work - but is one more efficient that the
> other?
> >> > >
> >> > > Thanks for any advice,
> >> > > Josh
> >> > >
> >> >
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message