hbase-user mailing list archives

From Dave Latham <lat...@davelink.net>
Subject Re: scan column families with different time ranges
Date Mon, 03 Aug 2015 19:23:01 GMT
I have not tried out stripe compaction and don't see how it would help here.

On Mon, Aug 3, 2015 at 12:16 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> bq. revive some notion of tiered compaction
>
> Did you have a chance to try out Stripe compaction ?
>
> Thanks
>
> On Mon, Aug 3, 2015 at 11:14 AM, Dave Latham <latham@davelink.net> wrote:
>
> > Jean-Marc,
> >
> > "Recent" is often last 24 hours or so, though if this is worked out I may
> > use it for other ranges as well.  Yes, currently there are weekly major
> > compactions, so recently compacted regions would not be able to exclude
> > the old store files. That's why I'm also hoping to revive some notion of
> > tiered compaction to keep older data in separate store files from recent
> > data.
> >
> > Dave
> >
> > On Sun, Aug 2, 2015 at 6:22 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> >
> > > Just thinking out loud:
> > > "Cutting out the old store files could well also reduce disk IO for
> > > that family by 100x."
> > >
> > > What is "recent" for your data? More than 7 days? Or less? Don't you
> > > have weekly major compactions? If so, and if you are scanning for more
> > > than 7 days, then you will read the older files anyway, no?
> > >
> > > JM
> > > On 2015-08-02 05:57, "Ted Yu" <yuzhihong@gmail.com> wrote:
> > >
> > > > Dave:
> > > > I wonder if Filter response can be enhanced in the following manner:
> > > >
> > > > http://pastebin.com/sb6apTPm
> > > >
> > > > My approach is based on using essential column family (column family A
> > > > in your case) to guide whether the remaining column families should be
> > > > loaded.
> > > > To be specific, if outside the TimeRange you specify (last day), your
> > > > filter returns ReturnCode.INCLUDE_AND_SEEK_NEXT_ROW.
> > > >
> > > > What do you think ?
> > > >
> > > > Cheers
> > > >
> > > > > On Sat, Aug 1, 2015 at 8:06 PM, Dave Latham <latham@davelink.net> wrote:
> > > >
> > > > > > Thanks for brainstorming, Ted.  That sounds like option 2 I listed
> > > > > > using a separate scanner for A vs B which "adds complexity to the job
> > > > > > and gives up the atomicity/consistency guarantees as new writes hit
> > > > > > both column families".
> > > > >
> > > > > > On Sat, Aug 1, 2015 at 9:07 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > >
> > > > > > Can you achieve your goal with two scans ?
> > > > > > > The first scan specifies TimeRange corresponding to last day. This
> > > > > > > scan returns both column families.
> > > > > > > The other scan specifies TimeRange excluding last day. This scan
> > > > > > > returns column family A.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > > On Sat, Aug 1, 2015 at 8:35 AM, Dave Latham <latham@davelink.net> wrote:
> > > > > >
> > > > > > > Hi Ted,
> > > > > > >
> > > > > > > > Thanks for the suggestion, but I'm not sure that it helps my case
> > > > > > > > much.  I wasn't very familiar with the feature, and it doesn't seem
> > > > > > > > very well documented - I had to go to the source and the originating
> > > > > > > > JIRA to understand how it works.  It sounds like it allows you to
> > > > > > > > mark which column families the filter operates on ("essential" seems
> > > > > > > > an odd name).  If any data from those column families passes the
> > > > > > > > filter, then the scan loads and includes data from the remaining
> > > > > > > > families without filtering it.  In my case, it's not clear from a
> > > > > > > > row's family A whether or not family B for that row is required
> > > > > > > > (though that could probably be added).  Moreover, even if a row has
> > > > > > > > recent data, we don't want to load all the old data from that row.
> > > > > > > > We'd prefer to be able to entirely skip reading the data off disk
> > > > > > > > for the old store files.
> > > > > > >
> > > > > > > Dave
> > > > > > >
> > > > > > > > On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > >
> > > > > > > > > Have you considered using essential column family feature (through
> > > > > > > > > Filter) ?
> > > > > > > > > In your case A would be the essential column family.
> > > > > > > > > Within TimeRange for recent data, the filter would return both
> > > > > > > > > column families.
> > > > > > > > > Outside the TimeRange, only family A is returned.
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <latham@davelink.net> wrote:
> > > > > > > >
> > > > > > > > > > I have a table with 2 column families, call them A and B, with new
> > > > > > > > > > data regularly being added. They are very different sizes: B is 100x
> > > > > > > > > > the size of A.  Among other uses for this data, I have a MapReduce
> > > > > > > > > > job that needs to read all of A, but only recent data from B (e.g.
> > > > > > > > > > last day).  Here are some methods I've considered:
> > > > > > > > >
> > > > > > > > >    1. Use a Filter to get throw out older data
from B (this
> > is
> > > > > what I
> > > > > > > > >    currently do).  However, all the data from
B still needs
> > to
> > > be
> > > > > > read
> > > > > > > > from
> > > > > > > > >    disk, causing a disk IO bottleneck.
> > > > > > > > >    2. Configure the table input format to read
from B only,
> > > > using a
> > > > > > > > >    TimeRange for recent data, and have each map
task open a
> > > > > separate
> > > > > > > > > scanner
> > > > > > > > >    for A (without a TimeRange) then merge the
data in the
> map
> > > > task.
> > > > > > > > > However,
> > > > > > > > >    this adds complexity to the job and gives
up the
> > > > > > > atomicity/consistency
> > > > > > > > >    guarantees as new writes hit both column families.
> > > > > > > > >    3. Add a new column family C to the table
with an
> > additional
> > > > > copy
> > > > > > of
> > > > > > > > the
> > > > > > > > >    data in B, but set a TTL on it.  All writes
duplicate
> the
> > > data
> > > > > > > written
> > > > > > > > > to B
> > > > > > > > >    and C.  Change the scan to include C instead
of B.
> > However,
> > > > > this
> > > > > > > adds
> > > > > > > > > all
> > > > > > > > >    the overhead of another column family, more
writes, and
> > > having
> > > > > to
> > > > > > > set
> > > > > > > > > the
> > > > > > > > >    TTL to the maximum of any time window I want
to scan
> > > > > efficiently.
> > > > > > > > >    4. Implement an enhancement to HBase's Scan
to allow
> > giving
> > > > each
> > > > > > > > column
> > > > > > > > >    family its own TimeRange.  The job would then
be able to
> > > skip
> > > > > most
> > > > > > > old
> > > > > > > > >    large store files (hopefully all of them with
tiered
> > > > compaction
> > > > > at
> > > > > > > > some
> > > > > > > > >    point).
> > > > > > > > >
> > > > > > > > > > Does anyone have other suggestions?  Would HBase be willing to
> > > > > > > > > > accept updating Scan to have a different TimeRange for each column
> > > > > > > > > > family?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Dave
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
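
Option 4 in the original message, together with the later point about weekly major compactions, comes down to whether old store files for family B can be skipped by timestamp. The following is a minimal sketch of that reasoning in plain Java, not HBase code. StoreFileInfo, TimeRangePruning, filesToRead, and the byte sizes are hypothetical names and numbers chosen for illustration, and the half-open range [from, to) is an assumption of the model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a store file: a name, a size, and the smallest and
// largest cell timestamps it holds. This is not an HBase class.
final class StoreFileInfo {
    final String name;
    final long sizeBytes;
    final long minTs;
    final long maxTs;

    StoreFileInfo(String name, long sizeBytes, long minTs, long maxTs) {
        this.name = name;
        this.sizeBytes = sizeBytes;
        this.minTs = minTs;
        this.maxTs = maxTs;
    }
}

final class TimeRangePruning {
    // Keeps only the files whose [minTs, maxTs] overlaps the scan range [from, to).
    static List<StoreFileInfo> filesToRead(List<StoreFileInfo> files, long from, long to) {
        List<StoreFileInfo> out = new ArrayList<>();
        for (StoreFileInfo f : files) {
            if (f.maxTs >= from && f.minTs < to) {
                out.add(f);
            }
        }
        return out;
    }

    // Sums the sizes of the given files, as a stand-in for disk IO.
    static long totalBytes(List<StoreFileInfo> files) {
        long sum = 0;
        for (StoreFileInfo f : files) {
            sum += f.sizeBytes;
        }
        return sum;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 30 * day;

        // Family B with old and recent data kept in separate files (assumed tiered layout).
        List<StoreFileInfo> tiered = Arrays.asList(
            new StoreFileInfo("B-old", 99_000L, 0L, now - 2 * day),
            new StoreFileInfo("B-recent", 1_000L, now - day, now));

        // The same data after a major compaction has merged it into one file.
        List<StoreFileInfo> compacted = Arrays.asList(
            new StoreFileInfo("B-all", 100_000L, 0L, now));

        // Scan family B for the last day only.
        System.out.println(totalBytes(filesToRead(tiered, now - day, now + 1)));    // 1000
        System.out.println(totalBytes(filesToRead(compacted, now - day, now + 1))); // 100000
    }
}
```

With the tiered layout only the small recent file overlaps the last-day range. Once everything sits in one compacted file, that file overlaps the range and nothing can be skipped, which is the concern raised about weekly major compactions.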

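
The essential column family behaviour discussed in the thread can be modelled in a few lines of plain Java. This is a sketch of the semantics as Dave describes them, not the HBase Filter API. KvCell, EssentialFamilyModel, and scan are hypothetical names, and treating "recent" as a single timestamp threshold on family A is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory cell (row, family, timestamp). This is not an HBase class.
final class KvCell {
    final String row;
    final String family;
    final long ts;

    KvCell(String row, String family, long ts) {
        this.row = row;
        this.family = family;
        this.ts = ts;
    }
}

final class EssentialFamilyModel {
    // The recency check looks only at the essential family A. Every A cell is
    // returned. If any A cell in a row is recent, all other cells of that row
    // (family B here) are loaded and returned without further filtering.
    static List<KvCell> scan(List<KvCell> cells, long recentFrom) {
        // Group cells by row, preserving input order.
        Map<String, List<KvCell>> byRow = new LinkedHashMap<>();
        for (KvCell c : cells) {
            byRow.computeIfAbsent(c.row, k -> new ArrayList<>()).add(c);
        }
        List<KvCell> out = new ArrayList<>();
        for (List<KvCell> rowCells : byRow.values()) {
            boolean recentA = false;
            for (KvCell c : rowCells) {
                if (c.family.equals("A") && c.ts >= recentFrom) {
                    recentA = true;
                }
            }
            for (KvCell c : rowCells) {
                if (c.family.equals("A") || recentA) {
                    out.add(c);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<KvCell> cells = Arrays.asList(
            new KvCell("r1", "A", 100), new KvCell("r1", "B", 1), new KvCell("r1", "B", 100),
            new KvCell("r2", "A", 1), new KvCell("r2", "B", 1));
        for (KvCell c : scan(cells, 50)) {
            System.out.println(c.row + " " + c.family + " @" + c.ts);
        }
    }
}
```

In this model row r1 returns its old B cell as well as the recent one, which is the objection raised in the thread: a recent A cell pulls in all of that row's B data, so the old B store files still have to be read.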