Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
Sender: ddlatham@gmail.com
In-Reply-To: 
 <CALte62ys72+0gMqyCyAVataHNnbqoTtXUhYY3wNmvWYh7xT4Kg@mail.gmail.com>
References: 
 <CAA4mzom00ef5eoXStK0HEtxebY8mQSs61GBVGttgpASpmhQHaw@mail.gmail.com>
 <CALte62ycQ+6rmdvAaRZY-_jTzwHjhkHmuWjEAc6FXok3Qy5B8A@mail.gmail.com>
 <CAA4mzomEfuShjLbNX5V1RHfLgO3bemL0VX9_-gxtxnpZD+3LZQ@mail.gmail.com>
 <CALte62ys72+0gMqyCyAVataHNnbqoTtXUhYY3wNmvWYh7xT4Kg@mail.gmail.com>
From: Dave Latham <latham@davelink.net>
Date: Sat, 1 Aug 2015 20:06:00 -0700
Message-ID: 
 <CAA4mzo=n1Fjp8WbiMcpmO0gKyc9PTsBSp6zGk5jYvUyhc2SQ+A@mail.gmail.com>
Subject: Re: scan column families with different time ranges
To: hbase-user <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=94eb2c0807fa573c63051c4b5629

--94eb2c0807fa573c63051c4b5629
Content-Type: text/plain; charset=UTF-8

Thanks for brainstorming, Ted.  That sounds like the option 2 I listed, which
uses a separate scanner for A vs. B and "adds complexity to the job and gives
up the atomicity/consistency guarantees as new writes hit both column
families".
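
For illustration only, the per-family TimeRange idea from option 4 could
behave roughly as sketched below. This is a toy in-memory model in plain
Java, not HBase client code: the class PerFamilyTimeRangeSketch and its
Cell, Range, scan and mustReadFile members are hypothetical names invented
for this sketch. The only HBase behavior assumed is that a TimeRange is
half-open, [min, max).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model, NOT the HBase client API: every name here is hypothetical.
final class PerFamilyTimeRangeSketch {

    /** One versioned cell: row key, column family, write timestamp. */
    static final class Cell {
        final String row;
        final String family;
        final long timestamp;

        Cell(String row, String family, long timestamp) {
            this.row = row;
            this.family = family;
            this.timestamp = timestamp;
        }
    }

    /** Half-open interval [min, max). */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }

        boolean contains(long ts) {
            return ts >= min && ts < max;
        }

        /** True if a file whose cells span [fileMin, fileMax] (inclusive) may hold matches. */
        boolean overlaps(long fileMin, long fileMax) {
            return fileMin < max && fileMax >= min;
        }
    }

    /** Single pass over all cells; a family with no entry in the map is unrestricted. */
    static List<Cell> scan(List<Cell> cells, Map<String, Range> rangeByFamily) {
        List<Cell> out = new ArrayList<>();
        for (Cell c : cells) {
            Range r = rangeByFamily.get(c.family);
            if (r == null || r.contains(c.timestamp)) {
                out.add(c);
            }
        }
        return out;
    }

    /** Store-file pruning: a file can be skipped when its family's range cannot match it. */
    static boolean mustReadFile(String family, long fileMin, long fileMax,
                                Map<String, Range> rangeByFamily) {
        Range r = rangeByFamily.get(family);
        return r == null || r.overlaps(fileMin, fileMax);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long oneDayMs = 24L * 60 * 60 * 1000;

        // Restrict only B to the last day; A stays unrestricted.
        Map<String, Range> ranges = new HashMap<>();
        ranges.put("B", new Range(now - oneDayMs, now + 1));

        List<Cell> cells = Arrays.asList(
                new Cell("row1", "A", now - 30 * oneDayMs),
                new Cell("row1", "B", now - 30 * oneDayMs),
                new Cell("row1", "B", now - 1000));

        for (Cell c : scan(cells, ranges)) {
            System.out.println(c.row + " " + c.family + " " + c.timestamp);
        }
        System.out.println("read old B file? "
                + mustReadFile("B", now - 40 * oneDayMs, now - 20 * oneDayMs, ranges));
    }
}
```

The point of the model is that both families are filtered in one pass over
the same rows, so there is no second scanner to merge, and the same
per-family range is what would let old store files for B be skipped.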

On Sat, Aug 1, 2015 at 9:07 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Can you achieve your goal with two scans ?
> The first scan specifies TimeRange corresponding to last day. This scan
> returns both column families.
> The other scan specifies TimeRange excluding last day. This scan returns
> column family A.
>
> Cheers
>
> On Sat, Aug 1, 2015 at 8:35 AM, Dave Latham <latham@davelink.net> wrote:
>
> > Hi Ted,
> >
> > Thanks for the suggestion, but I'm not sure that it helps my case much.  I
> > wasn't very familiar with the feature, and it doesn't seem very well
> > documented - I had to go to the source and the originating JIRA to
> > understand how it works.  It sounds like it allows you to mark which column
> > families the filter operates on ("essential" seems an odd name).  If any
> > data from those column families passes the filter, then the scan loads and
> > includes data from the remaining families without filtering it.  In my
> > case, it's not clear from a row's family A whether or not family B for that
> > row is required (though that could probably be added).  Moreover, even if a
> > row has recent data, we don't want to load all the old data from that row.
> > We'd prefer to be able to entirely skip reading the data off disk for the
> > old store files.
> >
> > Dave
> >
> > On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > Have you considered using the essential column family feature (through a
> > > Filter)?
> > > In your case A would be the essential column family.
> > > Within TimeRange for recent data, the filter would return both column
> > > families.
> > > Outside the TimeRange, only family A is returned.
> > >
> > > Cheers
> > >
> > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <latham@davelink.net> wrote:
> > >
> > > > I have a table with 2 column families, call them A and B, with new data
> > > > regularly being added. They are very different sizes: B is 100x the size of
> > > > A.  Among other uses for this data, I have a MapReduce job that needs to
> > > > read all of A, but only recent data from B (e.g. last day).  Here are some
> > > > methods I've considered:
> > > >
> > > >    1. Use a Filter to throw out older data from B (this is what I
> > > >    currently do).  However, all the data from B still needs to be read from
> > > >    disk, causing a disk IO bottleneck.
> > > >    2. Configure the table input format to read from B only, using a
> > > >    TimeRange for recent data, and have each map task open a separate scanner
> > > >    for A (without a TimeRange), then merge the data in the map task.  However,
> > > >    this adds complexity to the job and gives up the atomicity/consistency
> > > >    guarantees as new writes hit both column families.
> > > >    3. Add a new column family C to the table with an additional copy of the
> > > >    data in B, but set a TTL on it.  All writes duplicate the data written to B
> > > >    and C.  Change the scan to include C instead of B.  However, this adds all
> > > >    the overhead of another column family, more writes, and having to set the
> > > >    TTL to the maximum of any time window I want to scan efficiently.
> > > >    4. Implement an enhancement to HBase's Scan to allow giving each column
> > > >    family its own TimeRange.  The job would then be able to skip most old
> > > >    large store files (hopefully all of them with tiered compaction at some
> > > >    point).
> > > >
> > > > Does anyone have other suggestions?  Would HBase be willing to accept
> > > > updating Scan to have a different TimeRange for each column family?
> > > >
> > > >
> > > > Dave
> > > >
> > >
> >
>

--94eb2c0807fa573c63051c4b5629--