hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: scan column families with different time ranges
Date Sat, 01 Aug 2015 14:53:07 GMT
Have you considered using the essential column family feature (through a Filter)?
In your case, A would be the essential column family.
Within the TimeRange for recent data, the filter would return both column families.
Outside the TimeRange, only family A is returned.
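
To make the behaviour above concrete, here is a minimal plain-Java model of it. This is only a sketch and does not use the HBase API: the class EssentialFamilySketch, its Row type, and the scan method are hypothetical names made up for illustration. In HBase itself the corresponding hooks would be Filter.isFamilyEssential and Scan.setLoadColumnFamiliesOnDemand, with the filter logic written against the real cell types. The model assumes the decision to load B is made per row from the newest timestamp seen in A, and that the range is half-open, [min, max).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the essential column family idea. Not the HBase API.
final class EssentialFamilySketch {

    // Hypothetical row: only the newest cell timestamp in family A matters here.
    static final class Row {
        final String key;
        final long newestTimestampInA;

        Row(String key, long newestTimestampInA) {
            this.key = key;
            this.newestTimestampInA = newestTimestampInA;
        }
    }

    // Returns, for each row, which families the scan would load.
    // Family A (essential) is always read; family B is read only when the
    // row's A timestamp falls inside the half-open range [minTs, maxTs).
    static List<String> scan(List<Row> rows, long minTs, long maxTs) {
        List<String> loaded = new ArrayList<>();
        for (Row row : rows) {
            boolean recent = row.newestTimestampInA >= minTs
                    && row.newestTimestampInA < maxTs;
            loaded.add(row.key + (recent ? ":A,B" : ":A"));
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("r1", 50L));   // old row: B is never touched
        rows.add(new Row("r2", 150L));  // recent row: B is loaded as well
        System.out.println(scan(rows, 100L, 200L));
    }
}
```

In this model the large family is only touched for rows that pass the check on A, which is where the disk IO saving would come from.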


On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <latham@davelink.net> wrote:

> I have a table with 2 column families, call them A and B, with new data
> regularly being added. They are very different sizes: B is 100x the size of
> A.  Among other uses for this data, I have a MapReduce job that needs to
> read all of A, but only recent data from B (e.g. last day).  Here are some
> methods I've considered:
>    1. Use a Filter to throw out older data from B (this is what I
>    currently do).  However, all the data from B still needs to be read from
>    disk, causing a disk IO bottleneck.
>    2. Configure the table input format to read from B only, using a
>    TimeRange for recent data, and have each map task open a separate
>    scanner for A (without a TimeRange) then merge the data in the map
>    task. However, this adds complexity to the job and gives up the
>    atomicity/consistency guarantees as new writes hit both column families.
>    3. Add a new column family C to the table with an additional copy of the
>    data in B, but set a TTL on it.  All writes duplicate the data written
>    to B and C.  Change the scan to include C instead of B.  However, this
>    adds all the overhead of another column family, more writes, and having
>    to set the TTL to the maximum of any time window I want to scan
>    efficiently.
>    4. Implement an enhancement to HBase's Scan to allow giving each column
>    family its own TimeRange.  The job would then be able to skip most old
>    large store files (hopefully all of them with tiered compaction at some
>    point).
> Does anyone have other suggestions?  Would HBase be willing to accept
> updating Scan to have a different TimeRange for each column family?
> Dave
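
The store file skipping described in option 4 of the quoted message can be sketched in plain Java as follows. This is a model under stated assumptions, not HBase code: PerFamilyTimeRangeSketch, StoreFile, TimeRange, and filesToRead are hypothetical names, each store file is assumed to record the inclusive minimum and maximum timestamp of its cells, and the requested range is assumed to be half-open, [min, max). A family with no entry in the map is read in full.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of per-family time ranges on a scan. Not the HBase API.
final class PerFamilyTimeRangeSketch {

    // Hypothetical store file with the inclusive timestamp bounds of its cells.
    static final class StoreFile {
        final String family;
        final String name;
        final long minTs;
        final long maxTs;

        StoreFile(String family, String name, long minTs, long maxTs) {
            this.family = family;
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    // Hypothetical half-open time range [min, max).
    static final class TimeRange {
        final long min;
        final long max;

        TimeRange(long min, long max) {
            this.min = min;
            this.max = max;
        }

        // True when at least one timestamp in the file could fall in the range.
        boolean overlaps(StoreFile file) {
            return file.maxTs >= min && file.minTs < max;
        }
    }

    // Selects the files a scan would have to open. Families without a
    // configured range are read completely; other families skip every file
    // whose timestamp bounds cannot overlap the requested range.
    static List<String> filesToRead(List<StoreFile> files, Map<String, TimeRange> ranges) {
        List<String> selected = new ArrayList<>();
        for (StoreFile file : files) {
            TimeRange range = ranges.get(file.family);
            if (range == null || range.overlaps(file)) {
                selected.add(file.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<StoreFile> files = new ArrayList<>();
        files.add(new StoreFile("A", "A-old", 0L, 99L));
        files.add(new StoreFile("B", "B-old", 0L, 99L));
        files.add(new StoreFile("B", "B-new", 100L, 199L));

        Map<String, TimeRange> ranges = new HashMap<>();
        ranges.put("B", new TimeRange(100L, Long.MAX_VALUE)); // recent B only

        System.out.println(filesToRead(files, ranges));
    }
}
```

With all of A and only the recent range of B requested, the old B file is never opened, which is the IO saving the option is after.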
