hbase-user mailing list archives

From Anoop John <anoop.hb...@gmail.com>
Subject Re: Eliminating duplicate values
Date Sun, 03 Mar 2013 15:37:32 GMT
Matt,
I remember someone else also sent a mail some days back looking for the
same use case.
Yes, a coprocessor (CP) can help. Maybe do the deletion of duplicates at
major compaction time?
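
To make the rule concrete, here is a minimal sketch of the comparison Matt
describes below, written as plain Java rather than against the HBase
coprocessor API. VersionedCell and DuplicateVersionFilter are hypothetical
names invented for illustration; they are not HBase classes. The sketch
assumes cells arrive the way a compaction scan would deliver them (grouped by
row/family/qualifier, newest timestamp first) and that "duplicates" means
consecutive versions with identical values, so a value that changes from A to
B and back to A keeps both A entries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a cell; NOT the real HBase Cell/KeyValue type.
class VersionedCell {
    final String row;
    final String family;
    final String qualifier;
    final long timestamp;
    final byte[] value;

    VersionedCell(String row, String family, String qualifier, long timestamp, byte[] value) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
        this.value = value;
    }

    boolean sameColumn(VersionedCell other) {
        return row.equals(other.row)
                && family.equals(other.family)
                && qualifier.equals(other.qualifier);
    }
}

// Hypothetical helper illustrating the "keep the oldest duplicate" rule.
class DuplicateVersionFilter {

    // Assumes input is grouped by column and sorted newest-first within a column.
    static List<VersionedCell> keepOldestOfEachRun(List<VersionedCell> cells) {
        List<VersionedCell> kept = new ArrayList<>();
        for (VersionedCell cell : cells) {
            if (!kept.isEmpty()) {
                VersionedCell last = kept.get(kept.size() - 1);
                // Same column and same bytes: the incoming cell is an older copy
                // of the same value, so it replaces the newer one we were holding.
                if (last.sameColumn(cell) && Arrays.equals(last.value, cell.value)) {
                    kept.set(kept.size() - 1, cell);
                    continue;
                }
            }
            kept.add(cell);
        }
        return kept;
    }

    public static void main(String[] args) {
        byte[] a = "A".getBytes(StandardCharsets.UTF_8);
        byte[] b = "B".getBytes(StandardCharsets.UTF_8);
        List<VersionedCell> input = Arrays.asList(
                new VersionedCell("r1", "f", "q", 5L, a),
                new VersionedCell("r1", "f", "q", 4L, a),
                new VersionedCell("r1", "f", "q", 3L, b),
                new VersionedCell("r1", "f", "q", 2L, a),
                new VersionedCell("r1", "f", "q", 1L, a));
        for (VersionedCell c : keepOldestOfEachRun(input)) {
            System.out.println(c.timestamp + " " + new String(c.value, StandardCharsets.UTF_8));
        }
    }
}
```

The value comparison (Arrays.equals on every adjacent pair of versions) is
where the extra compaction-time CPU cost would come from. A real
implementation inside a compaction hook would stream cells and hold back one
cell at a time instead of building a list, and would have to decide how
delete markers interact with the rule; neither is handled here.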


On Sun, Mar 3, 2013 at 9:12 AM, Matt Corgan <mcorgan@hotpads.com> wrote:

> I have a few use cases where I'd like to leverage HBase's high write
> throughput to blindly write lots of data even if most of it hasn't changed
> since the last write.  I want to retain MAX_VERSIONS=Integer.MAX_VALUE,
> however, I don't want to keep all the duplicate copies around forever.  At
> compaction time, I'd like the compactor to compare the values of cells with
> the same row/family/qualifier and only keep the *oldest* version of
> duplicates.  By keeping the oldest versions I can get a snapshot of a row
> at any historical time.
> Lars, I think you said Salesforce retains many versions of cells - do you
> retain all the duplicates?
> I'm guessing co-processors would be the solution and am looking for some
> pointers on the cleanest way to implement it or some code if anyone has
> already solved the problem.
> I'm also wondering if people think it's a generic enough use case that
> HBase could support it natively, say, with a column family attribute
> DISCARD_NEWEST_DUPLICATE=true/false.  The cost would be higher CPU usage at
> compaction time because of all the value comparisons.
> Thanks for any tips,
> Matt
