hbase-user mailing list archives

From Michel Segel <michael_se...@hotmail.com>
Subject Re: Eliminating duplicate values
Date Sun, 03 Mar 2013 05:09:43 GMT
There are no duplicates.
Cells have versions, which are timestamped. You could set the number of versions to one...
But I'd recommend sticking with the default of 3.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Mar 2, 2013, at 9:42 PM, Matt Corgan <mcorgan@hotpads.com> wrote:

> I have a few use cases where I'd like to leverage HBase's high write
> throughput to blindly write lots of data even if most of it hasn't changed
> since the last write.  I want to retain MAX_VERSIONS=Integer.MAX_VALUE,
> however, I don't want to keep all the duplicate copies around forever.  At
> compaction time, I'd like the compactor to compare the values of cells with
> the same row/family/qualifier and only keep the *oldest* version of
> duplicates.  By keeping the oldest versions I can get a snapshot of a row
> at any historical time.
> 
> Lars, I think you said Salesforce retains many versions of cells - do you
> retain all the duplicates?
> 
> I'm guessing co-processors would be the solution and am looking for some
> pointers on the cleanest way to implement it or some code if anyone has
> already solved the problem.
> 
> I'm also wondering if people think it's a generic enough use case that
> HBase could support it natively, say, with a column family attribute
> DISCARD_NEWEST_DUPLICATE=true/false.  The cost would be higher CPU usage at
> compaction time because of all the value comparisons.
> 
> Thanks for any tips,
> Matt
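
For illustration, the de-duplication rule Matt describes can be sketched outside HBase as a plain function over the versions of a single row/family/qualifier. This is a minimal sketch, not HBase code and not a coprocessor: the names `drop_newer_duplicates` and `value_at` are hypothetical, cells are modelled as `(timestamp, value)` tuples, and "duplicate" is assumed to mean a version whose value equals that of the next-older version, so that a value which changes and later changes back is still kept.

```python
from typing import List, Optional, Tuple

# A cell version for one row/family/qualifier: (timestamp, value).
# This tuple model is an assumption for illustration, not an HBase type.
Cell = Tuple[int, bytes]


def drop_newer_duplicates(versions: List[Cell]) -> List[Cell]:
    """Keep only the oldest cell of each run of identical consecutive values.

    Input may be in any order; the result is newest-first, which is the
    order HBase uses when returning versions.
    """
    kept: List[Cell] = []
    for ts, value in sorted(versions):  # walk from oldest to newest
        if kept and kept[-1][1] == value:
            continue  # same value as the next-older kept version: discard
        kept.append((ts, value))
    kept.reverse()  # newest first
    return kept


def value_at(versions_newest_first: List[Cell], ts: int) -> Optional[bytes]:
    """Return the value visible at time ts, or None if no version existed yet."""
    for cell_ts, value in versions_newest_first:
        if cell_ts <= ts:
            return value
    return None


if __name__ == "__main__":
    cells = [(1, b"a"), (2, b"a"), (3, b"b"), (4, b"b"), (5, b"a")]
    compacted = drop_newer_duplicates(cells)
    print(compacted)
    print(value_at(compacted, 4))
```

Because only the oldest copy of each unchanged run survives, a read "as of" any timestamp sees the same value before and after the pass, which is the historical-snapshot property the original message asks for. The cost is one value comparison per version, matching the extra compaction-time CPU mentioned above.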
