hbase-user mailing list archives

From Anoop John <anoop.hb...@gmail.com>
Subject Re: Eliminating duplicate values
Date Sun, 03 Mar 2013 15:37:32 GMT
Matt Corgan,
I remember someone else also sent a mail some days back looking for the same
use case.  Yes, a CP (coprocessor) can help.  Maybe do the deletion of
duplicates at major compaction time?
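
Something like the rough sketch below - untested, written against 0.94-style
KeyValue / BaseRegionObserver APIs.  The preCompact hook signature and the set
of InternalScanner.next(...) overloads change between versions, and limiting
this to major compactions only would need the hook variant that exposes the
compaction request.  The class name DedupeOnCompactObserver is just for
illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.Store;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupeOnCompactObserver extends BaseRegionObserver {

  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e,
      Store store, final InternalScanner scanner) throws IOException {
    // Wrap the compaction scanner so newer duplicate versions are dropped
    // before the compacted file is written out.
    return new InternalScanner() {

      @Override
      public boolean next(List<KeyValue> results) throws IOException {
        List<KeyValue> raw = new ArrayList<KeyValue>();
        boolean more = scanner.next(raw);
        dedupe(raw, results);
        return more;
      }

      @Override
      public boolean next(List<KeyValue> results, int limit) throws IOException {
        List<KeyValue> raw = new ArrayList<KeyValue>();
        boolean more = scanner.next(raw, limit);
        dedupe(raw, results);
        return more;
      }

      // 0.94 also has next(...) overloads that carry a metric name; they just
      // delegate the same way.
      @Override
      public boolean next(List<KeyValue> results, String metric) throws IOException {
        List<KeyValue> raw = new ArrayList<KeyValue>();
        boolean more = scanner.next(raw, metric);
        dedupe(raw, results);
        return more;
      }

      @Override
      public boolean next(List<KeyValue> results, int limit, String metric)
          throws IOException {
        List<KeyValue> raw = new ArrayList<KeyValue>();
        boolean more = scanner.next(raw, limit, metric);
        dedupe(raw, results);
        return more;
      }

      // Versions of a column arrive newest-first, so a cell is dropped when the
      // next (older) cell in the same row/family/qualifier carries the same
      // value.  The oldest cell of each run of identical values survives, which
      // keeps the timestamp at which that value first appeared.
      private void dedupe(List<KeyValue> raw, List<KeyValue> results) {
        for (int i = 0; i < raw.size(); i++) {
          KeyValue cur = raw.get(i);
          KeyValue older = (i + 1 < raw.size()) ? raw.get(i + 1) : null;
          boolean duplicateOfOlder = older != null
              && Bytes.equals(cur.getRow(), older.getRow())
              && Bytes.equals(cur.getFamily(), older.getFamily())
              && Bytes.equals(cur.getQualifier(), older.getQualifier())
              && Bytes.equals(cur.getValue(), older.getValue());
          if (!duplicateOfOlder) {
            results.add(cur);
          }
        }
      }

      @Override
      public void close() throws IOException {
        scanner.close();
      }
    };
  }
}

A run of identical values that happens to be split across two next() batches
would keep one extra copy, but that only errs on the side of keeping data.
You can load the observer per table through the table descriptor's
coprocessor attribute (or via alter with a 'coprocessor' table attribute from
the shell).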

-Anoop-

On Sun, Mar 3, 2013 at 9:12 AM, Matt Corgan <mcorgan@hotpads.com> wrote:

> I have a few use cases where I'd like to leverage HBase's high write
> throughput to blindly write lots of data even if most of it hasn't changed
> since the last write.  I want to retain MAX_VERSIONS=Integer.MAX_VALUE;
> however, I don't want to keep all the duplicate copies around forever.  At
> compaction time, I'd like the compactor to compare the values of cells with
> the same row/family/qualifier and only keep the *oldest* version of
> duplicates.  By keeping the oldest versions I can get a snapshot of a row
> at any historical time.
>
> Lars, I think you said Salesforce retains many versions of cells - do you
> retain all the duplicates?
>
> I'm guessing co-processors would be the solution and am looking for some
> pointers on the cleanest way to implement it or some code if anyone has
> already solved the problem.
>
> I'm also wondering if people think it's a generic enough use case that
> HBase could support it natively, say, with a column family attribute
> DISCARD_NEWEST_DUPLICATE=true/false.  The cost would be higher CPU usage at
> compaction time because of all the value comparisons.
>
> Thanks for any tips,
> Matt
>
