hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sagar naik <sn...@attributor.com>
Subject Re: Counter columns in hbase
Date Tue, 06 Sep 2011 18:01:20 GMT
Thanks for thoughtful response, Arvind

Comments inline:

On Sun, Sep 4, 2011 at 12:17 AM, Arvind Jayaprakash <work@anomalizer.net>wrote:

> On Sep 02, sagar naik wrote:
> >We are counting events for our application.
> >Sometimes, the same event arrives multiple times.
> >This leads to counting of same event multiple times.
> >Is there a way I can avoid this ?
> >(Say timestamp on value or filters ?)
>
> A basic data model question is do you want a set that supports a "count"
> operation or an accumulator?
>
> An accumulator (counter) is a plain integer where you just get to bump
> up counts. What contributes to the counts is outside the purview of this
> scheme.
>
> A set on the other hand is used for accumulating items that have an
> idenity. When you say "same event, multiple times", it sounds like the
> events have some identity and that you wish to de-duplicate the counting
> activity based on this idenity. The *only* way of implementing this
> accurately is you keep adding these idenities to a set. When you need a
> count of how many events have occured, then the cardinality of the set
> holds the answer.
>
> The principal difference is an O(1) solution v/s an O(n) solution with
> respect to memory requirements. The question to ask is if you would like
> to bear this cost in hbase or some layer that pre-processes data before
> it hits hbase. Note that hbase does not have an efficient support for
> "count (*)" operation in general.
>
> An approximate pre-processing solution can be as follows:
> * maintain a TTL based cache of all identities.
> * Do an atomic put-if-absent operation on this cache
> * Increment the counter if the put succeeds.
>

For this solution, I have to maintain a synchronization between the cache
and counter increments
( I dont see how to implement tht, in  a distributed env)
On the other hand:

I can dedup based on timestamp of the event.
Can I increment the counter value and assign the version as the timestamp of
this event ?
Is it possible to implement this via a filter ?



>
>
> This technique will bounded memory usage and a reasonable level of
> accuracy if the following assumptions hold good:
>
> * The arrival rate of events has a predictable & reasonable upper bound
> * The duplicate events arrive at roughly the same time
>

Thanks Again

-Sagar

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message