From Andrew Purtell
Subject Re: Distinct counters and counting rows
Date Wed, 30 May 2012 22:32:55 GMT
A common question about HBase is whether statistics on row index cardinality
are available.

The short answer is no. In some sense each HBase table region is its own
database, and each region is held partly in memory and partly on disk (in
log-structured form), possibly including tombstones. Discovering the count
of all unique keys in the full table therefore requires that the client
iterate over all rows in all regions. Only then might all live row keys be
found.
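
To make the point above concrete, here is a minimal sketch in plain Java.
It is a toy model, not HBase code: the class name, the "store" maps, and
the demo data are all hypothetical. Each region is a list of stores (the
in-memory buffer first, then files from newest to oldest), and each store
maps a row key to a flag saying whether its newest entry there is a
tombstone. Per-store key counts cannot simply be added up, because the same
row may appear in several stores and may have been deleted since.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only: hypothetical names, not HBase classes. */
final class LiveRowCountSketch {
    // A store maps row key -> true if its newest entry there is a tombstone.
    // A region is a list of stores, newest first (memory, then files).
    static List<List<Map<String, Boolean>>> demoRegions() {
        return List.of(
            List.of(Map.of("a", true),                 // in memory: delete of "a"
                    Map.of("a", false, "b", false),    // newer file
                    Map.of("b", false, "c", false)),   // older file
            List.of(Map.of("x", false),
                    Map.of("x", false, "y", false)));
    }

    /** Visits every entry of every store in every region, as a full scan would. */
    static long countLiveRows(List<List<Map<String, Boolean>>> regions) {
        long live = 0;
        for (List<Map<String, Boolean>> stores : regions) {
            Set<String> seen = new HashSet<>();        // regions hold disjoint keys
            for (Map<String, Boolean> store : stores) {
                for (Map.Entry<String, Boolean> e : store.entrySet()) {
                    // Only the newest entry for a row decides whether it is live.
                    if (seen.add(e.getKey()) && !e.getValue()) {
                        live++;
                    }
                }
            }
        }
        return live;
    }

    /** Naive shortcut: add up the number of keys each store knows about. */
    static long sumOfStoreKeyCounts(List<List<Map<String, Boolean>>> regions) {
        long sum = 0;
        for (List<Map<String, Boolean>> stores : regions) {
            for (Map<String, Boolean> store : stores) {
                sum += store.size();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<Map<String, Boolean>>> regions = demoRegions();
        System.out.println("live rows: " + countLiveRows(regions));
        System.out.println("sum of per-store key counts: " + sumOfStoreKeyCounts(regions));
    }
}
```

With the demo data the full pass finds 4 live rows (b, c, x, y), while the
per-store key counts add up to 8.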

However, as others have mentioned, the coprocessor framework can help someone
implement fast counting. When a region is first opened, all data is in
HFiles, and each HFile knows the number of keys within it (though not unique
keys at the moment). So a coprocessor could add new metadata (a unique row
key count) to HFiles when writing them, at flush and compaction times, then
load and sum such counts at region open time, and then maintain a
probabilistic count at runtime using available Bloom filters as new entries
are stored into the MemStore*. The exact count would be available again upon
the next open.
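
The scheme described above might look roughly like the following. This is
only a sketch in plain Java under stated assumptions: RowCountSketch,
FileModel, and the hashing scheme are hypothetical stand-ins, not HBase
APIs. Each modelled file carries a unique-row count (the proposed metadata)
and a small row Bloom filter. Opening the region sums the stored counts, and
each new row written afterwards bumps the count only when no file's filter
might contain it. In this toy model the sum at open is exact only if no row
appears in more than one file, Bloom false positives make the running count
an underestimate, and deletes are ignored entirely.

```java
import java.util.BitSet;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: hypothetical names, not HBase APIs. */
final class RowCountSketch {
    static final int BITS = 1 << 16;

    /** Three bit positions per key, derived from two simple hashes. */
    static int[] hashes(String key) {
        int h1 = key.hashCode();
        int h2 = h1 * 31 + 17;
        return new int[] {
            Math.floorMod(h1, BITS),
            Math.floorMod(h1 + h2, BITS),
            Math.floorMod(h1 + 2 * h2, BITS)
        };
    }

    /** Stand-in for a flushed file: unique-row metadata plus a row Bloom filter. */
    static final class FileModel {
        final long uniqueRows;                 // written once, at flush or compaction
        final BitSet bloom = new BitSet(BITS);

        FileModel(Collection<String> rowKeys) {
            Set<String> unique = new HashSet<>(rowKeys);
            uniqueRows = unique.size();
            for (String key : unique) {
                for (int bit : hashes(key)) {
                    bloom.set(bit);
                }
            }
        }

        boolean mightContain(String key) {
            for (int bit : hashes(key)) {
                if (!bloom.get(bit)) {
                    return false;              // definitely not in this file
                }
            }
            return true;                       // possibly present (or a false positive)
        }
    }

    private final List<FileModel> files;
    private final Set<String> bufferedRows = new HashSet<>();  // rows written since open
    private long estimate;

    /** Region open: load and sum the per-file counts. */
    RowCountSketch(List<FileModel> files) {
        this.files = files;
        for (FileModel f : files) {
            estimate += f.uniqueRows;
        }
    }

    /** Runtime: count a row only if it is new to the buffer and to every filter. */
    void put(String rowKey) {
        if (!bufferedRows.add(rowKey)) {
            return;
        }
        for (FileModel f : files) {
            if (f.mightContain(rowKey)) {
                return;
            }
        }
        estimate++;
    }

    long estimate() {
        return estimate;
    }

    public static void main(String[] args) {
        RowCountSketch region = new RowCountSketch(List.of(
            new FileModel(List.of("a", "b", "a")),
            new FileModel(List.of("c"))));
        region.put("a");   // already on disk according to the first file's filter
        region.put("d");   // not in any filter, so counted as new
        System.out.println("estimated rows: " + region.estimate());
    }
}
```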

*- Though offhand I'm not sure what to do about deletes.

If someone does end up implementing something like this, please consider
contributing it back because it's not uncommonly discussed.

    - Andy

On Wednesday, May 30, 2012, Ramkrishna.S.Vasudevan wrote:

> To answer this question:
> Alternatively, is there a way to trigger an increment in another table (say
> "count") whenever a row was added to "user"?
> You can try to use coprocessors here. Once a put is done to the table
> 'user', the coprocessor hooks let you trigger an Increment() operation
> on table 'count'.
> This can be done in one call from the client. Also, the Increment() operation
> guarantees atomicity.
> Hope this helps.
> Regards
> Ram
> > -----Original Message-----
> > From: David Koch [mailto:ogdude@googlemail.com]
> > Sent: Wednesday, May 30, 2012 12:47 PM
> > To: user@hbase.apache.org
> > Subject: Distinct counters and counting rows
> >
> > Hello,
> >
> > I am testing HBase for distinct counters - more concretely, counting
> > unique users from a fairly large stream of user_ids. For some time to
> > come the volume will be limited enough to use exact counting rather
> > than approximation but already it's too big to hold the entire set of
> > user_ids in memory.
> >
> > For now I am basically inserting all elements from the stream into a
> > "user" table which has row key "user_id" so as to enforce the unique
> > constraint.
> >
> > My question:
> > a) Is there a way to get a quick (i.e., with small delay in a user
> > interface) count of the size of the user table to return the number of
> > users? Alternatively, is there a way to trigger an increment in
> > another table (say "count") whenever a row was added to "user"? I
> > guess this can be picked up eventually by the client application but I
> > don't want this to delay the actual stream processing.
> > b) I heard about Bloom filters in HBase but failed to understand if
> > they are used for row keys as well. Are they? How do I activate it? I
> > was looking to reduce the work-load of checking set membership for
> > every user_id in the stream. If this is done by HBase internally even
> > better.
> > c) Eventually, I want to store distinct users by day and then do
> > unions on different days to get the total amount of unique users for a
> > multi-day period. Is this likely to involve a Map Reduce or is there a
> > more "light-weight" approach?
> >
> > Thank you,
> >
> > /David
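
Ram's suggestion quoted above (a hook that runs after each put to 'user' and
increments a counter in 'count') can be modelled in a few lines of plain
Java. This is a toy sketch, not the HBase coprocessor API: PutHookSketch and
its methods are hypothetical, a ConcurrentHashMap stands in for the 'user'
table, and an AtomicLong stands in for one counter cell in 'count'. The
sketch assumes the hook can tell whether the row was new. Incrementing
unconditionally would count puts rather than distinct users.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Toy model only: hypothetical names, not the HBase coprocessor API. */
final class PutHookSketch {
    private final ConcurrentHashMap<String, String> userTable = new ConcurrentHashMap<>();
    private final AtomicLong countCell = new AtomicLong();   // one cell in "count"

    /** Stand-in for a server-side hook that runs after each put to "user". */
    private void afterPut(boolean rowWasNew) {
        if (rowWasNew) {
            countCell.incrementAndGet();   // atomic, like the increment in the reply
        }
    }

    /** The client makes a single call; the hook runs as part of it. */
    void put(String userId, String value) {
        boolean rowWasNew = userTable.put(userId, value) == null;
        afterPut(rowWasNew);
    }

    long distinctUsers() {
        return countCell.get();
    }

    public static void main(String[] args) {
        PutHookSketch table = new PutHookSketch();
        table.put("u1", "first visit");
        table.put("u2", "first visit");
        table.put("u1", "second visit");   // same row key, so not counted again
        System.out.println("distinct users: " + table.distinctUsers());
    }
}
```

Reading the count is then a single lookup rather than a scan of the 'user'
table.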

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

