incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yuki Morishita <>
Subject Re: Distinct Counter Proposal for Cassandra
Date Wed, 13 Jun 2012 17:00:41 GMT
You can open JIRA ticket at with your proposal.

Just for the input:

I had once implemented HyperLogLog counter to use internally in Cassandra, but it turned out
I didn't need it so I just put it to gist. You can find it here:

The above implementation and most of the other ones (including stream-lib) implement the optimized
version of the algorithm which counts up to 10^9, so may need some work.

Other alternative is self-learning bitmap (
which, in my understanding, is more memory efficient when counting small values.


On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Top├žu wrote:

> Hi All,
> Let's assume we have a use case where we need to count the number of columns for a given
key. Let's say the key is the URL and the column-name is the IP address or any cardinality
> The straight forward implementation seems to be simple, just inserting the IP Adresses
as columns under the key defined by the URL and using get_count to count them back. However
the problem here is in case of large rows (where too many IP addresses are in); the get_count
method has to de-serialize the whole row and calculate the count. As also defined in the user
guides, it's not an O(1) operation and it's quite costly.
> However, this problem seems to have better solutions if you don't have a strict requirement
for the count to be exact. There are streaming algorithms that will provide good cardinality
estimations within a predefined failure rate, I think the most popular one seems to be the
(Hyper)LogLog algorithm, also there's an optimal one developed recently, please check
> If you want to take a look at the Java implementation for LogLog, Clearspring has both
LogLog and space optimized HyperLogLog available at
> I don't see a reason why this can't be implemented in Cassandra. The distributed nature
of all these algorithms can easily be adapted to Cassandra's model. I think most of us would
love to see come cardinality estimating columns in Cassandra.
> Regards,
> Utku

View raw message