Hi Yuki,

I think I should have used the word discussion instead of proposal for the mailing subject. I have quite some of a design in my mind but I think it's not yet ripe enough to formalize. I'll try to simplify it and open a Jira ticket.
But first I'm wondering if there would be any excitement in the community for such a feature.

Regards,
Utku

On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mor.yuki@gmail.com> wrote:
You can open JIRA ticket at https://issues.apache.org/jira/browse/CASSANDRA with your proposal.

Just for the input:

I had once implemented HyperLogLog counter to use internally in Cassandra, but it turned out I didn't need it so I just put it to gist. You can find it here: https://gist.github.com/2597943

The above implementation and most of the other ones (including stream-lib) implement the optimized version of the algorithm which counts up to 10^9, so may need some work.

Other alternative is self-learning bitmap (http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my understanding, is more memory efficient when counting small values.

Yuki

On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:

Hi All,

Let's assume we have a use case where we need to count the number of columns for a given key. Let's say the key is the URL and the column-name is the IP address or any cardinality identifier.

The straight forward implementation seems to be simple, just inserting the IP Adresses as columns under the key defined by the URL and using get_count to count them back. However the problem here is in case of large rows (where too many IP addresses are in); the get_count method has to de-serialize the whole row and calculate the count. As also defined in the user guides, it's not an O(1) operation and it's quite costly.

However, this problem seems to have better solutions if you don't have a strict requirement for the count to be exact. There are streaming algorithms that will provide good cardinality estimations within a predefined failure rate, I think the most popular one seems to be the (Hyper)LogLog algorithm, also there's an optimal one developed recently, please check http://dl.acm.org/citation.cfm?doid=1807085.1807094

If you want to take a look at the Java implementation for LogLog, Clearspring has both LogLog and space optimized HyperLogLog available at https://github.com/clearspring/stream-lib

I don't see a reason why this can't be implemented in Cassandra. The distributed nature of all these algorithms can easily be adapted to Cassandra's model. I think most of us would love to see come cardinality estimating columns in Cassandra.

Regards,
Utku