Hi All,

Let's assume we have a use case where we need to count the number of columns for a given key. Let's say the key is the URL and the column-name is the IP address or any cardinality identifier.

The straight forward implementation seems to be simple, just inserting the IP Adresses as columns under the key defined by the URL and using get_count to count them back. However the problem here is in case of large rows (where too many IP addresses are in); the get_count method has to de-serialize the whole row and calculate the count. As also defined in the user guides, it's not an O(1) operation and it's quite costly.

However, this problem seems to have better solutions if you don't have a strict requirement for the count to be exact. There are streaming algorithms that will provide good cardinality estimations within a predefined failure rate, I think the most popular one seems to be the (Hyper)LogLog algorithm, also there's an optimal one developed recently, please check http://dl.acm.org/citation.cfm?doid=1807085.1807094

If you want to take a look at the Java implementation for LogLog, Clearspring has both LogLog and space optimized HyperLogLog available at https://github.com/clearspring/stream-lib

I don't see a reason why this can't be implemented in Cassandra. The distributed nature of all these algorithms can easily be adapted to Cassandra's model. I think most of us would love to see come cardinality estimating columns in Cassandra.