incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Utku Can Top├žu <u...@topcu.gen.tr>
Subject Distinct Counter Proposal for Cassandra
Date Wed, 13 Jun 2012 16:28:28 GMT
Hi All,

Let's assume we have a use case where we need to count the number of
columns for a given key. Let's say the key is the URL and the column-name
is the IP address or any cardinality identifier.

The straight forward implementation seems to be simple, just inserting the
IP Adresses as columns under the key defined by the URL and using get_count
to count them back. However the problem here is in case of large rows
(where too many IP addresses are in); the get_count method has to
de-serialize the whole row and calculate the count. As also defined in the
user guides, it's not an O(1) operation and it's quite costly.

However, this problem seems to have better solutions if you don't have a
strict requirement for the count to be exact. There are streaming
algorithms that will provide good cardinality estimations within a
predefined failure rate, I think the most popular one seems to be the
(Hyper)LogLog algorithm, also there's an optimal one developed recently,
please check http://dl.acm.org/citation.cfm?doid=1807085.1807094

If you want to take a look at the Java implementation for LogLog,
Clearspring has both LogLog and space optimized HyperLogLog available at
https://github.com/clearspring/stream-lib

I don't see a reason why this can't be implemented in Cassandra. The
distributed nature of all these algorithms can easily be adapted to
Cassandra's model. I think most of us would love to see come cardinality
estimating columns in Cassandra.

Regards,
Utku

Mime
View raw message