cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Utku Can Topçu <u...@topcu.gen.tr>
Subject Re: Distinct Counter Proposal for Cassandra
Date Wed, 13 Jun 2012 17:15:03 GMT
Hi Yuki,

I think I should have used the word discussion instead of proposal for the
mailing subject. I have quite some of a design in my mind but I think it's
not yet ripe enough to formalize. I'll try to simplify it and open a Jira
ticket.
But first I'm wondering if there would be any excitement in the community
for such a feature.

Regards,
Utku

On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mor.yuki@gmail.com> wrote:

> You can open JIRA ticket at
> https://issues.apache.org/jira/browse/CASSANDRA with your proposal.
>
> Just for the input:
>
> I had once implemented HyperLogLog counter to use internally in Cassandra,
> but it turned out I didn't need it so I just put it to gist. You can find
> it here: https://gist.github.com/2597943
>
> The above implementation and most of the other ones (including stream-lib)
> implement the optimized version of the algorithm which counts up to 10^9,
> so may need some work.
>
> Other alternative is self-learning bitmap (
> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my
> understanding, is more memory efficient when counting small values.
>
> Yuki
>
> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:
>
> Hi All,
>
> Let's assume we have a use case where we need to count the number of
> columns for a given key. Let's say the key is the URL and the column-name
> is the IP address or any cardinality identifier.
>
> The straight forward implementation seems to be simple, just inserting the
> IP Adresses as columns under the key defined by the URL and using get_count
> to count them back. However the problem here is in case of large rows
> (where too many IP addresses are in); the get_count method has to
> de-serialize the whole row and calculate the count. As also defined in the
> user guides, it's not an O(1) operation and it's quite costly.
>
> However, this problem seems to have better solutions if you don't have a
> strict requirement for the count to be exact. There are streaming
> algorithms that will provide good cardinality estimations within a
> predefined failure rate, I think the most popular one seems to be the
> (Hyper)LogLog algorithm, also there's an optimal one developed recently,
> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094
>
> If you want to take a look at the Java implementation for LogLog,
> Clearspring has both LogLog and space optimized HyperLogLog available at
> https://github.com/clearspring/stream-lib
>
> I don't see a reason why this can't be implemented in Cassandra. The
> distributed nature of all these algorithms can easily be adapted to
> Cassandra's model. I think most of us would love to see come cardinality
> estimating columns in Cassandra.
>
> Regards,
> Utku
>
>
>

Mime
View raw message