Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id EA7FDC046 for ; Wed, 13 Jun 2012 17:15:53 +0000 (UTC) Received: (qmail 67074 invoked by uid 500); 13 Jun 2012 17:15:51 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 67054 invoked by uid 500); 13 Jun 2012 17:15:51 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 67045 invoked by uid 99); 13 Jun 2012 17:15:51 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 Jun 2012 17:15:51 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=FSL_RCVD_USER,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [209.85.216.172] (HELO mail-qc0-f172.google.com) (209.85.216.172) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 Jun 2012 17:15:44 +0000 Received: by qcsq13 with SMTP id q13so544679qcs.31 for ; Wed, 13 Jun 2012 10:15:24 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:x-originating-ip:in-reply-to:references:from:date :message-id:subject:to:content-type:x-gm-message-state; bh=f9BZSv7Q89AjD+Bbm7btyszO/cpWcutnUF2XosnOFu4=; b=RN4rQzvPFcpFYCavxJTNF7peKvAZbVk+4eHEvJcbdBAz0YRnZkE3uCbobmT1krK71m aa8H/g+p5AownD5VWxFY9Onu4/fBFaaevDImzYtSLJtTUmToIRpq7SqdDVk9nh6lFPDB 2j4H8JoD1lTr2KVAys5ErPtFt+27jFzrWwNOLlv0+M77S6RZGTgpc9V0DQjVzIhJwrBN F1FlNUejefTRHFdwRIDesV0ykEDwAh6O5tVw+t7IwWj7nu+vr/YftK5jEYNOMz+EEqwh z2TxcUlo8SC54Pr4i5QQf+w3nBnfYi7eGcxEhWhaTH9Wpezx+60pAvsjueFWrO+km92f wODw== Received: by 10.229.135.202 with SMTP id o10mr1216835qct.20.1339607723906; Wed, 13 Jun 2012 10:15:23 -0700 (PDT) MIME-Version: 1.0 Received: by 10.229.49.1 with HTTP; Wed, 13 Jun 2012 10:15:03 -0700 (PDT) X-Originating-IP: [87.249.118.67] In-Reply-To: <62C288C441C240D9998F0F6066F9A487@gmail.com> References: <62C288C441C240D9998F0F6066F9A487@gmail.com> From: =?UTF-8?Q?Utku_Can_Top=C3=A7u?= Date: Wed, 13 Jun 2012 19:15:03 +0200 Message-ID: Subject: Re: Distinct Counter Proposal for Cassandra To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=00248c6a66ba85960404c25dbad5 X-Gm-Message-State: ALoCoQnZOUBFyBiEhOYmMdEXgWsEL+IfB0K9D+yz+Le7/4A2kQchhTjPsm7WxPNZCatuvRwfXGKl --00248c6a66ba85960404c25dbad5 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Hi Yuki, I think I should have used the word discussion instead of proposal for the mailing subject. I have quite some of a design in my mind but I think it's not yet ripe enough to formalize. I'll try to simplify it and open a Jira ticket. But first I'm wondering if there would be any excitement in the community for such a feature. Regards, Utku On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita wrote: > You can open JIRA ticket at > https://issues.apache.org/jira/browse/CASSANDRA with your proposal. > > Just for the input: > > I had once implemented HyperLogLog counter to use internally in Cassandra= , > but it turned out I didn't need it so I just put it to gist. You can find > it here: https://gist.github.com/2597943 > > The above implementation and most of the other ones (including stream-lib= ) > implement the optimized version of the algorithm which counts up to 10^9, > so may need some work. > > Other alternative is self-learning bitmap ( > http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my > understanding, is more memory efficient when counting small values. > > Yuki > > On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Top=C3=A7u wrote: > > Hi All, > > Let's assume we have a use case where we need to count the number of > columns for a given key. Let's say the key is the URL and the column-name > is the IP address or any cardinality identifier. > > The straight forward implementation seems to be simple, just inserting th= e > IP Adresses as columns under the key defined by the URL and using get_cou= nt > to count them back. However the problem here is in case of large rows > (where too many IP addresses are in); the get_count method has to > de-serialize the whole row and calculate the count. As also defined in th= e > user guides, it's not an O(1) operation and it's quite costly. > > However, this problem seems to have better solutions if you don't have a > strict requirement for the count to be exact. There are streaming > algorithms that will provide good cardinality estimations within a > predefined failure rate, I think the most popular one seems to be the > (Hyper)LogLog algorithm, also there's an optimal one developed recently, > please check http://dl.acm.org/citation.cfm?doid=3D1807085.1807094 > > If you want to take a look at the Java implementation for LogLog, > Clearspring has both LogLog and space optimized HyperLogLog available at > https://github.com/clearspring/stream-lib > > I don't see a reason why this can't be implemented in Cassandra. The > distributed nature of all these algorithms can easily be adapted to > Cassandra's model. I think most of us would love to see come cardinality > estimating columns in Cassandra. > > Regards, > Utku > > > --00248c6a66ba85960404c25dbad5 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Hi Yuki,

I think I should have used the word discussion instead of p= roposal for the mailing subject. I have quite some of a design in my mind b= ut I think it's not yet ripe enough to formalize. I'll try to simpl= ify it and open a Jira ticket.
But first I'm wondering if there would be any excitement in the communi= ty for such a feature.

Regards,
Utku

On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <= ;mor.yuki@gmail.com= > wrote:
You can open JIRA ticket at=C2=A0https://issues.apa= che.org/jira/browse/CASSANDRA=C2=A0with your proposal.
<= br>
Just for the input:

I had once implemented Hy= perLogLog counter to use internally in Cassandra, but it turned out I didn&= #39;t need it so I just put it to gist. You can find it here:=C2=A0https://gist.github.= com/2597943

The above implementation and most of the other ones (in= cluding stream-lib) implement the optimized version of the algorithm which = counts up to 10^9, so may need some work.

Other al= ternative is self-learning bitmap (http://ect.bell-labs.com/who/aychen= /sbitmap4p.pdf) which, in my understanding, is more memory efficient wh= en counting small values.

Yuki

=20

On Wednesday, June 13, 2012 at 1= 1:28 AM, Utku Can Top=C3=A7u wrote:

Hi All,

Let's assume we have= a use case where we need to count the number of columns for a given key. L= et's say the key is the URL and the column-name is the IP address or an= y cardinality identifier.

The straight forward implementation seems to be simple, just inserting the = IP Adresses as columns under the key defined by the URL and using get_count= to count them back. However the problem here is in case of large rows (whe= re too many IP addresses are in); the get_count method has to de-serialize = the whole row and calculate the count. As also defined in the user guides, = it's not an O(1) operation and it's quite costly.

However, this problem seems to have better solutions if you don't h= ave a strict requirement for the count to be exact. There are streaming alg= orithms that will provide good cardinality estimations within a predefined = failure rate, I think the most popular one seems to be the (Hyper)LogLog al= gorithm, also there's an optimal one developed recently, please check <= a href=3D"http://dl.acm.org/citation.cfm?doid=3D1807085.1807094" target=3D"= _blank">http://dl.acm.org/citation.cfm?doid=3D1807085.1807094

If you want to take a look at the Java implementation for LogLog, Clear= spring has both LogLog and space optimized HyperLogLog available at https://gi= thub.com/clearspring/stream-lib

I don't see a reason why this can't be implemented in Cassandra= . The distributed nature of all these algorithms can easily be adapted to C= assandra's model. I think most of us would love to see come cardinality= estimating columns in Cassandra.

Regards,
Utku
=20 =20 =20 =20
=20


--00248c6a66ba85960404c25dbad5--