From user-return-27285-apmail-cassandra-user-archive=cassandra.apache.org@cassandra.apache.org Fri Jun 29 13:31:43 2012 Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 0056FD22E for ; Fri, 29 Jun 2012 13:31:43 +0000 (UTC) Received: (qmail 10742 invoked by uid 500); 29 Jun 2012 13:31:40 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 10717 invoked by uid 500); 29 Jun 2012 13:31:40 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 10709 invoked by uid 99); 29 Jun 2012 13:31:40 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Jun 2012 13:31:40 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FSL_RCVD_USER,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of chris.burroughs@gmail.com designates 209.85.213.42 as permitted sender) Received: from [209.85.213.42] (HELO mail-yw0-f42.google.com) (209.85.213.42) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Jun 2012 13:31:35 +0000 Received: by yhfq11 with SMTP id q11so3632239yhf.29 for ; Fri, 29 Jun 2012 06:31:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=7+E56cMlYFpDbiKcNBtYHkIGECYhUyWt47cXfHKn8oU=; b=u92awaJ1GKjR2Dg1uFdtjvzz4iKYG5vhNEYYdIx+xIUiJn/vyHT8JTCx4LYm7gtJRf aTktwkUpMuqSC+qu65Vxl1/pmWEalp5EwNavQteIIIzMuRR8U9lQWrg/ZiXf/PMjUaog QafjutlvBpOrXvc6+pWUEfcgZdjG7TcVAsjxY1Nl971QHLlFpZ7cTGw14N6HWy/LiCia JJuh0uQk2EieqMGjRTHO09BpCcdOa8VRnLX7Zo+klU4Nycm2r49fmfV0TCS7rsLG8bmW wQghXUSZvr5oDMwohmH6zhy6E6XEQeRPqgVeuuHaK3P7sh9uNGaMMhBk4gko7txS4lwg OXJA== Received: by 10.236.76.165 with SMTP id b25mr2698682yhe.0.1340976675187; Fri, 29 Jun 2012 06:31:15 -0700 (PDT) Received: from [10.10.49.6] (cl-pat-tr.clearspring.com. [8.18.54.254]) by mx.google.com with ESMTPS id c28sm5623052yhk.2.2012.06.29.06.31.13 (version=SSLv3 cipher=OTHER); Fri, 29 Jun 2012 06:31:14 -0700 (PDT) Message-ID: <4FEDAE21.6020908@gmail.com> Date: Fri, 29 Jun 2012 09:31:13 -0400 From: Chris Burroughs User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: user@cassandra.apache.org CC: =?UTF-8?B?VXRrdSBDYW4gVG9ww6d1?= Subject: Re: Distinct Counter Proposal for Cassandra References: <62C288C441C240D9998F0F6066F9A487@gmail.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Virus-Checked: Checked by ClamAV on apache.org Well I obviously think it would be handy. If this get's proposed and end's up using stream-lib don't be shy about asking for help. On a more general note, it would be great to see the special case Counter code become more general atomic operation code. On 06/13/2012 01:15 PM, Utku Can Topçu wrote: > Hi Yuki, > > I think I should have used the word discussion instead of proposal for the > mailing subject. I have quite some of a design in my mind but I think it's > not yet ripe enough to formalize. I'll try to simplify it and open a Jira > ticket. > But first I'm wondering if there would be any excitement in the community > for such a feature. > > Regards, > Utku > > On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita wrote: > >> You can open JIRA ticket at >> https://issues.apache.org/jira/browse/CASSANDRA with your proposal. >> >> Just for the input: >> >> I had once implemented HyperLogLog counter to use internally in Cassandra, >> but it turned out I didn't need it so I just put it to gist. You can find >> it here: https://gist.github.com/2597943 >> >> The above implementation and most of the other ones (including stream-lib) >> implement the optimized version of the algorithm which counts up to 10^9, >> so may need some work. >> >> Other alternative is self-learning bitmap ( >> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my >> understanding, is more memory efficient when counting small values. >> >> Yuki >> >> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote: >> >> Hi All, >> >> Let's assume we have a use case where we need to count the number of >> columns for a given key. Let's say the key is the URL and the column-name >> is the IP address or any cardinality identifier. >> >> The straight forward implementation seems to be simple, just inserting the >> IP Adresses as columns under the key defined by the URL and using get_count >> to count them back. However the problem here is in case of large rows >> (where too many IP addresses are in); the get_count method has to >> de-serialize the whole row and calculate the count. As also defined in the >> user guides, it's not an O(1) operation and it's quite costly. >> >> However, this problem seems to have better solutions if you don't have a >> strict requirement for the count to be exact. There are streaming >> algorithms that will provide good cardinality estimations within a >> predefined failure rate, I think the most popular one seems to be the >> (Hyper)LogLog algorithm, also there's an optimal one developed recently, >> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094 >> >> If you want to take a look at the Java implementation for LogLog, >> Clearspring has both LogLog and space optimized HyperLogLog available at >> https://github.com/clearspring/stream-lib >> >> I don't see a reason why this can't be implemented in Cassandra. The >> distributed nature of all these algorithms can easily be adapted to >> Cassandra's model. I think most of us would love to see come cardinality >> estimating columns in Cassandra. >> >> Regards, >> Utku >> >> >> >