Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8B9C0DC50 for ; Fri, 29 Jun 2012 13:28:17 +0000 (UTC) Received: (qmail 97663 invoked by uid 500); 29 Jun 2012 13:28:11 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 97616 invoked by uid 500); 29 Jun 2012 13:28:11 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 97559 invoked by uid 99); 29 Jun 2012 13:28:11 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Jun 2012 13:28:11 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FSL_RCVD_USER,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of chris.burroughs@gmail.com designates 209.85.213.172 as permitted sender) Received: from [209.85.213.172] (HELO mail-yx0-f172.google.com) (209.85.213.172) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Jun 2012 13:28:04 +0000 Received: by yenq13 with SMTP id q13so3007983yen.31 for ; Fri, 29 Jun 2012 06:27:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=1eZ+UupIe5symtJPt+2xBZ3tPRIITYW9deciHt1mR90=; b=kK/7iDFe2CD1y7SzQ/K2FfWMjNhhXZHpbetWaNg+AlnRkwMiuDVGNZnNkxXSX6jRLe 3krA9pPmnTFLt0QEObhUd3FDlZBUq7gXIrDK6j2Of14sQSL4KoCMxwXFsZrJtFcWfi2r r/t4tZsXa/PnOZplViUbpqJ7ewAYIpYCcRfPWi67tZV3nG3qsl+d7MPl5ZGGflf0hkGk hsXqYoA8wIEVpaN4oralUycy15Dyn+d7EFk2jY29hXNE72j6cncMXoXWFOeN0J12AE6E WFsAOSnfeZBMBcgmuLuUD2HuA/8LwMZiM7XZlYxAv7ejcWoqviqUW7KWOGk1ZLrAoHap sxGw== Received: by 10.236.77.134 with SMTP id d6mr2580879yhe.79.1340976463900; Fri, 29 Jun 2012 06:27:43 -0700 (PDT) Received: from [10.10.49.6] (cl-pat-tr.clearspring.com. [8.18.54.254]) by mx.google.com with ESMTPS id f68sm5560932yhh.22.2012.06.29.06.27.42 (version=SSLv3 cipher=OTHER); Fri, 29 Jun 2012 06:27:43 -0700 (PDT) Message-ID: <4FEDAD4D.9070309@gmail.com> Date: Fri, 29 Jun 2012 09:27:41 -0400 From: Chris Burroughs User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: user@cassandra.apache.org CC: Yuki Morishita Subject: Re: Distinct Counter Proposal for Cassandra References: <62C288C441C240D9998F0F6066F9A487@gmail.com> In-Reply-To: <62C288C441C240D9998F0F6066F9A487@gmail.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit On 06/13/2012 01:00 PM, Yuki Morishita wrote: > The above implementation and most of the other ones (including stream-lib) implement the optimized version of the algorithm which counts up to 10^9, so may need some work. > > Other alternative is self-learning bitmap (http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my understanding, is more memory efficient when counting small values. The closest we could get to a one-size fits all would probably be an adaptive counting scheme that uses linear counting (or self-learning bitmap, didn't know about that one!) for small expected cardinalities and a LogLog variant for higher ones. It's more choices to make, but choosing between "not too big" and "really really big" doesn't seem like an unreasonable burden to me.