From: Janne Jalkanen
To: user@cassandra.apache.org
Subject: Re: Billions of counters
Date: Thu, 13 Jun 2013 21:51:30 +0300

Hi!

We have a similar situation of millions of events on millions of items. It turns out that this isn't really a problem, because there tends to be a very strong power-law distribution: very few of the items get a lot of hits, some get some, and the majority gets no hits (though most of them do get hits every now and then). So it's basically a sparse multidimensional array, and it turns out that Cassandra is pretty good at storing those. We just treat a missing counter column as zero, and add a counter only when necessary. To avoid I/O, we also do some statistical sampling for certain counters where we don't need an exact figure.

YMMV, of course, but I'd look at the likelihood of all the products being purchased from the same location during one week at least once and start the modeling from there. :)

/Janne

On 13 Jun 2013, at 21:19, Darren Smythe wrote:

> We want to precalculate counts for some common metrics for usage. We have events, locations, products, etc. The problem is we have millions of events/day, thousands of locations and millions of products.
>
> We're trying to precalculate counts for some common queries like 'how many times was product X purchased in location Y last week'.
>
> It seems like we'll end up with trillions of counters for even these basic permutations. Is this a cause for concern?
>
> TIA
>
> -- Darren
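
For illustration, here is a minimal sketch of the two ideas in the reply: store a counter only once it has been hit (a missing counter reads as zero), and optionally sample writes when an approximate figure is enough. It is an in-memory stand-in, not Cassandra code. The class name, the (product, location, week) key, and the sample_rate parameter are all hypothetical, and the sampling scheme shown (write with probability p, increment by 1/p) is one common choice rather than necessarily what the poster used.

```python
import random


class SparseSampledCounters:
    """In-memory stand-in for a sparse counter table (hypothetical helper)."""

    def __init__(self, sample_rate=1.0, rng=None):
        if not 0.0 < sample_rate <= 1.0:
            raise ValueError("sample_rate must be in (0, 1]")
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        # Only keys that were actually written are stored.
        self.counts = {}

    def hit(self, product, location, week):
        # Sampling: skip most writes, and scale the ones we keep by
        # 1 / sample_rate so the stored value stays an unbiased estimate.
        if self.sample_rate < 1.0 and self.rng.random() >= self.sample_rate:
            return
        key = (product, location, week)
        self.counts[key] = self.counts.get(key, 0.0) + 1.0 / self.sample_rate

    def get(self, product, location, week):
        # A missing counter is treated as zero.
        return self.counts.get((product, location, week), 0.0)


# Exact counting: three hits on one key create exactly one stored counter.
exact = SparseSampledCounters()
for _ in range(3):
    exact.hit("product-x", "location-y", "2013-W24")

# Sampled counting: roughly one write in ten, each worth ten hits.
sampled = SparseSampledCounters(sample_rate=0.1, rng=random.Random(42))
for _ in range(10_000):
    sampled.hit("product-x", "location-y", "2013-W24")
```

With this layout the storage cost follows the number of (product, location, week) combinations that actually occur, not the full cross product, which is why the trillions of theoretical permutations need not all exist.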