Subject: Re: data model for unique users in a time period
From: Ed Anuff
To: user@cassandra.apache.org
Date: Mon, 31 Oct 2011 11:08:03 -0700

Thanks, good point, splitting wide rows via sharding is a good optimization for the get_count approach.
On Mon, Oct 31, 2011 at 10:58 AM, Zach Richardson wrote:
> Ed,
>
> I could be completely wrong about this working--I haven't specifically
> looked at how the counts are executed, but I think this makes sense.
>
> You could potentially shard across several rows, using a hash of
> the username combined with the time period as the row key. Run a
> count across each row and then add them up. If your cluster is large
> enough, this could spread the computation enough to make each query for
> the count a bit faster.
>
> Depending on how often this query would be hit, I would still
> recommend caching, but you could recalculate the real count a little
> more often.
>
> Zach
>
>
> On Mon, Oct 31, 2011 at 12:22 PM, Ed Anuff wrote:
>> I'm looking at how to keep track of the number of
>> unique visitors within a given time period. Inserting user ids into a
>> wide row would give me a list of every user within the time
>> period that the row represented. My experience in the past was that
>> using get_count on a row to get the column count got slow pretty quickly,
>> but it might still be the easiest way to get the count of unique
>> users, with some sort of caching of the count so that it's not
>> expensive on subsequent queries. Using Hadoop is overkill for this scenario.
>> Any other approaches?
>>
>> Ed
>>
>
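A minimal sketch of the sharding scheme described above, in Python. A plain dict stands in for the Cassandra column family, and the shard count, row-key format, and function names are all illustrative assumptions, not anything from the thread; a real deployment would issue the equivalent insert and get_count calls through a Cassandra client.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count; tune to cluster size


def shard_row_key(user_id: str, period: str) -> str:
    """Derive a shard row key from a hash of the user id plus the time period."""
    shard = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SHARDS
    return "unique_users:%s:%d" % (period, shard)


# Stand-in for a Cassandra column family: row key -> {column name: value}.
rows = {}


def record_visit(user_id: str, period: str) -> None:
    # Writing the user id as a column name deduplicates repeat visits,
    # since rewriting the same column is idempotent.
    rows.setdefault(shard_row_key(user_id, period), {})[user_id] = ""


def unique_user_count(period: str) -> int:
    # Equivalent of running get_count on each shard row and summing;
    # the per-row counts are independent, so they can run in parallel
    # and land on different nodes in the cluster.
    return sum(
        len(rows.get("unique_users:%s:%d" % (period, s), {}))
        for s in range(NUM_SHARDS)
    )


record_visit("alice", "2011-10-31")
record_visit("bob", "2011-10-31")
record_visit("alice", "2011-10-31")  # repeat visit, still one unique user
print(unique_user_count("2011-10-31"))  # 2
```

As in Zach's suggestion, the sum itself can be cached and recomputed periodically rather than on every read.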