From: Sam Bisbee
To: user@couchdb.apache.org
Cc: sandy
Date: Mon, 12 Sep 2011 14:02:40 -0400
Subject: Re: Using couchdb for analytics

Hi,

The first part of my answer is not CouchDB specific.
All of the big analytics systems that I have ever built or seen at my clients' have used queues. As you know, analytics can have such a high write rate that you would be crazy to try to persist each transaction to disk as it arrives (which is what databases do). Instead, send the transactions to a queue where they can sit until you consume them at your leisure. If you don't want to host your own queue, take a look at Amazon Simple Queue Service.

Now, for the CouchDB part. Have each transaction be its own document. Yes, even if you are tracking the same type of action for the same resource (URL). You no longer live in a locking world, so this is the most straightforward approach. Now you can build views that use actions, resources, or whatever other piece of data you want. More information at http://guide.couchdb.org/draft/recipes.html

Given the write rate of analytics systems, you would be right to worry about view build time. That's why you have the queue: you can control the write rate into CouchDB. You can also build views just once per night (or whatever), and ALWAYS query with ?stale=ok so you don't kick off a view build at read time.

There are a bunch more land mines, but these are the basics and should get you on your way. :)

--
Sam Bisbee

On Thu, Jun 2, 2011 at 5:34 AM, maku@makuchaku.in wrote:
> Hi everyone,
>
> I came across couchdb a couple of weeks back & got really excited by
> the fundamental change it brings by simply taking the app-server out
> of the picture.
> Must say, kudos to the dev team!
>
> I am planning to write a quick analytics solution for my website -
> something on the lines of Google analytics - which will measure
> certain properties of the visitors hitting our site.
>
> Since this is my first attempt at a JSON style document store, I
> thought I'll share the architecture & see if I can make it better (or
> correct my mistakes before I do them) :-)
>
> - For each unique visitor, create a document with his session_id as the doc.id
> - For each property i need to track about this visitor, I create a
> key-value pair in the doc created for this visitor
> - If visitor is a returning user, use the session_id to re-open his
> doc & keep on modifying the properties
> - At end of each calculation time period (say 1 hour or 24 hours), I
> run a cron job which fires the map-reduce jobs by requesting the views
> over curl/http.
>
> A couple of questions based on above architecture...
> We see concurrent traffic ranging from 2k users to 5k users.
> - Would a couchdb instance running on a good machine (say High CPU
> EC2, medium instance) work well with simultaneous writes happening...
> (visitors browsing, properties changing or getting created)
> - With a couple of million documents, would I be able to process my
> views without causing any significant impact to write performance?
>
> I think my questions might be biased by the fact that I come from a
> MySQL/Rails background... :-)
>
> Let me know how you guys think about this.
>
> Thanks in advance,
> --
> Mayank
> http://adomado.com
>
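The queue idea above can be sketched minimally: buffer incoming events in memory (or SQS, etc.) and flush them to CouchDB in batches shaped like its _bulk_docs payload, so the write rate into CouchDB stays under your control. The batch size, field names, and the `postBulk` call in the comment are illustrative assumptions, not a prescribed API.

```javascript
// Sketch: buffer events, then flush in batches shaped for
// POST /db/_bulk_docs ({ "docs": [...] }). The actual HTTP call is
// left as a comment; any HTTP client would do.

const queue = [];

function track(event) {
  queue.push(event); // cheap in-memory append; no disk write per hit
}

function flush(batchSize) {
  const payloads = [];
  while (queue.length > 0) {
    const batch = queue.splice(0, batchSize);
    payloads.push({ docs: batch }); // CouchDB bulk-insert payload shape
    // e.g. postBulk("http://localhost:5984/analytics/_bulk_docs", payloads[payloads.length - 1]);
  }
  return payloads;
}

// Usage: 5 events flushed in batches of 2 -> 3 bulk payloads.
for (let i = 0; i < 5; i++) track({ type: "pageview", url: "/home", n: i });
const payloads = flush(2);
console.log(payloads.length);
```

A nightly cron (as discussed above) could then hit the views with ?stale=ok after each flush window.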
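The one-document-per-transaction advice can likewise be sketched with a CouchDB-style map function. This is a local simulation only: `emit` is stubbed out, and the `type`/`url` fields are illustrative. In CouchDB the map function would live in a design document and the built-in `_count` reduce would do the per-key summing.

```javascript
// One document per transaction, as suggested above.
const events = [
  { _id: "evt-1", type: "pageview", url: "/home",    session: "s1" },
  { _id: "evt-2", type: "pageview", url: "/pricing", session: "s1" },
  { _id: "evt-3", type: "pageview", url: "/home",    session: "s2" },
];

// The map function as it could appear in a design document:
// emit one row per pageview, keyed by URL.
function map(doc) {
  if (doc.type === "pageview") {
    emit(doc.url, 1);
  }
}

// Stub of CouchDB's emit plus a grouped "_count"-style reduce,
// so the view output can be inspected locally.
const rows = {};
function emit(key, value) {
  rows[key] = (rows[key] || 0) + value;
}
events.forEach(map);

console.log(rows); // per-URL hit counts, e.g. "/home" -> 2
```

Querying the real view grouped by key (and with ?stale=ok, per the advice above) would return the same counts without triggering a rebuild at read time.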