From: Thomas Heller
To: user@cassandra.apache.org
Date: Sun, 13 Jun 2010 04:16:26 +0200
Subject: Beginner Assumptions

Hey,

I've been planning to play around with Cassandra for quite some time and finally got around to it. I like what I've seen and used so far a lot, but my SQL brain keeps popping up and trying to convince me that SQL is fine.

Anyway, I want to store some (a lot of) time series data in Cassandra and would like to check whether my assumptions so far are correct. If someone with operational experience could confirm them, I'd really appreciate it.

The structure I'm going for right now looks like this: one CF whose column names are LongType days (e.g. 20100612, 20100613, ...). Each column value is a simple time series: 24 integers (one counter per hour) packed into 96 bytes (24 x 4 bytes). I then have a lot of rows, each accumulating one column per day.
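The packing itself is nothing fancy; a minimal sketch in plain Python (the function names are mine, and I'm assuming unsigned 32-bit counters):

import struct

HOURS_PER_DAY = 24
# 24 unsigned 32-bit big-endian integers -> exactly 96 bytes per day.
FORMAT = ">%dI" % HOURS_PER_DAY

def pack_day(counters):
    """Pack 24 hourly counters into one 96-byte column value."""
    assert len(counters) == HOURS_PER_DAY
    return struct.pack(FORMAT, *counters)

def unpack_day(blob):
    """Unpack a 96-byte column value back into 24 hourly counters."""
    return list(struct.unpack(FORMAT, blob))

# Example: one view at hour 3, twelve views at hour 15.
counters = [0] * HOURS_PER_DAY
counters[3] = 1
counters[15] = 12
blob = pack_day(counters)
assert len(blob) == 96
assert unpack_day(blob) == counters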
Put in web analytics terms, I might count the number of views a page gets:

row:"/page/1" => cols:[20100612 => [12,34,...], 20100613 => [34,93,...], ...]
row:"/page/2" => cols:[20100612 => [1,...], ...]

Over a couple of years I would collect "millions" of rows, each with "hundreds" of columns. So:

Assumption #1: Only the row key decides where the data lives (via consistent hashing)? So every column of a row lives on the same node(s), which in turn makes querying for slices of columns fast (rough placement sketch in the PS below). I really need fast queries (it is fast in my tests, but I'm only working with a very small subset).

Assumption #2: Basically, the only query for this CF will always look like "get the last <n> days of data for <key>". I can just get a slice of columns using 'start' and 'count', and this would perform as fast as (or faster than) building the list of day keys on the client and doing a multiget (second PS sketch below)? Beware SQL! Translated to SQL (since this is what my brain does all the time):

SELECT data FROM time_series WHERE key = '/page/1' ORDER BY day DESC LIMIT 90;

vs

SELECT data FROM time_series WHERE key = '/page/1' AND day IN ('2010-06-13', '2010-06-12', ...);

vs

memcache.get(['20100613:/page/1', '20100612:/page/1', ...])

Assumption #3: Since the data grows at a fixed rate per row and only the number of rows varies, it should be simple enough to predict storage requirements (96 bytes x 365 days is roughly 35 KB per row per year, before column overhead and replication). Rows are "equally" distributed across the cluster (using RandomPartitioner), and should a node reach its capacity limit, the cluster can move rows to new nodes, making it easy to scale out. That's the point, right? :P

Assumption #4: I might update the current day's data multiple times until the day passes and the data becomes immutable. It is OK for clients to see old data, but the data must be "correct" at some point (eventually consistent, ha!). This seems to be solved; it's just something the SQL devil on my shoulder keeps bugging me about.

I think I "got" it and will get my hands dirty soon; I just wanted to squash my last doubts. I've done this on Riak too, but I wasn't too happy with it. Cassandra feels "right", although it took some Jedi mind tricks to grasp SuperColumns.

Anyway, feedback is greatly appreciated.

Thanks,
/thomas
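PS: For Assumption #1, my mental model of RandomPartitioner placement is roughly the following (a simplified sketch, not Cassandra's actual code; the ring and node names are made up, and real node tokens are assigned, not hashed from node names):

import hashlib

# RandomPartitioner derives a token from the MD5 of the row key,
# so placement depends on the row key alone, never on column names.
def token(row_key):
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

# Made-up ring: each node owns the token range up to its own token.
ring = sorted([(token("node-a"), "node-a"),
               (token("node-b"), "node-b"),
               (token("node-c"), "node-c")])

def node_for(row_key):
    t = token(row_key)
    for node_token, name in ring:
        if t <= node_token:
            return name
    return ring[0][1]  # wrap around the ring

# Every day-column of "/page/1" lands on the same node(s),
# because only the row key is hashed.
print(node_for("/page/1"))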
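PPS: And for Assumption #2, here is roughly the slice call I'm testing with (a pycassa-style sketch from memory; the keyspace and CF names are placeholders and the exact signature may differ):

import pycassa

# Hypothetical keyspace and CF names; adjust to the real schema.
pool = pycassa.ConnectionPool('Analytics')
series = pycassa.ColumnFamily(pool, 'TimeSeries')

# One contiguous slice: the newest 90 day-columns of a single row.
# Column names are LongType days, so column_reversed=True walks them
# newest-first, i.e. the ORDER BY day DESC LIMIT 90 equivalent.
last_90 = series.get('/page/1', column_count=90, column_reversed=True)

# The alternative I'm comparing against: build the day names on the
# client and fetch exactly those columns.
days = [20100613, 20100612]  # ... 90 entries in total
explicit = series.get('/page/1', columns=days)

If the contiguous slice is as fast as (or faster than) the explicit column list, Assumption #2 holds.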