From: Thomas Heller
To: user@cassandra.apache.org
Date: Sun, 13 Jun 2010 04:16:26 +0200
Subject: Beginner Assumptions

Hey,

I've been planning to play around with Cassandra for quite some time and finally got around to it. I like what I've seen and used so far a lot, but my SQL brain keeps popping up and trying to convince me that SQL is fine.

Anyway, I want to store some (a lot of) time series data in Cassandra and would like to check whether my assumptions so far are correct. If someone with operational experience could confirm them, I'd really appreciate it.

The structure I'm going for right now looks like this: one CF whose column names are LongType days (e.g. 20100612, 20100613, ...). Each column value is a simple time series: 24 integers (one counter per hour) packed into 96 bytes (24 x 4 bytes). I then have a lot of rows, each accumulating one column per day.
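The packing itself is nothing fancy; a minimal sketch in plain Python (the function names are mine, and I'm assuming unsigned 32-bit counters):

import struct

HOURS_PER_DAY = 24
# 24 unsigned 32-bit big-endian integers -> exactly 96 bytes per day.
FORMAT = ">%dI" % HOURS_PER_DAY

def pack_day(counters):
    """Pack 24 hourly counters into one 96-byte column value."""
    assert len(counters) == HOURS_PER_DAY
    return struct.pack(FORMAT, *counters)

def unpack_day(blob):
    """Unpack a 96-byte column value back into 24 hourly counters."""
    return list(struct.unpack(FORMAT, blob))

# Example: one view at hour 3, twelve views at hour 15.
counters = [0] * HOURS_PER_DAY
counters[3] = 1
counters[15] = 12
blob = pack_day(counters)
assert len(blob) == 96
assert unpack_day(blob) == counters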
Put in web analytics terms, I might count the number of views a page gets:

row:"/page/1" => cols:[20100612 => [12,34,...], 20100613 => [34,93,...], ...]
row:"/page/2" => cols:[20100612 => [1,...], ...]

Over a couple of years I would collect "millions" of rows, each with "hundreds" of columns. So:

Assumption #1: Only the row key decides where the data lives (via consistent hashing)? So every column of a row lives on the same node(s), which in turn makes querying for slices of columns fast (rough placement sketch in the PS below). I really need fast queries (it is fast in my tests, but I'm only working with a very small subset).

Assumption #2: Basically, the only query for this CF will always look like "get the last <n> days of data for <key>". I can just get a slice of columns using 'start' and 'count', and this would perform as fast as (or faster than) building the list of day keys on the client and doing a multiget (second PS sketch below)? Beware SQL! Translated to SQL (since this is what my brain does all the time):

SELECT data FROM time_series WHERE key = '/page/1' ORDER BY day DESC LIMIT 90;

vs

SELECT data FROM time_series WHERE key = '/page/1' AND day IN ('2010-06-13', '2010-06-12', ...);

vs

memcache.get(['20100613:/page/1', '20100612:/page/1', ...])

Assumption #3: Since the data grows at a fixed rate per row and only the number of rows varies, it should be simple enough to predict storage requirements (96 bytes x 365 days is roughly 35 KB per row per year, before column overhead and replication). Rows are "equally" distributed across the cluster (using RandomPartitioner), and should a node reach its capacity limit, the cluster can move rows to new nodes, making it easy to scale out. That's the point, right? :P

Assumption #4: I might update the current day's data multiple times until the day passes and the data becomes immutable. It is OK for clients to see old data, but the data must be "correct" at some point (eventually consistent, ha!). This seems to be solved; it's just something the SQL devil on my shoulder keeps bugging me about.

I think I "got" it and will get my hands dirty soon; I just wanted to squash my last doubts. I've done this on Riak too, but I wasn't too happy with it. Cassandra feels "right", although it took some Jedi mind tricks to grasp SuperColumns.

Anyway, feedback is greatly appreciated.

Thanks,
/thomas
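PS: For Assumption #1, my mental model of RandomPartitioner placement is roughly the following (a simplified sketch, not Cassandra's actual code; the ring and node names are made up, and real node tokens are assigned, not hashed from node names):

import hashlib

# RandomPartitioner derives a token from the MD5 of the row key,
# so placement depends on the row key alone, never on column names.
def token(row_key):
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

# Made-up ring: each node owns the token range up to its own token.
ring = sorted([(token("node-a"), "node-a"),
               (token("node-b"), "node-b"),
               (token("node-c"), "node-c")])

def node_for(row_key):
    t = token(row_key)
    for node_token, name in ring:
        if t <= node_token:
            return name
    return ring[0][1]  # wrap around the ring

# Every day-column of "/page/1" lands on the same node(s),
# because only the row key is hashed.
print(node_for("/page/1"))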
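PPS: And for Assumption #2, here is roughly the slice call I'm testing with (a pycassa-style sketch from memory; the keyspace and CF names are placeholders and the exact signature may differ):

import pycassa

# Hypothetical keyspace and CF names; adjust to the real schema.
pool = pycassa.ConnectionPool('Analytics')
series = pycassa.ColumnFamily(pool, 'TimeSeries')

# One contiguous slice: the newest 90 day-columns of a single row.
# Column names are LongType days, so column_reversed=True walks them
# newest-first, i.e. the ORDER BY day DESC LIMIT 90 equivalent.
last_90 = series.get('/page/1', column_count=90, column_reversed=True)

# The alternative I'm comparing against: build the day names on the
# client and fetch exactly those columns.
days = [20100613, 20100612]  # ... 90 entries in total
explicit = series.get('/page/1', columns=days)

If the contiguous slice is as fast as (or faster than) the explicit column list, Assumption #2 holds.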