From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: Data modeling advice (time series)
Date: Wed, 2 May 2012 13:32:09 +1200
Message-Id: <86058145-2BDC-4B13-A5F2-AD8ABC6D26E3@thelastpickle.com>

I would try to avoid 100's of MB's per row. It will take longer to
compact and repair.

10's is fine. Take a look at in_memory_compaction_limit and
thrift_frame_size in the yaml file for some guidance.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 2/05/2012, at 6:00 AM, Aaron Turner wrote:

> On Tue, May 1, 2012 at 10:20 AM, Tim Wintle <timwintle@gmail.com> wrote:
>> I believe that the general design for time-series schemas looks
>> something like this (correct me if I'm wrong):
>>
>> (storing time series for X dimensions for Y different users)
>>
>> Row Keys: "{USER_ID}_{TIMESTAMP/BUCKETSIZE}"
>> Columns:  "{DIMENSION_ID}_{TIMESTAMP%BUCKETSIZE}" -> {Counter}
>>
>> But I've not found much advice on calculating optimal bucket sizes (i.e.
>> optimal number of columns per row), and how that decision might be
>> affected by compression (or how significant the performance differences
>> between the two options might be).
>>
>> Are the calculations here still considered valid (proportionally) in
>> 1.X, with the changes to SSTables, or is it significantly different?
>>
>> <http://btoddb-cass-storage.blogspot.co.uk/2011/07/column-overhead-and-sizing-every-column.html>
>
> Tens or a few hundred MB per row seems reasonable.  You could do
> thousands of MB if you wanted to, but that can make things harder to
> manage.
>
> Depending on the size of your data, you may find that the overhead of
> each column becomes significant; far more than the per-row overhead.
> Since all of my data is just 64-bit integers, I ended up taking a day's
> worth of values (288/day @ 5min intervals) and storing it as a single
> column as a vector.  Hence I have two CFs:
>
> StatsDaily       -- each row == 1 day, each column == 1 stat @ 5min intervals
> StatsDailyVector -- each row == 1 year, each column == 288 stats @ 1
>                     day intervals
>
> Every night a job kicks off and converts each row's worth of
> StatsDaily into a column in StatsDailyVector.  By doing it 1:1 this
> way, I also reduce the number of tombstones I need to write in
> StatsDaily since I only need one tombstone for the row delete, rather
> than 288 for each column deleted.
>
> I don't use compression.
>
> --
> Aaron Turner
> http://synfin.net/           Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>     -- Benjamin Franklin
> "carpe diem quam minimum credula postero"
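[Editor's note] The two knobs Aaron refers to live in conf/cassandra.yaml under slightly longer names in the 1.x line. The fragment below shows what I believe were the shipped defaults at the time; the exact names and values should be checked against your Cassandra version:

```yaml
# conf/cassandra.yaml (Cassandra 1.x) -- names and defaults as recalled,
# not copied from a specific release.

# Rows larger than this limit are compacted on disk in a slower two-pass
# mode instead of in memory -- one concrete reason to keep rows to tens
# of MB rather than hundreds.
in_memory_compaction_limit_in_mb: 64

# Upper bound on a single Thrift frame, which caps how much of a row a
# client can read or write in one request.
thrift_framed_transport_size_in_mb: 15
```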
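[Editor's note] The bucketed row-key scheme Tim describes can be sketched in a few lines. The key format strings come from the thread; the one-day bucket size, helper names, and sample values are illustrative assumptions:

```python
# Sketch of the "{USER_ID}_{TIMESTAMP/BUCKETSIZE}" row key and
# "{DIMENSION_ID}_{TIMESTAMP%BUCKETSIZE}" column name from the thread.
# BUCKET_SIZE and the helper names are hypothetical choices.

BUCKET_SIZE = 86400  # one day of seconds per row (assumed bucket size)

def row_key(user_id, timestamp):
    """Row key: one row per user per time bucket."""
    return "%s_%d" % (user_id, timestamp // BUCKET_SIZE)

def column_name(dimension_id, timestamp):
    """Column name: the sample's offset within its bucket."""
    return "%s_%d" % (dimension_id, timestamp % BUCKET_SIZE)

ts = 1335892802  # Tue, 1 May 2012 UTC
print(row_key("user42", ts))    # user42_15461
print(column_name("cpu", ts))   # cpu_62402
```

With a one-day bucket, every sample for a user lands in the same row for that day, and the column offset orders samples within the day, so a day of data is one contiguous slice read.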