incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Data modeling advice (time series)
Date Wed, 02 May 2012 01:32:09 GMT
I would try to avoid hundreds of MB per row. Rows that large take longer to compact and repair.

Tens of MB is fine. Take a look at in_memory_compaction_limit_in_mb and
thrift_framed_transport_size_in_mb in the yaml file for some guidance.
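
As a rough way to check a planned bucket size against that guidance, the sketch below estimates the size of one row. The 15-byte per-column overhead and the 64 MB limit are assumptions, not figures from this thread; check them against your own version and yaml. The function names are hypothetical.

```python
# Rough row-size estimate for a bucketed time-series row.
# ASSUMPTIONS (not from the thread): about 15 bytes of fixed overhead per
# column and a 64 MB in-memory compaction limit. Verify both locally.
PER_COLUMN_OVERHEAD_BYTES = 15
IN_MEMORY_COMPACTION_LIMIT_MB = 64


def estimate_row_bytes(columns, name_bytes, value_bytes,
                       overhead=PER_COLUMN_OVERHEAD_BYTES):
    """Return an approximate serialized size for one row, in bytes."""
    return columns * (name_bytes + value_bytes + overhead)


def fits_in_memory_compaction(row_bytes,
                              limit_mb=IN_MEMORY_COMPACTION_LIMIT_MB):
    """True if the row is below the assumed in-memory compaction limit."""
    return row_bytes < limit_mb * 1024 * 1024


if __name__ == "__main__":
    # Hypothetical workload: 1,000 dimensions sampled every 5 minutes for
    # one day, with 16-byte column names and 8-byte values.
    size = estimate_row_bytes(columns=1000 * 288, name_bytes=16, value_bytes=8)
    print(size, fits_in_memory_compaction(size))
```

This ignores per-row overhead, indexes and compression, so treat the result as an order-of-magnitude guide only.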

Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 2/05/2012, at 6:00 AM, Aaron Turner wrote:

> On Tue, May 1, 2012 at 10:20 AM, Tim Wintle <timwintle@gmail.com> wrote:
>> I believe that the general design for time-series schemas looks
>> something like this (correct me if I'm wrong):
>> 
>> (storing time series for X dimensions for Y different users)
>> 
>> Row Keys:  "{USER_ID}_{TIMESTAMP/BUCKETSIZE}"
>> Columns: "{DIMENSION_ID}_{TIMESTAMP%BUCKETSIZE}" -> {Counter}
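
A minimal sketch of the key scheme quoted above, assuming integer timestamps and bucket sizes in seconds. The function and argument names are hypothetical and not from any client library.

```python
def bucket_keys(user_id, dimension_id, timestamp, bucket_size):
    """Map one sample to (row_key, column_name) using the scheme above.

    timestamp and bucket_size are assumed to be whole seconds.
    """
    bucket = timestamp // bucket_size  # TIMESTAMP / BUCKETSIZE, integer division
    offset = timestamp % bucket_size   # TIMESTAMP % BUCKETSIZE
    row_key = f"{user_id}_{bucket}"
    column_name = f"{dimension_id}_{offset}"
    return row_key, column_name


if __name__ == "__main__":
    # One-day buckets; 1335922329 is 2012-05-02 01:32:09 GMT.
    print(bucket_keys(42, "cpu", 1335922329, 86400))
```

With string column names the offsets sort lexically rather than numerically, so zero-padding the offset (or using a numeric or composite comparator) may be needed if slice order matters.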
>> 
>> But I've not found much advice on calculating optimal bucket sizes (i.e.
>> optimal number of columns per row), and how that decision might be
>> affected by compression (or how significant the performance differences
>> between the two options might be).
>> 
>> Are the calculations here still considered valid (proportionally) in
>> 1.X, with the changes to SSTables, or is it significantly different?
>> 
>> <http://btoddb-cass-storage.blogspot.co.uk/2011/07/column-overhead-and-sizing-every-column.html>
> 
> 
> Tens or a few hundred MB per row seems reasonable.  You could do
> thousands of MB if you wanted to, but that can make things harder to
> manage.
> 
> Depending on the size of your data, you may find that the overhead of
> each column becomes significant; far more than the per-row overhead.
> Since all of my data is just 64-bit integers, I ended up taking a day's
> worth of values (288/day @ 5min intervals) and storing it as a single
> column as a vector.  Hence I have two CFs:
> 
> StatsDaily  -- each row == 1 day, each column = 1 stat @ 5min intervals
> StatsDailyVector -- each row == 1 year, each column = 288 stats @ 1
> day intervals
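
One way such a vector column could be encoded is sketched below with Python's struct module. The big-endian, signed layout is an assumption; the thread does not say how the 288 values are actually serialized.

```python
import struct

SAMPLES_PER_DAY = 288  # 24 hours * 12 five-minute intervals

# ASSUMPTION: big-endian signed 64-bit integers, stored back to back.
_DAY_FORMAT = f">{SAMPLES_PER_DAY}q"


def pack_day(values):
    """Pack one day's 288 samples into a single column value (bytes)."""
    if len(values) != SAMPLES_PER_DAY:
        raise ValueError(f"expected {SAMPLES_PER_DAY} values, got {len(values)}")
    return struct.pack(_DAY_FORMAT, *values)


def unpack_day(blob):
    """Inverse of pack_day: return the samples as a list of ints."""
    return list(struct.unpack(_DAY_FORMAT, blob))
```

The packed value is 288 * 8 = 2304 bytes and pays the per-column overhead once instead of 288 times.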
> 
> Every night a job kicks off and converts each row's worth of
> StatsDaily into a column in StatsDailyVector.  By doing it 1:1 this
> way, I also reduce the number of tombstones I need to write in
> StatsDaily since I only need one tombstone for the row delete, rather
> than 288, one for each column deleted.
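
The nightly conversion described above could look roughly like the sketch below. Plain dicts stand in for the two column families, and the key names and byte layout are hypothetical; a real job would go through a Cassandra client instead.

```python
import struct


def roll_up_day(stats_daily, stats_daily_vector, day_key, year_key):
    """Move one StatsDaily row into one StatsDailyVector column.

    Both stores are dicts shaped like {row_key: {column_name: value}}.
    """
    row = stats_daily[day_key]
    # Order samples by column name (the 5-minute offset within the day).
    ordered = [row[offset] for offset in sorted(row)]
    # ASSUMPTION: big-endian signed 64-bit integers, stored back to back.
    blob = struct.pack(f">{len(ordered)}q", *ordered)
    stats_daily_vector.setdefault(year_key, {})[day_key] = blob
    # A single row-level delete replaces one delete per column.
    del stats_daily[day_key]
```

Deleting the whole source row at the end is what keeps the tombstone count at one per day rather than one per sample.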
> 
> I don't use compression.
> 
> 
> 
> -- 
> Aaron Turner
> http://synfin.net/         Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>     -- Benjamin Franklin
> "carpe diem quam minimum credula postero"

