incubator-cassandra-user mailing list archives

From Aaron Turner <synfina...@gmail.com>
Subject Re: Data modeling advice (time series)
Date Wed, 02 May 2012 15:40:54 GMT
On Wed, May 2, 2012 at 8:22 AM, Tim Wintle <timwintle@gmail.com> wrote:
> On Tue, 2012-05-01 at 11:00 -0700, Aaron Turner wrote:
>> Tens or a few hundred MB per row seems reasonable.  You could do
>> thousands of MB if you wanted to, but that can make things harder to
>> manage.
>
> Thanks (both Aarons)
>
>> Depending on the size of your data, you may find that the overhead of
>> each column becomes significant; far more than the per-row overhead.
>> Since all of my data is just 64-bit integers, I ended up taking a day's
>> worth of values (288/day @ 5-min intervals) and storing it as a single
>> column as a vector.
>
> By "vector" do you mean a raw binary array of long ints?

Yep.  I've also done a few small optimizations for when an entire day's
data is 0, etc.
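For the curious, the packing is nothing fancy.  A rough Python sketch of
what I mean (function names and the empty-blob zero-day trick are just
for illustration, not my exact code):

```python
import struct

SLOTS_PER_DAY = 288  # one sample every 5 minutes

def pack_day(values):
    """Pack a day's worth of 64-bit values into one column value.

    An all-zero day collapses to an empty byte string as a cheap
    special case (one example of the small optimizations mentioned).
    """
    assert len(values) == SLOTS_PER_DAY
    if not any(values):
        return b""
    # big-endian, 288 signed 64-bit ints -> 2304 bytes per column
    return struct.pack(">%dq" % SLOTS_PER_DAY, *values)

def unpack_day(blob):
    """Inverse of pack_day: expand a column value back into a list."""
    if blob == b"":
        return [0] * SLOTS_PER_DAY
    return list(struct.unpack(">%dq" % SLOTS_PER_DAY, blob))
```

So instead of paying Cassandra's per-column overhead 288 times per day,
you pay it once, and the values themselves take a flat 8 bytes each.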

> That sounds very nice for reducing overhead - but I'd like to work
> with counters (I was going to rely on them for streaming "real-time"
> updates).

I was going to use counters for aggregates... but I ended up doing all
the work in the client and storing them the same way as individual
data sources.  It depends on what you're counting, really.  Basically with
counters, if you get an error incrementing one, you have no idea if
the value changed or not.  There are other issues too, which have been
discussed here on the list and should be in the archives.  Not a big deal
if you're just counting the number of times people have clicked
"Like", but if you're building network traffic aggregates and you fail
to include or double-count a 10-slot switch full of 10Gbps ports, your
graphs end up looking really bad!
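The core problem is that an increment isn't idempotent, while
overwriting a client-computed value is.  A toy illustration (plain
Python, no Cassandra; the dict just stands in for a stored row):

```python
# A counter increment retried after an ambiguous failure (e.g. a
# timeout) can double-count; rewriting a client-computed total cannot.
store = {"total": 0}

def increment(key, delta):
    # Not idempotent: if the first attempt actually landed before the
    # error, a blind retry applies delta twice.
    store[key] += delta

def overwrite(key, value):
    # Idempotent: retrying after an unknown failure is harmless,
    # since the end state is the same either way.
    store[key] = value

overwrite("total", 100)
overwrite("total", 100)  # retry after a timeout: still 100
```

That's why computing aggregates in the client and writing them as
regular columns sidesteps the "did my increment land?" question.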

> Is that why you've got the two CFs described below (to have an archived
> summary and a live version that can have counters), or do you have no
> contention over writes/increments for individual values?

Basically, if I inserted data as it came in as a vector, I'd have to do
a read for every write (read the current vector, then write a new
vector with the new value appended to it).  That would destroy
performance, hence the two CFs.  By doing it nightly, it's a lot more
efficient.
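To make the two-CF flow concrete, here's a rough sketch in plain Python
(dicts stand in for the column families; all names are made up):

```python
import struct

# "Live" CF: one cheap single-column insert per sample, all day long.
# (source, day) -> {slot: value}
live_cf = {}

# "Archive" CF: one packed vector per row, written once by the nightly job.
# (source, day) -> packed bytes
archive_cf = {}

def record_sample(source, day, slot, value):
    # Hot path: append-only insert, no read-before-write.
    live_cf.setdefault((source, day), {})[slot] = value

def roll_up(source, day, slots=288):
    # Nightly batch: read each live row once, write one packed column.
    samples = live_cf.get((source, day), {})
    values = [samples.get(i, 0) for i in range(slots)]  # fill gaps with 0
    archive_cf[(source, day)] = struct.pack(">%dq" % slots, *values)

record_sample("sw1:port1", "2012-05-01", 0, 1234)
record_sample("sw1:port1", "2012-05-01", 1, 5678)
roll_up("sw1:port1", "2012-05-01")
```

The read-modify-write still happens, but only once per row per day in a
batch job, instead of once per incoming sample.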

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"
