incubator-cassandra-user mailing list archives

From David <>
Subject Cassandra for live statistics aggregation ?
Date Mon, 10 May 2010 11:28:23 GMT

I am investigating the use of Cassandra to gather and aggregate simple 
statistics in real time from multiple sources, something quite similar 
to what is described there: I 
have a few questions about how to design a data model in Cassandra, 
especially w.r.t. the lack of atomic increment operations.

Simply said, I would have a server which would receive requests as follows:


And I would like to keep track of the number of such requests in time 
ranges ("how many requests / hour for the last few days"), for arbitrary 
combinations of {property1 : value1, ..., propertyN: valueN}. The goal 
is to cope with a few thousand requests / sec on the write side, and to 
get acceptable latency for queries (ideally ~ 1 sec/query for a few 
queries / sec).

Given those constraints, I am considering time-based buckets, 
where I would count the number of requests for each property combination 
on an hourly basis, a daily basis, etc.
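
A minimal sketch of the bucketing step, assuming request times are Unix 
timestamps in seconds and that buckets are aligned on UTC boundaries. The 
names BUCKET_SECONDS and bucket_start are hypothetical and not part of any 
Cassandra API:

```python
# Hypothetical bucket sizes in seconds; only "hour" and "day" are shown,
# coarser granularities would be added the same way.
BUCKET_SECONDS = {"hour": 3600, "day": 86400}


def bucket_start(ts, granularity):
    """Return the Unix timestamp at which the bucket containing ts begins."""
    size = BUCKET_SECONDS[granularity]
    # Truncate the timestamp down to a multiple of the bucket size.
    return ts - (ts % size)


# 1273490903 is Mon, 10 May 2010 11:28:23 GMT.
hour_bucket = bucket_start(1273490903, "hour")  # start of the 11:00 hour
day_bucket = bucket_start(1273490903, "day")    # start of 10 May 2010
```

Every request falling in the same hour (or day) maps to the same bucket 
timestamp, which is the value the counts would be stored under.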

I can see a few ways to model this:

	- the most obvious one: prefixing the key with the timestamp and using 
key-range queries. Unfortunately, this seems to require an ordered 
partitioner, which sounds like a bad idea here, as all writes would land on 
one node at any given time.
	- another solution I can think of is to keep one column family per bucket 
size (one for daily counts, etc.). The key would be 
bucket_id:hash({property1: value1, ...}), and the columns would be the 
corresponding timestamps for this bucket and set of properties. This is 
easy to write and to read for the queries I care about. Problem: I understand 
that Cassandra scales to a few million columns per row, and this solution 
may require many more for buckets that are coarser than a day.
	- more involved: using a timestamp-based key but implementing my own 
partitioner to partition the writes.
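
The second option above can be sketched as follows. This is only an 
in-memory model of the layout, not Cassandra client code: the nested dict 
stands in for the column families, bucket_id is assumed to be the 
granularity name, MD5 over a sorted JSON encoding is an assumed choice of 
hash, and all function names are hypothetical.

```python
import hashlib
import json
from collections import defaultdict

BUCKET_SECONDS = {"hour": 3600, "day": 86400}

# Stand-in for one column family per granularity:
# {granularity: {row_key: {bucket_timestamp: count}}}
column_families = defaultdict(lambda: defaultdict(dict))


def row_key(granularity, properties):
    """Build 'bucket_id:hash(properties)', independent of property order."""
    canonical = json.dumps(properties, sort_keys=True)
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return f"{granularity}:{digest}"


def record_request(ts, properties):
    """Count one request in every granularity's column family."""
    for granularity, size in BUCKET_SECONDS.items():
        column = ts - (ts % size)  # column name = start of the time bucket
        row = column_families[granularity][row_key(granularity, properties)]
        # Read-modify-write: safe in this single-process model, but not
        # atomic against a store that lacks an increment operation.
        row[column] = row.get(column, 0) + 1


def counts(granularity, properties, start, end):
    """Return {bucket_timestamp: count} for start <= bucket_timestamp < end."""
    row = column_families[granularity].get(row_key(granularity, properties), {})
    return {c: n for c, n in sorted(row.items()) if start <= c < end}


props = {"property1": "value1", "property2": "value2"}
record_request(1273490903, props)
record_request(1273490910, props)
record_request(1273490903 + 3600, props)
hourly = counts("hour", props, 1273449600, 1273449600 + 86400)
```

A read for one property combination touches a single row and a slice of 
its columns, which is why this layout is easy to query. The increment in 
record_request is the step that the lack of atomic increments makes 
difficult once several writers update the same column.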

I am quite new to key-value stores, so there may be some other simple 
solutions to this problem. Most examples I found on the internet were 
too incomplete or assumed range queries over keys containing timestamps 
(thus requiring the ordered partitioner).
