cassandra-user mailing list archives

From aaron morton <>
Subject Re: data aggregation in Cassandra
Date Sun, 27 Mar 2011 21:27:12 GMT
You can do range-based queries inside one row. For example, one row holds all of the observations
for one day, and each observation is a column whose name (or at least the start of the
name) is the time of the observation. You can have up to 2 billion columns in one row, and the
column names are sorted according to the comparator you specify.
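
As a rough illustration of that layout, here is a toy model in plain Python (not Cassandra client code). A row is a set of columns kept sorted by name, and a range query is a slice between two column names. The class `DayRow`, its methods, and the choice of seconds since midnight as column names are all assumptions made up for this sketch.

```python
import bisect

class DayRow:
    """Toy model of one wide row: columns kept sorted by name."""

    def __init__(self):
        self._names = []   # sorted column names (observation times)
        self._values = []  # column values, aligned with _names

    def insert(self, name, value):
        # Keep columns ordered by name, as a comparator would.
        i = bisect.bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            self._values[i] = value  # overwrite an existing column
        else:
            self._names.insert(i, name)
            self._values.insert(i, value)

    def slice(self, start, finish):
        # Return columns with start <= name <= finish, in name order.
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return list(zip(self._names[lo:hi], self._values[lo:hi]))

row = DayRow()
# Observations arrive out of order; names are seconds since midnight.
for seconds, reading in [(36000, 4.0), (60, 1.5), (7200, 2.5), (3600, 3.0)]:
    row.insert(seconds, reading)

first_two_hours = row.slice(0, 7200)
```

Because the columns are already ordered, the slice only has to find the two end points rather than scan the whole row.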

If you were to use OPP and, say, use a timestamp for the key, it's going to be difficult to
balance the ring. The new writes will happen in the highest range of the ring, so they would
be concentrated in the last few nodes in your ring.
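
To make the hot spot concrete, here is a small simulation in plain Python, a sketch under stated assumptions rather than how Cassandra is implemented. It assumes a hypothetical 4 node ring, one timestamp key per minute, and node tokens laid out to balance the whole planned time range. MD5 stands in for the hash used by hashed placement.

```python
import bisect
import hashlib

NODES = 4             # hypothetical ring size
HASH_SPACE = 2 ** 128  # range of MD5 digests read as integers

def owner(token, node_tokens):
    # First node whose token is >= the key's token, wrapping around the ring.
    return bisect.bisect_left(node_tokens, token) % len(node_tokens)

def hashed_token(key):
    # Hashed placement: position depends on the digest, not on key order.
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

# Hypothetical timestamp keys, one per minute, over the planned time range.
start = 1300000000
all_keys = [start + 60 * i for i in range(4000)]
recent_keys = all_keys[3000:]  # the writes happening "now"

# Order-preserving placement: tokens split the planned range evenly.
ordered_tokens = [all_keys[(n + 1) * 1000 - 1] for n in range(NODES)]
ordered_counts = [0] * NODES
for k in recent_keys:
    ordered_counts[owner(k, ordered_tokens)] += 1

# Hashed placement: tokens split the hash range evenly.
hashed_tokens = [(n + 1) * HASH_SPACE // NODES - 1 for n in range(NODES)]
hashed_counts = [0] * NODES
for k in recent_keys:
    hashed_counts[owner(hashed_token(str(k)), hashed_tokens)] += 1
```

In this setup every recent write lands on the last node under order-preserving placement, while the hashed placement spreads the same keys across all four nodes.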

A lot depends on your work load. But I would recommend starting with the RP and partitioning
the data into rows based on something like a day. 
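
With day-sized rows, a time range query turns into a list of row keys to fetch, with a column slice on the first and last day. A minimal sketch of the key calculation, assuming row keys of the form `YYYYMMDD` in UTC (the key format and the helper name `day_row_keys` are made up for illustration):

```python
from datetime import datetime, timedelta, timezone

def day_row_keys(start, finish):
    """Row keys (one per UTC day, 'YYYYMMDD') covering start..finish inclusive."""
    day = start.date()
    keys = []
    while day <= finish.date():
        keys.append(day.strftime("%Y%m%d"))
        day += timedelta(days=1)
    return keys

start = datetime(2011, 3, 25, 7, 31, tzinfo=timezone.utc)
finish = datetime(2011, 3, 27, 21, 27, tzinfo=timezone.utc)
keys = day_row_keys(start, finish)
```

Under RP those row keys hash to different places on the ring, so the reads for a multi-day range are spread over the nodes.
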
Hope that helps.
On 27 Mar 2011, at 15:49, Saurabh Sehgal wrote:

> Thanks for the reply. The reason I want to go with OPP is to do range-based queries on
time. All queries against the data are going to be time based. With an RP partitioning scheme,
will it be efficient to do range-based queries?
> On Mar 26, 2011 9:12 PM, "aaron morton" <> wrote:
> > If you are using OPP you will need to understand how to balance the data around
the ring. Start with RP until you have an idea why it's not working for you. The RP will transform
the key with a hash function, which is then compared to the node tokens to locate the first
replica for the data. The OPP uses the raw key. see
> > 
> > 
> > Reading 20 to 30 million records will take a while. Perhaps look at
and for background. 
> > 
> > Consider how you can denormalise to support your queries, e.g. in a CF use keys
such as "attr1/value", with the column name as the timestamp and the column value as the data you need (you could
pack all the data you need into a structure like JSON).
> > 
> > CFs have a (potentially) large memory overhead. Use fewer of them and store mixed but related
content in each.
> > 
> > Hope that helps. 
> > Aaron
> > 
> > 
> > On 26 Mar 2011, at 05:38, Saurabh Sehgal wrote:
> > 
> >> Thanks for all the responses. 
> >> 
> >> My leading questions then are ->
> >> 
> >> - Should I go with the OrderPreservingPartitioner based on timestamps so I can
do time range queries - is this recommended ? any special cases regarding load balancing I
need to keep in mind ? I have read buzz over blogs/forums on how RandomPartitioner yields
better load balancing, and it is discouraged to use OrderPreservingPartitioner. Can someone
expand/comment on this ?
> >> 
> >> - Also, let's say I query all partitioned data between timestampuuid1 and timestampuuid2
(over several weeks). In my case this could return anywhere from 20 to 30 million
records. How would I go about aggregating this data "by hand"? Will this perform?
> >> 
> >> I am only interested in aggregating over a finite set of 10-20 attributes.
Does it make more sense to have a column family per attribute? In that case, I do
not need to do any aggregation, since all the data for that attribute resides in one column
family. Is there an upper bound to the number of column families Cassandra currently supports?
> >> 
> >> 
> >> 
> >> On Fri, Mar 25, 2011 at 7:31 AM, buddhasystem <> wrote:
> >> Hello Saurabh,
> >> 
> >> I have a similar situation, with a more complex data model, and I do an
> >> equivalent of map-reduce "by hand". The redeeming value is that you have
> >> complete freedom in how you hash, and you design the way you store indexes
> >> and similar structures. If there is a pattern in data store, you use it to
> >> your advantage. In the end, you get good performance.
> >> 
> > 
