incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: data aggregation in Cassandra
Date Sun, 27 Mar 2011 04:10:30 GMT
If you are using the OPP you will need to understand how to balance the data around the ring.
Start with the RP until you have an idea why it's not working for you. The RP transforms the
key with a hash function, and the result is compared to the node tokens to locate the first
replica for the data. The OPP uses the raw key. See http://wiki.apache.org/cassandra/Operations#Ring_management
and http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
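
To make the difference concrete, here is a rough sketch of how a key is matched to the first
replica under each partitioner. It is a toy model, not Cassandra's code: the names md5_token
and first_replica, the node names and the ring tokens are all made up for illustration, and the
hash is a simplified stand-in for what the RP really computes.

```python
import bisect
import hashlib

def md5_token(key):
    # Simplified stand-in for an RP-style token: an MD5 hash of the key
    # folded into the range 0 .. 2**127 - 1. (Assumption: not the exact
    # arithmetic Cassandra uses.)
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % (2 ** 127)

def first_replica(token, ring):
    # ring is a list of (node_token, node_name) sorted by node_token.
    # The first replica is the node with the smallest token >= the key's
    # token, wrapping around to the start of the ring if there is none.
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, token)
    return ring[i % len(ring)][1]

# Hypothetical three node ring with evenly spaced numeric tokens (RP style).
rp_ring = [(i * (2 ** 127) // 3, "node%d" % (i + 1)) for i in range(3)]

# Hypothetical three node ring with string tokens (OPP style, raw keys).
opp_ring = [("h", "node1"), ("p", "node2"), ("z", "node3")]

keys = ["2011-03-25T07:31", "2011-03-26T05:38", "2011-03-27T04:10"]

for k in keys:
    # RP: hash first, then look up. OPP: look up the raw key directly.
    print(k, first_replica(md5_token(k), rp_ring), first_replica(k, opp_ring))
```

With the OPP ring above every timestamp key sorts before "h", so all three land on node1. That
is the kind of imbalance you have to manage by hand when choosing tokens for the OPP.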

 
Reading 20 to 30 million records will take a while. Perhaps look at http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
and http://www.datastax.com/products/brisk for background. 

Consider how you can denormalise to support your queries, e.g. in a CF use row keys such as
"attr1/value", with the column name as the timestamp and the column value as the stuff you need
(you could pack all the data you need into a structure like JSON).
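
A minimal sketch of that layout, using plain Python dicts in place of a CF so it runs without a
cluster. The helper names (new_column_family, record_event, slice_row) and the sample
attributes are made up for illustration and are not a Cassandra client API.

```python
import json
from collections import defaultdict

def new_column_family():
    # Models a CF as {row_key: {column_name: column_value}}.
    return defaultdict(dict)

def record_event(cf, event, indexed_attrs, ts):
    # Write the same JSON payload once per indexed attribute, under a
    # row key of the form "attr/value" with the timestamp as column name.
    payload = json.dumps(event, sort_keys=True)
    for attr in indexed_attrs:
        row_key = "%s/%s" % (attr, event[attr])
        cf[row_key][ts] = payload

def slice_row(cf, row_key, start_ts, end_ts):
    # Return (timestamp, event) pairs from one row, ordered by timestamp,
    # for start_ts <= timestamp <= end_ts.
    row = cf.get(row_key, {})
    return [(ts, json.loads(row[ts]))
            for ts in sorted(row) if start_ts <= ts <= end_ts]

cf = new_column_family()
attrs = ["country", "browser"]
record_event(cf, {"country": "CA", "browser": "firefox", "bytes": 10}, attrs, 100)
record_event(cf, {"country": "CA", "browser": "chrome", "bytes": 5}, attrs, 200)
record_event(cf, {"country": "US", "browser": "chrome", "bytes": 7}, attrs, 300)

# Everything for country=CA in a time range is a slice of a single row.
hits = slice_row(cf, "country/CA", 0, 250)
total_bytes = sum(event["bytes"] for _, event in hits)
print(len(hits), total_bytes)
```

The point is that a time range query for one attribute value becomes a column slice on one row,
so you pay for the extra writes up front rather than scanning millions of records at read time.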

CFs have a (potentially) large memory overhead. Use fewer of them and store mixed but related
content in each.
  
Hope that helps. 
Aaron


On 26 Mar 2011, at 05:38, Saurabh Sehgal wrote:

> Thanks for all the responses. 
> 
> My leading questions then are ->
> 
> - Should I go with the OrderPreservingPartitioner based on timestamps so I can do time
> range queries - is this recommended ? any special cases regarding load balancing I need to
> keep in mind ? I have read buzz over blogs/forums on how RandomPartitioner yields better load
> balancing, and it is discouraged to use OrderPreservingPartitioner. Can someone expand/comment
> on this ?
> 
> - Also, lets say I query all partitioned data between timestampuuid1 and timestampuuid2
> (over several weeks) .. this would potentially , in my case, return anywhere to 20 - 30 million
> records. How would I go about aggregating this data "by hand" ? Will this perform ?
> 
> Since I am only interested in aggregating over a finite set of 10-20 attributes. Does
> it make more sense to have a column family per finite attribute ? In this case, I do not need
> to do any aggregation, since all the data for that attribute resides in one column family.
> Is there an upper bound to the number of column families Cassandra currently supports ?
> 
> 
> 
> On Fri, Mar 25, 2011 at 7:31 AM, buddhasystem <potekhin@bnl.gov> wrote:
> Hello Saurabh,
> 
> I have a similar situation, with a more complex data model, and I do an
> equivalent of map-reduce "by hand". The redeeming value is that you have
> complete freedom in how you hash, and you design the way you store indexes
> and similar structures. If there is a pattern in data store, you use it to
> your advantage. In the end, you get good performance.
> 
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/data-aggregation-in-Cassandra-tp6206994p6207879.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

