incubator-cassandra-user mailing list archives

From Saurabh Sehgal <saurabh....@gmail.com>
Subject Re: data aggregation in Cassandra
Date Sun, 27 Mar 2011 04:49:33 GMT
Thanks for the reply. The reason I want to go with OPP is to do range-based
queries on time. All queries against the data are going to be time based.
With the RP (RandomPartitioner) scheme, will it be efficient to do range-based
queries?
On Mar 26, 2011 9:12 PM, "aaron morton" <aaron@thelastpickle.com> wrote:
> If you are using OPP you will need to understand how to balance the data
around the ring; start with RP until you have an idea why it's not working
for you. The RP will transform the key with a hash function, which is then
compared to the node tokens to locate the first replica for the data. The
OPP uses the raw key. See
http://wiki.apache.org/cassandra/Operations#Ring_management and
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
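The placement rule described above can be sketched roughly as follows. This is a simplified model, not Cassandra's actual code: the function names, the example node tokens, and the timestamp-style keys are all assumptions made for illustration. It assumes RP derives a token from an MD5 hash of the key on a ring of size 2**127, and that the first replica is the node with the first token at or after the key's token, wrapping around the ring.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed token space for the hash-based (RP) case


def rp_token(key: str) -> int:
    """Hash the key into the ring, as a stand-in for what RP does."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def opp_token(key: str) -> str:
    """OPP uses the raw key as the token."""
    return key


def first_replica(token, node_tokens):
    """Index of the first node whose token is >= the given token.

    node_tokens must be sorted; positions past the last node wrap to node 0.
    """
    i = bisect.bisect_left(node_tokens, token)
    return i % len(node_tokens)


# Hypothetical four-node rings.
rp_nodes = [i * RING_SIZE // 4 for i in range(4)]
opp_nodes = ["2011-01", "2011-02", "2011-03", "2011-04"]

keys = ["2011-03-25T07:31:00", "2011-03-26T04:49:33", "2011-03-26T21:12:00"]

for k in keys:
    print(k,
          "RP node:", first_replica(rp_token(k), rp_nodes),
          "OPP node:", first_replica(opp_token(k), opp_nodes))
```

With time-ordered keys, every write in the current period goes to the same node under OPP, which is the balancing problem mentioned above; under RP the hash decides placement, so adjacent keys are no longer adjacent on the ring.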
>
>
> Reading 20 to 30 million records will take a while. Perhaps look at
http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011 and
http://www.datastax.com/products/brisk for background.
>
> Consider how you can denormalise to support your queries, e.g. in a CF
use row keys such as "attr1/value", with the column name as the time stamp and
the column value as the stuff you need (you could pack all the data you need
into a structure like JSON).
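As a rough illustration of that layout, here is an in-memory sketch. It is not Cassandra client code: the class name DenormalisedCF, its methods, and the sample attribute names are hypothetical. Each row key is "attr/value", columns inside a row are kept sorted by timestamp, and each column value is a JSON string, so a time range query becomes a slice within a single row.

```python
import bisect
import json
from collections import defaultdict


class DenormalisedCF:
    """Toy model of one column family: row key -> columns sorted by timestamp."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # row key -> sorted column names
        self._values = defaultdict(list)      # row key -> JSON values, same order

    def insert(self, attr, value, timestamp, payload):
        row_key = f"{attr}/{value}"
        i = bisect.bisect_right(self._timestamps[row_key], timestamp)
        self._timestamps[row_key].insert(i, timestamp)
        self._values[row_key].insert(i, json.dumps(payload))

    def slice(self, attr, value, start, end):
        """Return (timestamp, payload) pairs with start <= timestamp <= end."""
        row_key = f"{attr}/{value}"
        names = self._timestamps.get(row_key, [])
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, end)
        vals = self._values.get(row_key, [])
        return [(names[i], json.loads(vals[i])) for i in range(lo, hi)]


cf = DenormalisedCF()
cf.insert("country", "CA", 100, {"count": 1})
cf.insert("country", "CA", 300, {"count": 3})
cf.insert("country", "CA", 200, {"count": 2})
cf.insert("country", "US", 250, {"count": 9})

print(cf.slice("country", "CA", 150, 300))
```

Because the time ordering lives in the column names rather than in the row keys, this kind of range query does not depend on the partitioner preserving key order.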
>
> CFs have a (potentially) large memory overhead. Use fewer of them and store
mixed but related content in each.
>
> Hope that helps.
> Aaron
>
>
> On 26 Mar 2011, at 05:38, Saurabh Sehgal wrote:
>
>> Thanks for all the responses.
>>
>> My leading questions then are ->
>>
>> - Should I go with the OrderPreservingPartitioner based on timestamps so
I can do time range queries - is this recommended ? any special cases
regarding load balancing I need to keep in mind ? I have read buzz over
blogs/forums on how RandomPartitioner yields better load balancing, and it
is discouraged to use OrderPreservingPartitioner. Can someone expand/comment
on this ?
>>
>> - Also, let's say I query all partitioned data between timestampuuid1 and
timestampuuid2 (over several weeks). This would potentially, in my case,
return anywhere from 20 to 30 million records. How would I go about aggregating
this data "by hand"? Will this perform?
>>
>> Since I am only interested in aggregating over a finite set of 10-20
attributes, does it make more sense to have a column family per
attribute? In that case I would not need to do any aggregation, since all the
data for that attribute resides in one column family. Is there an upper
bound to the number of column families Cassandra currently supports?
>>
>>
>>
>> On Fri, Mar 25, 2011 at 7:31 AM, buddhasystem <potekhin@bnl.gov> wrote:
>> Hello Saurabh,
>>
>> I have a similar situation, with a more complex data model, and I do an
>> equivalent of map-reduce "by hand". The redeeming value is that you have
>> complete freedom in how you hash, and you design the way you store indexes
>> and similar structures. If there is a pattern in data store, you use it to
>> your advantage. In the end, you get good performance.
>>
>> --
>> View this message in context:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/data-aggregation-in-Cassandra-tp6206994p6207879.html
>> Sent from the cassandra-user@incubator.apache.org mailing list archive at
Nabble.com.
>
