cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: sensible data model ?
Date Mon, 06 Feb 2012 19:39:10 GMT
Sounds like a good start. Super columns are not a great fit for modeling time series data for
a few reasons; here is one: http://wiki.apache.org/cassandra/CassandraLimitations

It's also a good idea to partition time series data so that the rows do not grow too big.
You can have 2 billion columns in a row, but big rows have operational downsides.

You could go with either:

rows: <entity_id:date>
column: <property_name>

Which would mean that each time you query for a date range you need to query multiple rows. But
it is possible to get a range of columns / properties.
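
As a rough sketch of what the first layout means for a date range query, the snippet below builds one row key per day. The `row_keys` helper, the ':' separator and the ISO date format are assumptions for illustration only; they are not part of any Cassandra client API.

```python
from datetime import date, timedelta


def row_keys(entity_id, start, end):
    """Return one '<entity_id>:<YYYY-MM-DD>' row key per day in [start, end].

    The key format is an assumed encoding of the <entity_id:date> layout.
    """
    days = (end - start).days + 1
    return [
        "%s:%s" % (entity_id, (start + timedelta(days=n)).isoformat())
        for n in range(days)
    ]


# A 20 day query for a single entity touches 20 rows.
keys = row_keys("entity42", date(2012, 2, 1), date(2012, 2, 20))
print(len(keys), keys[0], keys[-1])
```

The number of rows read grows with both the number of entities and the number of days in the query.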

Or

rows: <entity_id:time_partition>
column: <date:property_name>

Where time_partition is something that makes sense in your problem domain, e.g. a calendar
month. If you often query for days in a month you can then get all the columns for the days
you are interested in (using a column range). If you only want a subset of the entity
properties you will need to get them all and filter them client side. Depending on the number
and size of the properties, this may be more efficient than multiple calls.
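
A minimal sketch of the second layout, assuming a calendar month as the time_partition, column names of the form '<YYYY-MM-DD>:<property_name>', and a column comparator that sorts names as plain ASCII/UTF-8 strings. All helper names here are hypothetical, and the code only builds keys and filters results; it does not talk to Cassandra.

```python
from datetime import date


def partition_key(entity_id, day):
    """Row key '<entity_id>:<YYYY-MM>' for the month containing `day`."""
    return "%s:%04d-%02d" % (entity_id, day.year, day.month)


def column_range(start, end):
    """Return (column_start, column_finish) covering every property on
    the days start..end inclusive.

    ';' is the character right after ':' in ASCII, so '<end>;' sorts
    after every '<end>:<property_name>' column name.
    """
    return start.isoformat() + ":", end.isoformat() + ";"


def filter_properties(columns, wanted):
    """Client side filter: keep only columns whose property is in `wanted`."""
    kept = {}
    for name, value in columns.items():
        _day, prop = name.split(":", 1)
        if prop in wanted:
            kept[name] = value
    return kept


key = partition_key("entity42", date(2012, 2, 6))
start, finish = column_range(date(2012, 2, 3), date(2012, 2, 5))
print(key, start, finish)
```

A date range that crosses a month boundary would need one such column range per partition row.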

One word of warning: avoid sending read requests for lots (i.e. hundreds) of rows at once, as it
will reduce overall query throughput. Some clients like pycassa take care of this for you.
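
If your client does not split large multi-row reads for you, one way to do it by hand is to chunk the row keys and issue one request per chunk. This is only a sketch: `batches` is a hypothetical helper, and the default batch size of 50 is an arbitrary assumption, not a recommended value.

```python
def batches(keys, size=50):
    """Yield successive slices of `keys`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(keys), size):
        yield keys[i:i + size]


# 230 row keys become five requests of at most 50 rows each.
all_keys = ["entity%d:2012-02" % n for n in range(230)]
sizes = [len(chunk) for chunk in batches(all_keys)]
print(sizes)
```

Each chunk would then be passed to whatever multi-row read call your client provides.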

Good luck. 
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/02/2012, at 12:12 AM, Franc Carter wrote:

> 
> Hi,
> 
> I'm pretty new to Cassandra and am currently doing a proof of concept, and thought it
> would be a good idea to ask if my data model is sane . . .
> 
> The data I have, and need to query, is reasonably simple. It consists of about 10 million
> entities, each of which has a set of key/value properties for each day for about 10 years.
> The number of keys is in the 50-100 range and there will be a lot of overlap for keys in <entity,days>
> 
> The queries I need to make are for sets of key/value properties for an entity on a day,
> e.g. key1, key2, key3 for 10 entities on 20 days. The number of entities and/or days in the
> query could be either very small or very large.
> 
> I've modeled this with a simple column family for the keys with the row key being the
> concatenation of the entity and date. My first go used only the entity as the row key and
> then used a supercolumn for each date. I decided against this mostly because it seemed more
> complex for a gain I didn't really understand.
> 
> Does this seem sensible ?
> 
> thanks
> 
> -- 
> Franc Carter | Systems architect | Sirca Ltd
> franc.carter@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118 
> Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
> 

