cassandra-user mailing list archives

From Franc Carter
Subject Re: sensible data model ?
Date Tue, 07 Feb 2012 01:28:32 GMT
On Tue, Feb 7, 2012 at 6:39 AM, aaron morton wrote:

> Sounds like a good start. Super columns are not a great fit for modeling
> time series data for a few reasons, here is one

None of those jump out at me as horrible for my case. If I modelled with
Super Columns I would have fewer than 10,000 Super Columns with an average
of 50 columns each - big but not insane ?

> It's also a good idea to partition time series data so that the rows do
> not grow too big. You can have 2 billion columns in a row, but big rows
> have operational down sides.
> You could go with either:
> rows: <entity_id:date>
> column: <property_name>
> Which would mean each time you query for a date range you need to query
> multiple rows. But it is possible to get a range of columns / properties.
> Or
> rows: <entity_id:time_partition>
> column: <date:property_name>

That's an interesting idea - I'll talk to the data experts to see if we
have a sensible range.
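
A minimal sketch of how the two layouts quoted above might build their row
keys and column names. The `:` separator, the ISO date format, the calendar
month as the time partition, and the function names are all assumptions made
for illustration, not anything fixed by Cassandra or pycassa.

```python
from datetime import date

def daily_layout(entity_id, day, prop):
    """Layout 1: one row per entity per day, one column per property."""
    row_key = f"{entity_id}:{day.isoformat()}"
    return row_key, prop

def partitioned_layout(entity_id, day, prop):
    """Layout 2: one row per entity per calendar month (assumed partition),
    with the day folded into the column name as date:property_name."""
    row_key = f"{entity_id}:{day.strftime('%Y-%m')}"
    column = f"{day.isoformat()}:{prop}"
    return row_key, column
```

With ISO dates the column names in the second layout sort chronologically
under a byte-wise or text comparator, which is what makes a column range
over days possible.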

> Where time_partition is something that makes sense in your problem domain,
> e.g. a calendar month. If you often query for days in a month you can then
> get all the columns for the days you are interested in (using a column
> range). If you only want to get a subset of the entity properties you will
> need to get them all and filter them client side; depending on the number
> and size of the properties this may be more efficient than multiple calls.

I'm fine with doing work on the client side - I have a bias in that
direction as it tends to scale better.
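
A rough sketch of the column range plus client-side filter described above,
assuming column names of the form `YYYY-MM-DD:property_name` compared as
plain text. `column_range` and `filter_properties` are hypothetical helpers,
and the dict stands in for whatever the client returns for one row.

```python
from datetime import date

def column_range(first_day, last_day):
    """Return (start, finish) column names covering every property on the
    days first_day..last_day inclusive."""
    start = first_day.isoformat() + ":"
    # ';' is the character right after ':' in ASCII, so this finish value
    # sorts after every 'last_day:<property>' column name.
    finish = last_day.isoformat() + ";"
    return start, finish

def filter_properties(columns, wanted):
    """Keep only the wanted properties from a {'date:property': value}
    mapping, regrouped as {date: {property: value}}."""
    result = {}
    for name, value in columns.items():
        day, _, prop = name.partition(":")
        if prop in wanted:
            result.setdefault(day, {})[prop] = value
    return result
```

The start and finish values would be passed as the column slice bounds of a
single row read; the filter then runs over the returned columns.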

> One word of warning: avoid sending read requests for lots (i.e. 100s) of
> rows at once, as it will reduce overall query throughput. Some clients like
> pycassa take care of this for you.

Because of request overhead ? I'm currently using the batch interface of
pycassa to do bulk reads. Is the same problem going to bite me if I have
many clients reading (using bulk reads) ? In production we will have ~50
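
One way to keep each read request small is to split the row keys into
batches on the client. The sketch below assumes a hypothetical `fetch`
callable that takes a list of row keys and returns a dict of rows; it does
not use any pycassa API, and the batch size of 50 is an arbitrary
illustration, not a recommended value.

```python
def chunked(keys, size):
    """Yield successive slices of at most `size` keys."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def read_in_batches(fetch, keys, batch_size=50):
    """Call fetch() once per batch of row keys and merge the results.

    fetch is a hypothetical callable: list of row keys -> {row_key: columns}.
    """
    rows = {}
    for batch in chunked(list(keys), batch_size):
        rows.update(fetch(batch))
    return rows
```

In real use `fetch` would wrap the client's multi-row read, so no single
request asks the cluster for hundreds of rows at once.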


> Good luck.
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> On 5/02/2012, at 12:12 AM, Franc Carter wrote:
> Hi,
> I'm pretty new to Cassandra and am currently doing a proof of concept, and
> thought it would be a good idea to ask if my data model is sane . . .
> The data I have, and need to query, is reasonably simple. It consists of
> about 10 million entities, each of which has a set of key/value properties
> for each day for about 10 years. The number of keys is in the 50-100 range
> and there will be a lot of overlap for keys in <entity,days>
> The queries I need to make are for sets of key/value properties for an
> entity on a day, e.g. key1, key2, key3 for 10 entities on 20 days. The number
> of entities and/or days in the query could be either very small or very
> large.
> I've modeled this with a simple column family for the keys, with the row
> key being the concatenation of the entity and date. My first go used only
> the entity as the row key and then used a supercolumn for each date. I
> decided against this mostly because it seemed more complex for a gain I
> didn't really understand.
> Does this seem sensible ?
> thanks
> --
> *Franc Carter* | Systems architect | Sirca Ltd
> Tel: +61 2 9236 9118
>  Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215


*Franc Carter* | Systems architect | Sirca Ltd
Tel: +61 2 9236 9118
Level 9, 80 Clarence St, Sydney NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215
