incubator-cassandra-user mailing list archives

From Franc Carter <franc.car...@sirca.org.au>
Subject Re: sensible data model ?
Date Tue, 07 Feb 2012 01:28:32 GMT
On Tue, Feb 7, 2012 at 6:39 AM, aaron morton <aaron@thelastpickle.com> wrote:

> Sounds like a good start. Super columns are not a great fit for modeling
> time series data for a few reasons; here is one:
> http://wiki.apache.org/cassandra/CassandraLimitations
>


None of those jump out at me as horrible for my case. If I modelled with
Super Columns I would have fewer than 10,000 Super Columns with an average
of 50 columns each - big but not insane ?


>
> It's also a good idea to partition time series data so that the rows do
> not grow too big. You can have 2 billion columns in a row, but big rows
> have operational down sides.
>
> You could go with either:
>
> rows: <entity_id:date>
> column: <property_name>
>
> Which would mean each time you query for a date range you need to query
> multiple rows. But it is possible to get a range of columns / properties.
>
> Or
>
> rows: <entity_id:time_partition>
> column: <date:property_name>
>

That's an interesting idea - I'll talk to the data experts to see if we
have a sensible range.
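
To make the trade-off between the two layouts concrete, here is a minimal sketch of the row keys each one needs for a single entity over a date range. It assumes a calendar month as the time partition, ISO-formatted dates and ":" as the key separator; the function names and the key format are illustrative and not part of any Cassandra or pycassa API.

```python
from datetime import date, timedelta


def daily_row_keys(entity_id, start, end):
    """Layout 1: one row per <entity_id:date>, so one key per day in the range."""
    days = (end - start).days + 1
    return ["%s:%s" % (entity_id, (start + timedelta(days=i)).isoformat())
            for i in range(days)]


def monthly_row_keys(entity_id, start, end):
    """Layout 2: one row per <entity_id:time_partition>.

    The partition is assumed to be a calendar month (YYYY-MM).
    """
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append("%s:%04d-%02d" % (entity_id, year, month))
        # Step to the next calendar month, rolling over the year in December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return keys
```

For 20 days that straddle a month boundary, the first layout touches 20 rows per entity and the second touches 2, at the cost of wider rows.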


>
> Where time_partition is something that makes sense in your problem domain,
> e.g. a calendar month. If you often query for days in a month, you can then
> get all the columns for the days you are interested in (using a column
> range). If you only want a subset of the entity properties, you will
> need to get them all and filter them client side; depending on the number
> and size of the properties, this may be more efficient than multiple calls.
>

I'm fine with doing work on the client side - I have a bias in that
direction as it tends to scale better.
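
A rough sketch of that client-side step, assuming column names of the form <date:property_name> with ISO dates (so that names sort chronologically under a text comparator) and ":" as the separator. Both helpers are hypothetical and work on plain dicts; they only show how a column range for a span of days could be built and how the returned columns could then be filtered by property.

```python
def column_range(first_day, last_day):
    """Return (start, finish) column names covering all properties on the days
    first_day..last_day, inclusive.

    ';' is the character right after ':' in ASCII, so "<last_day>;" sorts
    after every "<last_day>:<property>" column name.
    """
    return first_day + ":", last_day + ";"


def filter_properties(columns, wanted):
    """Keep only the wanted properties, grouped as {day: {property: value}}."""
    result = {}
    for name, value in columns.items():
        day, prop = name.split(":", 1)
        if prop in wanted:
            result.setdefault(day, {})[prop] = value
    return result
```

The (start, finish) pair is meant to be handed to whatever column-slice call the client offers; the filtering then happens entirely in the application.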


>
> One word of warning: avoid sending read requests for lots (i.e. hundreds) of
> rows at once, as it will reduce overall query throughput. Some clients, like
> pycassa, take care of this for you.
>

Because of request overhead ? I'm currently using the batch interface of
pycassa to do bulk reads. Is the same problem going to bite me if I have
many clients reading (using bulk reads) ? In production we will have ~50
clients.
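
One way to keep each request small, sketched under assumptions: split the row keys into fixed-size batches and issue one read per batch. Here `fetch` is a placeholder for any callable that takes a list of row keys and returns a dict of rows (a pycassa column family's `multiget` would be one candidate), and the batch size of 50 is an arbitrary illustration, not a recommendation from this thread.

```python
def chunked(keys, size):
    """Yield successive lists of at most `size` keys."""
    keys = list(keys)
    for i in range(0, len(keys), size):
        yield keys[i:i + size]


def read_in_batches(fetch, keys, batch_size=50):
    """Read many rows as several small requests instead of one large one.

    `fetch` is a hypothetical callable: list of row keys -> {row_key: columns}.
    """
    results = {}
    for batch in chunked(keys, batch_size):
        results.update(fetch(batch))
    return results
```

This caps the size of any single request; it does not change the total work the cluster does, so it says nothing on its own about the behaviour with ~50 concurrent clients.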

thanks


> Good luck.
>
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 5/02/2012, at 12:12 AM, Franc Carter wrote:
>
>
> Hi,
>
> I'm pretty new to Cassandra and am currently doing a proof of concept, and
> thought it would be a good idea to ask if my data model is sane . . .
>
> The data I have, and need to query, is reasonably simple. It consists of
> about 10 million entities, each of which has a set of key/value properties
> for each day for about 10 years. The number of keys is in the 50-100 range
> and there will be a lot of overlap for keys in <entity,days>
>
> The queries I need to make are for sets of key/value properties for an
> entity on a day, e.g. key1, key2, key3 for 10 entities on 20 days. The number
> of entities and/or days in the query could be either very small or very
> large.
>
> I've modeled this with a simple column family for the keys, with the row
> key being the concatenation of the entity and date. My first attempt used only
> the entity as the row key and then used a supercolumn for each date. I
> decided against this mostly because it seemed more complex for a gain I
> didn't really understand.
>
> Does this seem sensible ?
>
> thanks
>
> --
> *Franc Carter* | Systems architect | Sirca Ltd
>  <marc.zianideferranti@sirca.org.au>
> franc.carter@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118
>  Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
>
>
>


-- 
*Franc Carter* | Systems architect | Sirca Ltd
 <marc.zianideferranti@sirca.org.au>
franc.carter@sirca.org.au | www.sirca.org.au
Tel: +61 2 9236 9118
Level 9, 80 Clarence St, Sydney NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215
