cassandra-user mailing list archives

From Flavio Baronti
Subject Re: 1000's of column families
Date Fri, 28 Sep 2012 15:48:27 GMT
We had some serious trouble with dynamically adding CFs, although the last
time we tried we were using version 0.7, so maybe that's not an issue any more.
We had two problems:
- You are (were?) not supposed to add CFs concurrently. Since we had several
  servers talking to the same Cassandra cluster, we had to use distributed
  locks (Hazelcast) to serialize CF creation.
- You must be very careful about adding new CFs through different Cassandra
  nodes. If you do that fast enough, and the clocks of the two nodes are
  skewed, you will severely compromise your schema (Cassandra will not be able
  to tell in which order the updates must be applied).

As I said, this applied to version 0.7; current versions may have solved these problems.
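
To make the workaround for the two problems above concrete, here is a minimal
sketch, not our actual code. It assumes that every schema change goes through
one lock and is always sent to the same node. All names are hypothetical:
`SchemaCoordinator`, `create_cf` and `schema_node` are made up for illustration,
and `threading.Lock` is only an in-process stand-in for a cluster-wide lock
(we used Hazelcast for that).

```python
import threading


class SchemaCoordinator:
    """Serializes column family creation through one lock and one node.

    Hypothetical sketch: `lock` stands in for a cluster-wide lock and
    `create_cf` for whatever client call actually adds a CF.
    """

    def __init__(self, lock, create_cf, schema_node):
        self._lock = lock
        self._create_cf = create_cf
        # Assumption: sending every schema change to the same node avoids
        # ordering updates by the clocks of two different nodes.
        self._schema_node = schema_node
        # In-process cache only; with several servers the "already exists"
        # check would have to ask the cluster instead.
        self._known = set()

    def ensure_cf(self, name):
        # Only one caller at a time may change the schema.
        with self._lock:
            if name in self._known:
                return False  # already created, nothing to do
            self._create_cf(self._schema_node, name)
            self._known.add(name)
            return True


# Usage with in-process stand-ins: eight threads race to register one device.
calls = []
coordinator = SchemaCoordinator(
    lock=threading.Lock(),
    create_cf=lambda node, name: calls.append((node, name)),
    schema_node="cassandra-node-1",
)
threads = [
    threading.Thread(target=coordinator.ensure_cf, args=("device_42",))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(calls)
```

Whichever thread wins the lock creates the CF; the others find it already
registered and do nothing, so the cluster sees a single schema change.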


On 2012/09/27 16:11 PM, Hiller, Dean wrote:
> We have 1000's of different building devices and we stream data from these
> devices. The format and data from each one varies, so one device has
> temperature at timeX with some other variables, another device has CO2
> percentage and other variables. Every device is unique and streams its own
> data. We dynamically discover devices and register them. Basically, one CF
> or table per thing really makes sense in this environment. While we could
> try to find out which devices "are" similar, this would really be a pain,
> and some devices add some new variable into the equation. NOT only that, but
> researchers can register new datasets and upload them as well, and they do
> NOT necessarily want to share each dataset with other researchers, so we
> have security groups and each CF belongs to security groups. We dynamically
> create CF's on the fly as people register new datasets.
> On top of that, when the data sets get too large, we probably want to
> partition a single CF into time partitions. We could create one CF and put
> all the data in it and have a partition per device, but then a time
> partition will contain data from "multiple" devices, meaning we need to
> shrink our time partition size, whereas if we have a CF per device, the time
> partition can be larger as it is only for that one device.
> THEN, on top of that, we have a meta CF for these devices, so some people
> want to query for streams that match criteria, which returns a CF name, and
> then they query that CF name. So we almost need a query with variables, like
> select cfName from Meta where x = y and then select * from cfName where
> xxxxx. Which we can do today.
> Dean
> From: Marcelo Elias Del Valle
> Date: Thursday, September 27, 2012 8:01 AM
> Subject: Re: 1000's of column families
> Out of curiosity, is it really necessary to have that many CFs?
> I am probably still used to relational databases, where you would use a new
> table only when you need to store a different kind of data. As Cassandra
> stores anything in each CF, it might make sense to have a lot of CFs to
> store your data...
> But why wouldn't you use a single CF with partitions in this case? Wouldn't
> it be the same thing? I am asking because I might learn a new modeling
> technique from the answer.
> []s
> 2012/9/26 Hiller, Dean:
> We are streaming data with 1 stream per 1 CF and we have 1000's of CFs. The
> tools are all geared to analyzing ONE column family at a time :(. If I
> remember correctly, Cassandra supports as many CF's as you want, correct?
> Even though I am going to have tons of fun with limitations on the tools,
> correct?
> (I may end up wrapping nodetool with my own aggregate calls if needed to sum
> up multiple column families and such).
> Thanks,
> Dean
> --
> Marcelo Elias Del Valle
> - @mvallebr
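
For readers following the thread, the two-step lookup Dean describes in the
quoted message (first ask the meta CF for a CF name, then query that CF) can be
sketched roughly as below. This is only an illustration under assumptions:
plain Python lists and dicts stand in for the Meta CF and the per-device CFs,
and every name and value (`find_cf_names`, `query_cf`, `device_17`, the sample
rows) is invented.

```python
# Hypothetical in-memory stand-ins for the Meta CF and the per-device CFs.
meta = [
    {"cfName": "device_17", "building": "A", "kind": "temperature"},
    {"cfName": "device_23", "building": "A", "kind": "co2"},
    {"cfName": "device_31", "building": "B", "kind": "temperature"},
]
column_families = {
    "device_17": [{"time": 1, "temperature": 20.5}, {"time": 2, "temperature": 21.0}],
    "device_23": [{"time": 1, "co2": 0.04}],
    "device_31": [{"time": 1, "temperature": 18.0}],
}


def find_cf_names(meta_rows, **criteria):
    """Step 1: roughly 'select cfName from Meta where x = y'."""
    return [
        row["cfName"]
        for row in meta_rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


def query_cf(cfs, cf_name, predicate):
    """Step 2: roughly 'select * from cfName where <predicate>'."""
    return [row for row in cfs[cf_name] if predicate(row)]


# The CF name returned by step 1 is the "variable" fed into step 2.
names = find_cf_names(meta, building="A", kind="temperature")
rows = [
    row
    for name in names
    for row in query_cf(column_families, name, lambda r: r["time"] >= 2)
]
print(names, rows)
```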
