cassandra-user mailing list archives

From "Hiller, Dean" <>
Subject Re: 1000's of column families
Date Mon, 01 Oct 2012 16:42:32 GMT
Well, I am now thinking of adding a virtual capability to PlayOrm, which we
currently use, to allow grouping entities into one column family.  Right
now each CF is created from a single entity, so this may change for
those entities that declare they are in a single CF group.  This should not
be a very hard change if we decide to do that.
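
A rough sketch of the grouping idea, not PlayOrm's actual implementation:
several entity types share one physical column family, and each row key is
prefixed with the name of its "virtual" CF.  Every name below (PhysicalCF,
make_row_key, the ":" separator) is hypothetical, and a plain dict stands in
for the real column family.

```python
# Hypothetical sketch: many "virtual" column families inside one physical CF.
# A dict stands in for the physical column family; nothing here is PlayOrm's API.

SEPARATOR = ":"  # assumed delimiter; virtual CF names must not contain it


def make_row_key(virtual_cf: str, entity_key: str) -> str:
    """Prefix the entity's key with the name of its virtual CF."""
    if SEPARATOR in virtual_cf:
        raise ValueError("virtual CF name must not contain the separator")
    return virtual_cf + SEPARATOR + entity_key


def split_row_key(row_key: str) -> tuple:
    """Recover (virtual_cf, entity_key) from a physical row key."""
    virtual_cf, entity_key = row_key.split(SEPARATOR, 1)
    return virtual_cf, entity_key


class PhysicalCF:
    """In-memory stand-in for a single physical column family."""

    def __init__(self):
        self.rows = {}

    def insert(self, virtual_cf, entity_key, columns):
        # Merge the new columns into the row addressed by the prefixed key.
        key = make_row_key(virtual_cf, entity_key)
        self.rows.setdefault(key, {}).update(columns)

    def get(self, virtual_cf, entity_key):
        return self.rows.get(make_row_key(virtual_cf, entity_key), {})

    def keys_in(self, virtual_cf):
        """List entity keys of one virtual CF (a full scan in this toy version)."""
        prefix = virtual_cf + SEPARATOR
        return sorted(k[len(prefix):] for k in self.rows if k.startswith(prefix))
```

Two devices with different columns can then live side by side, e.g.
cf.insert("deviceA", "t1", {"temp": "20"}) and
cf.insert("deviceB", "t1", {"co2": "0.04"}), while the cluster only sees
one CF.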

This makes us rely even more on PlayOrm's command line tool (instead of
cassandra-cli), as I can't stand reading hex all the time, nor do I like
switching my "assume validator" to utf8, to decimal, to integer just so I
can read stuff.
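
For illustration only, this is roughly the decoding a tool has to do so that
raw hex becomes readable.  The function name and the type labels are made up
for this sketch and are not options of PlayOrm or cassandra-cli; integers
are assumed to be big-endian two's complement.

```python
def decode_hex(hex_value: str, as_type: str):
    """Decode a hex string as 'utf8' text or as a signed big-endian 'integer'."""
    raw = bytes.fromhex(hex_value)
    if as_type == "utf8":
        return raw.decode("utf-8")
    if as_type == "integer":
        return int.from_bytes(raw, byteorder="big", signed=True)
    raise ValueError("unsupported type: " + as_type)
```

The same bytes read very differently depending on the assumed type, which is
why having to switch the assumption by hand gets tedious.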


On 10/1/12 9:22 AM, "Brian O'Neill" <> wrote:

>We have the same question...
>We have thousands of separate feeds of data as well (20,000+).  To
>date, we've been using a CF per feed strategy, but as we scale this
>thing out to accommodate all of those feeds, we're trying to figure
>out if we're going to blow out the memory.
>The initial documentation for heap sizing had column families in the
>But in the more recent documentation, it looks like they removed the
>column family variable with the introduction of the universal
>We haven't committed either way yet, but given Ed Anuff's presentation
>on virtual keyspaces, we were leaning towards a single column family
>Definitely let us know what you decide.
>On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti
><> wrote:
>> We had some serious trouble with dynamically adding CFs, although last
>> we tried we were using version 0.7, so maybe
>> that's not an issue any more.
>> Our problems were two:
>> - You are (were?) not supposed to add CFs concurrently. Since we had
>> servers talking to the same Cassandra cluster,
>> we had to use distributed locks (Hazelcast) to avoid concurrency.
>> - You must be very careful to add new CFs to different Cassandra nodes.
>> If you do that fast enough, and the clocks of
>> the two servers are skewed, you will severely compromise your schema
>> (Cassandra will not understand in which order the
>> updates must be applied).
>> As I said, this applied to version 0.7, maybe current versions solved
>> these problems.
>> Flavio
>> Il 2012/09/27 16:11 PM, Hiller, Dean ha scritto:
>>> We have 1000's of different building devices and we stream data from
>> devices.  The format and data from each one varies so one device has
>> at timeX with some other variables, another device has CO2 percentage
>>and other
>> variables.  Every device is unique and streams its own data.  We
>> discover devices and register them.  Basically, one CF or table per
>>thing really
>> makes sense in this environment.  While we could try to find out which
>> "are" similar, this would really be a pain and some devices add some new
>> variable into the equation.  NOT only that but researchers can register
>> datasets and upload them as well and each dataset they have they do NOT
>>want to
>> share with other researchers necessarily so we have security groups and
>>each CF
>> belongs to security groups.  We dynamically create CF's on the fly as
>> register new datasets.
>>> On top of that, when the data sets get too large, we probably want to
>> partition a single CF into time partitions.  We could create one CF and
>>put all
>> the data and have a partition per device, but then a time partition
>>will contain
>> "multiple" devices of data meaning we need to shrink our time partition
>> where if we have CF per device, the time partition can be larger as it
>>is only
>> for that one device.
>>> THEN, on top of that, we have a meta CF for these devices so some
>>>people want
>> to query for streams that match criteria AND which returns a CF name
>>and they
>> query that CF name so we almost need a query with variables like select
>> from Meta where x = y and then select * from cfName where xxxxx. Which
>>we can do
>> today.
>>> Dean
>>> From: Marcelo Elias Del Valle
>>> Reply-To: "<>"
>> <<>>
>>> Date: Thursday, September 27, 2012 8:01 AM
>>> To: "<>"
>> <<>>
>>> Subject: Re: 1000's of column families
>>> Out of curiosity, is it really necessary to have that amount of CFs?
>>> I am probably still used to relational databases, where you would use
>>>a new
>> table just in case you need to store different kinds of data. As
>> stores anything in each CF, it might probably make sense to have a lot
>>of CFs to
>> store your data...
>>> But why wouldn't you use a single CF with partitions in this case?
>> Wouldn't it be the same thing? I am asking because I might learn a new modeling
>> with the answer.
>>> []s
>>> 2012/9/26 Hiller, Dean
>>> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
>>> When
>> using the tools they are all geared to analyzing ONE column family at a
>>time :(.
>> If I remember correctly, Cassandra supports as many CF's as you want,
>> Even though I am going to have tons of fun with limitations on the
>> correct?
>>> (I may end up wrapping the node tool with my own aggregate calls if
>>>needed to
>> sum up multiple column families and such).
>>> Thanks,
>>> Dean
>>> --
>>> Marcelo Elias Del Valle
>>> - @mvallebr
>Brian ONeill
>Lead Architect, Health Market Science (
>Apache Cassandra MVP
>twitter: @boneill42
