incubator-cassandra-user mailing list archives

From "Hiller, Dean" <Dean.Hil...@nrel.gov>
Subject Re: 1000's of column families
Date Mon, 01 Oct 2012 16:42:32 GMT
Well, I am now thinking of adding a virtual capability to PlayOrm, which we
currently use, to allow grouping entities into one column family.  Right
now each CF is created from a single entity, so this may change for
those entities that declare they are in a single CF group.  This should not
be a very hard change if we decide to do that.
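
To make the idea concrete, here is a minimal sketch of one way to group
several "virtual" CFs into one physical CF by prefixing every row key with
the virtual CF name.  This is not PlayOrm's API, and it is not necessarily
how the feature would end up being implemented.  The names virtual_key,
split_key and GroupedColumnFamily are made up for illustration, and a plain
Python dict stands in for the physical column family.

```python
# Sketch only: a dict stands in for one physical column family.
# All names here are hypothetical and not part of PlayOrm or Cassandra.

SEP = ":"  # assumed separator between virtual CF name and the real row key


def virtual_key(virtual_cf, row_key):
    """Build the physical row key for a row of a virtual CF."""
    if SEP in virtual_cf:
        raise ValueError("virtual CF name must not contain %r" % SEP)
    return virtual_cf + SEP + row_key


def split_key(physical_key):
    """Recover (virtual CF name, row key) from a physical row key."""
    virtual_cf, _, row_key = physical_key.partition(SEP)
    return virtual_cf, row_key


class GroupedColumnFamily(object):
    """One physical CF holding rows for many virtual CFs."""

    def __init__(self):
        self._rows = {}  # physical row key -> {column name: value}

    def insert(self, virtual_cf, row_key, columns):
        key = virtual_key(virtual_cf, row_key)
        self._rows.setdefault(key, {}).update(columns)

    def get(self, virtual_cf, row_key):
        return dict(self._rows.get(virtual_key(virtual_cf, row_key), {}))

    def rows_for(self, virtual_cf):
        """Return {row key: columns} for a single virtual CF."""
        prefix = virtual_cf + SEP
        return dict(
            (split_key(key)[1], dict(cols))
            for key, cols in self._rows.items()
            if key.startswith(prefix)
        )
```

In the sketch, each device or dataset would get its own virtual CF name
while the cluster only sees one real CF.  Note that rows_for scans every
row here; how that would map onto real Cassandra range queries depends on
the partitioner and is not covered by this sketch.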

This makes us rely even more on PlayOrm's command line tool (instead of
cassandra-cli), as I can't stand reading hex all the time, nor do I like
switching my "assume validator" to utf8, to decimal, to integer just so I
can read stuff.

Later,
Dean

On 10/1/12 9:22 AM, "Brian O'Neill" <bone@alumni.brown.edu> wrote:

>Dean,
>
>We have the same question...
>
>We have thousands of separate feeds of data as well (20,000+).  To
>date, we've been using a CF per feed strategy, but as we scale this
>thing out to accommodate all of those feeds, we're trying to figure
>out if we're going to blow out the memory.
>
>The initial documentation for heap sizing had column families in the
>equation:
>http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing
>
>But in the more recent documentation, it looks like they removed the
>column family variable with the introduction of the universal
>key_cache_size.
>http://www.datastax.com/docs/1.0/operations/tuning#tuning-java-heap-size
>
>We haven't committed either way yet, but given Ed Anuff's presentation
>on virtual keyspaces, we were leaning towards a single column family
>approach:
>http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/?
>
>Definitely let us know what you decide.
>
>-brian
>
>On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti
><f.baronti@list-group.com> wrote:
>> We had some serious trouble with dynamically adding CFs, although last
>>time
>> we tried we were using version 0.7, so maybe
>> that's not an issue any more.
>> Our problems were two:
>> - You are (were?) not supposed to add CFs concurrently. Since we had
>>more
>> servers talking to the same Cassandra cluster,
>> we had to use distributed locks (Hazelcast) to avoid concurrency.
>> - You must be very careful to add new CFs to different Cassandra nodes.
>>If
>> you do that fast enough, and the clocks of
>> the two servers are skewed, you will severely compromise your schema
>> (Cassandra will not understand in which order the
>> updates must be applied).
>>
>> As I said, this applied to version 0.7, maybe current versions solved
>>these
>> problems.
>>
>> Flavio
>>
>>
>> Il 2012/09/27 16:11 PM, Hiller, Dean ha scritto:
>>> We have 1000's of different building devices and we stream data from
>>>these
>> devices.  The format and data from each one varies so one device has
>>temperature
>> at timeX with some other variables, another device has CO2 percentage
>>and other
>> variables.  Every device is unique and streams its own data.  We
>>dynamically
>> discover devices and register them.  Basically, one CF or table per
>>thing really
>> makes sense in this environment.  While we could try to find out which
>>devices
>> "are" similar, this would really be a pain and some devices add some new
>> variable into the equation.  NOT only that but researchers can register
>>new
>> datasets and upload them as well and each dataset they have they do NOT
>>want to
>> share with other researchers necessarily so we have security groups and
>>each CF
>> belongs to security groups.  We dynamically create CF's on the fly as
>>people
>> register new datasets.
>>>
>>> On top of that, when the data sets get too large, we probably want to
>> partition a single CF into time partitions.  We could create one CF and
>>put all
>> the data and have a partition per device, but then a time partition
>>will contain
>> "multiple" devices of data meaning we need to shrink our time partition
>>size
>> where if we have CF per device, the time partition can be larger as it
>>is only
>> for that one device.
>>>
>>> THEN, on top of that, we have a meta CF for these devices so some
>>>people want
>> to query for streams that match criteria AND which returns a CF name
>>and they
>> query that CF name so we almost need a query with variables like select
>>cfName
>> from Meta where x = y and then select * from cfName where xxxxx. Which
>>we can do
>> today.
>>>
>>> Dean
>>>
>>> From: Marcelo Elias Del Valle
>>><mvallebr@gmail.com<mailto:mvallebr@gmail.com>>
>>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
>> <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
>>> Date: Thursday, September 27, 2012 8:01 AM
>>> To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
>> <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
>>> Subject: Re: 1000's of column families
>>>
>>> Out of curiosity, is it really necessary to have that amount of CFs?
>>> I am probably still used to relational databases, where you would use
>>>a new
>> table just in case you need to store different kinds of data. As
>>Cassandra
>> stores anything in each CF, it might probably make sense to have a lot
>>of CFs to
>> store your data...
>>> But why wouldn't you use a single CF with partitions in this case?
>>>Wouldn't
>> it be the same thing? I am asking because I might learn a new modeling
>>technique
>> with the answer.
>>>
>>> []s
>>>
>>> 2012/9/26 Hiller, Dean
>>><Dean.Hiller@nrel.gov<mailto:Dean.Hiller@nrel.gov>>
>>> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
>>> When
>> using the tools they are all geared to analyzing ONE column family at a
>>time :(.
>> If I remember correctly, Cassandra supports as many CF's as you want,
>>correct?
>> Even though I am going to have tons of fun with limitations on the
>>tools,
>> correct?
>>>
>>> (I may end up wrapping the node tool with my own aggregate calls if
>>>needed to
>> sum up multiple column families and such).
>>>
>>> Thanks,
>>> Dean
>>>
>>>
>>>
>>> --
>>> Marcelo Elias Del Valle
>>> http://mvalle.com - @mvallebr
>>>
>>
>>
>
>
>
>-- 
>Brian ONeill
>Lead Architect, Health Market Science (http://healthmarketscience.com)
>Apache Cassandra MVP
>mobile:215.588.6024
>blog: http://brianoneill.blogspot.com/
>twitter: @boneill42

