incubator-cassandra-user mailing list archives

From "Hiller, Dean" <Dean.Hil...@nrel.gov>
Subject Re: 1000's of column families
Date Fri, 28 Sep 2012 11:14:15 GMT
I thought someone said that each column family adds RAM overhead on every node, not just on
a single node. Does it really add RAM on every node? If so, I will eventually run out, and
adding nodes will not help. Was that person wrong? Can anyone confirm this?

Thanks,
Dean
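
To make the two readings of the question concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration of the arithmetic: the per-CF overhead value is a placeholder assumption (not a measured Cassandra figure), and both function names are hypothetical.

```python
# Hypothetical model of the question above; this is not a Cassandra API.
# ASSUMED_OVERHEAD_MB_PER_CF is a placeholder, not a measured figure.
ASSUMED_OVERHEAD_MB_PER_CF = 1.0


def per_node_overhead_if_replicated(num_cfs, num_nodes,
                                    mb_per_cf=ASSUMED_OVERHEAD_MB_PER_CF):
    """Reading 1: every node pays the fixed cost of every column family."""
    # num_nodes is deliberately unused: adding nodes changes nothing here.
    return num_cfs * mb_per_cf


def per_node_overhead_if_divided(num_cfs, num_nodes,
                                 mb_per_cf=ASSUMED_OVERHEAD_MB_PER_CF):
    """Reading 2: the fixed cost is spread evenly across the ring."""
    return num_cfs * mb_per_cf / num_nodes


# Compare both readings for 4000 CFs as the cluster grows.
for nodes in (6, 12, 24):
    print(nodes,
          per_node_overhead_if_replicated(4000, nodes),
          per_node_overhead_if_divided(4000, nodes))
```

Under the first reading the per-node number stays flat no matter how many nodes are added, which is exactly the case where adding nodes would not help.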

From: Robin Verlangen <robin@us2.nl>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, September 27, 2012 11:52 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: 1000's of column families

"so if you add up all the applications
which would be huge and then all the tables which is large, it just keeps
growing.  It is a very nice concept(all data in one location), though we
will see how implementing it goes."

This shouldn't be a real problem for Cassandra. Just add more nodes and every node contains
a smaller piece of the cake (~ring).

Best regards,

Robin Verlangen
Software engineer

W http://www.robinverlangen.nl
E robin@us2.nl


2012/9/27 Hiller, Dean <Dean.Hiller@nrel.gov>
Unfortunately, the security aspect is very strict.  Some make their data
public but there are many projects where due to client contracts, they
cannot make their data public within our company (i.e., other groups in our
company are not allowed to see the data).

Also, currently, we have researchers upload their own datasets as well.
Ideally, they want to see this one noSQL store as the place where all data
for the company lives... ALL of it, so if you add up all the applications
which would be huge and then all the tables which is large, it just keeps
growing.  It is a very nice concept(all data in one location), though we
will see how implementing it goes.

How much RAM overhead is there per column family?  So far we have around 4000
CFs with no issues that I can see.

Dean

On 9/27/12 11:10 AM, "Aaron Turner" <synfinatic@gmail.com>
wrote:

>On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean <Dean.Hiller@nrel.gov>
>wrote:
>> We have 1000's of different building devices and we stream data from
>>these devices.  The format and data from each one varies so one device
>>has temperature at timeX with some other variables, another device has
>>CO2 percentage and other variables.  Every device is unique and streams
>>its own data.  We dynamically discover devices and register them.
>>Basically, one CF or table per thing really makes sense in this
>>environment.  While we could try to find out which devices "are"
>>similar, this would really be a pain and some devices add some new
>>variable into the equation.  NOT only that but researchers can register
>>new datasets and upload them as well and each dataset they have they do
>>NOT want to share with other researchers necessarily so we have security
>>groups and each CF belongs to security groups.  We dynamically create
>>CF's on the fly as people register new datasets.
>>
>> On top of that, when the data sets get too large, we probably want to
>>partition a single CF into time partitions.  We could create one CF and
>>put all the data and have a partition per device, but then a time
>>partition will contain "multiple" devices of data meaning we need to
>>shrink our time partition size where if we have CF per device, the time
>>partition can be larger as it is only for that one device.
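
As a rough illustration of the time-partitioning idea described above, the sketch below buckets a reading into a fixed-width time window for one device. The key layout (`<device>:<bucket start>`), the function name, and the bucket width are all assumptions made for the example, not the scheme actually in use.

```python
from datetime import datetime, timezone


def time_partition_key(device, timestamp, bucket_seconds):
    """Return a hypothetical partition key '<device>:<bucket start, epoch seconds>'.

    `timestamp` should be timezone-aware; a naive datetime would be
    interpreted in the local timezone by datetime.timestamp().
    """
    epoch = int(timestamp.timestamp())
    # Round down to the start of the bucket that contains this reading.
    bucket_start = epoch - (epoch % bucket_seconds)
    return f"{device}:{bucket_start}"


# One-day buckets for a single device (device name is made up).
reading_time = datetime(2012, 9, 27, 15, 11, tzinfo=timezone.utc)
print(time_partition_key("co2-sensor-17", reading_time, 24 * 3600))
```

Because the device name is part of the key, each bucket holds data for one device only, so `bucket_seconds` can be wider than it could be if many devices shared a bucket.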
>>
>> THEN, on top of that, we have a meta CF for these devices so some
>>people want to query for streams that match criteria AND which returns a
>>CF name and they query that CF name so we almost need a query with
>>variables like select cfName from Meta where x = y and then select *
>>from cfName where xxxxx. Which we can do today.
>
>How strict are your security requirements?  If it wasn't for that,
>you'd be much better off storing data on a per-statistic basis than
>per-device.  Hell, you could store everything in a single CF by using
>a composite row key:
>
><devicename>|<stat type>|<instance>
>
>But yeah, there isn't a hard limit for the number of CF's, but there
>is overhead associated with each one and so I wouldn't consider your
>design as scalable.  Generally speaking, hundreds are ok, but
>thousands is pushing it.
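
For illustration, the `<devicename>|<stat type>|<instance>` row key suggested above could be built and parsed on the client side roughly as follows. This is a minimal sketch using plain string concatenation, not Cassandra's own composite types; the helper names and the sample values are hypothetical.

```python
SEPARATOR = "|"


def make_row_key(device_name, stat_type, instance):
    """Join the three components into '<devicename>|<stat type>|<instance>'."""
    parts = (device_name, stat_type, instance)
    for part in parts:
        # Reject components containing the separator so keys stay parseable.
        if SEPARATOR in part:
            raise ValueError(f"component {part!r} contains {SEPARATOR!r}")
    return SEPARATOR.join(parts)


def parse_row_key(row_key):
    """Split a key produced by make_row_key back into its three components."""
    device_name, stat_type, instance = row_key.split(SEPARATOR)
    return device_name, stat_type, instance


key = make_row_key("ahu-3", "temperature", "0")
print(key)
print(parse_row_key(key))
```

With keys like this, every device and statistic lives in one column family, so the number of CFs no longer grows with the number of devices; per-dataset security would then have to be enforced some other way.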
>
>
>
>--
>Aaron Turner
>http://synfin.net/         Twitter: @synfinatic
>http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix &
>Windows
>Those who would give up essential Liberty, to purchase a little temporary
>Safety, deserve neither Liberty nor Safety.
>    -- Benjamin Franklin
>"carpe diem quam minimum credula postero"


