From: Aaron Turner
Date: Fri, 28 Sep 2012 16:05:57 +0100
Subject: Re: 1000's of column families
To: user@cassandra.apache.org

Yeah, you can't scale the number of CF's by adding new nodes to a
cluster. You have to create multiple clusters.

Anyways, I was thinking about your problem and the solution seems to
be to give each team/project their own CF and have them use composite
row keys as I wrote about earlier. Yes, that may mean you store the
data for the same node multiple times, but that's pretty typical with
Cassandra, where you're de-normalizing your data to meet your query
needs, and Cassandra does seem to scale that way.

Also, if you're not already using compression, you should. My
experience with compression and time series data has been pretty
amazing, especially with my CF's where I store a day's worth of data
in a single column as a vector.

That gives you the best of both worlds: you get your per-team security
and I'd assume (ha!) that would dramatically reduce the number of CF's
you have to deal with since it's per-team/project and not per-device.

On Fri, Sep 28, 2012 at 12:14 PM, Hiller, Dean wrote:
> I thought someone was saying each column family added to RAM on every node, not RAM on a single node. It adds RAM on every node??? So eventually, I will run out? Was that person wrong? This would mean adding nodes does not help if he is right. Can anyone confirm this?
>
> Thanks,
> Dean
>
> From: Robin Verlangen
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, September 27, 2012 11:52 PM
> To: "user@cassandra.apache.org"
> Subject: Re: 1000's of column families
>
> "so if you add up all the applications
> which would be huge and then all the tables which is large, it just keeps
> growing. It is a very nice concept(all data in one location), though we
> will see how implementing it goes."
>
> This shouldn't be a real problem for Cassandra. Just add more nodes and every node contains a smaller piece of the cake (~ring).
>
> Best regards,
>
> Robin Verlangen
> Software engineer
>
> W http://www.robinverlangen.nl
> E robin@us2.nl
>
> 2012/9/27 Hiller, Dean
>
> Unfortunately, the security aspect is very strict. Some make their data
> public but there are many projects where due to client contracts, they
> cannot make their data public within our company (i.e. other groups in our
> company are not allowed to see the data).
>
> Also, currently, we have researchers upload their own datasets as well.
> Ideally, they want to see this one noSQL store as the place where all data
> for the company lives... ALL of it, so if you add up all the applications
> which would be huge and then all the tables which is large, it just keeps
> growing.
> It is a very nice concept (all data in one location), though we
> will see how implementing it goes.
>
> How much overhead per column family in RAM? So far we have around 4000
> CFs with no issue that I see yet.
>
> Dean
>
> On 9/27/12 11:10 AM, "Aaron Turner" wrote:
>
>>On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean
>>wrote:
>>> We have 1000's of different building devices and we stream data from
>>>these devices. The format and data from each one varies so one device
>>>has temperature at timeX with some other variables, another device has
>>>CO2 percentage and other variables. Every device is unique and streams
>>>its own data. We dynamically discover devices and register them.
>>>Basically, one CF or table per thing really makes sense in this
>>>environment. While we could try to find out which devices "are"
>>>similar, this would really be a pain and some devices add some new
>>>variable into the equation. NOT only that but researchers can register
>>>new datasets and upload them as well and each dataset they have they do
>>>NOT want to share with other researchers necessarily so we have security
>>>groups and each CF belongs to security groups. We dynamically create
>>>CF's on the fly as people register new datasets.
>>>
>>> On top of that, when the data sets get too large, we probably want to
>>>partition a single CF into time partitions. We could create one CF and
>>>put all the data and have a partition per device, but then a time
>>>partition will contain "multiple" devices of data meaning we need to
>>>shrink our time partition size where if we have CF per device, the time
>>>partition can be larger as it is only for that one device.
>>>
>>> THEN, on top of that, we have a meta CF for these devices so some
>>>people want to query for streams that match criteria AND which returns a
>>>CF name and they query that CF name so we almost need a query with
>>>variables like select cfName from Meta where x = y and then select *
>>>from cfName where xxxxx.
>>>Which we can do today.
>>
>>How strict are your security requirements? If it wasn't for that,
>>you'd be much better off storing data on a per-statistic basis than
>>per-device. Hell, you could store everything in a single CF by using
>>a composite row key:
>>
>>||
>>
>>But yeah, there isn't a hard limit for the number of CF's, but there
>>is overhead associated with each one and so I wouldn't consider your
>>design as scalable. Generally speaking, hundreds are ok, but
>>thousands is pushing it.
>>
>>
>>
>>--
>>Aaron Turner
>>http://synfin.net/ Twitter: @synfinatic
>>http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix &
>>Windows
>>Those who would give up essential Liberty, to purchase a little temporary
>>Safety, deserve neither Liberty nor Safety.
>> -- Benjamin Franklin
>>"carpe diem quam minimum credula postero"
>
>

-- 
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
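
For anyone reading this thread later, here is a rough sketch of the composite row key and "a day's worth of data in a single column as a vector" ideas mentioned above. It is only an illustration under assumptions: the thread does not spell out the key components, so the device/statistic/day parts, the "|" separator, and every function name below are hypothetical, and nothing here talks to Cassandra itself.

```python
import struct
from datetime import datetime, timezone
from typing import List, Sequence

# Assumption: "|" as the separator and (device, statistic, day) as the
# components are illustrative choices, not something the thread specifies.
SEPARATOR = "|"


def make_row_key(device_id: str, statistic: str, day: str) -> str:
    """Join the hypothetical components into one composite row key."""
    parts = (device_id, statistic, day)
    for part in parts:
        if SEPARATOR in part:
            raise ValueError(f"component {part!r} contains the separator")
    return SEPARATOR.join(parts)


def day_bucket(timestamp: float) -> str:
    """Return the UTC day (YYYY-MM-DD) that a Unix timestamp falls in."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


def pack_day_vector(samples: Sequence[float]) -> bytes:
    """Pack one day's samples into a single binary column value."""
    return struct.pack(f">{len(samples)}d", *samples)


def unpack_day_vector(blob: bytes) -> List[float]:
    """Reverse pack_day_vector: 8 bytes per big-endian double."""
    count = len(blob) // 8
    return list(struct.unpack(f">{count}d", blob))


if __name__ == "__main__":
    key = make_row_key("device-42", "temperature", day_bucket(1348844777))
    value = pack_day_vector([21.5, 21.7, 22.0])
    print(key, len(value), unpack_day_vector(value))
```

With one CF per team/project, each team would write rows keyed like this into its own CF, so the number of CFs tracks the number of teams rather than the number of devices.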