Subject: Re: Cassandra for Ad-hoc Aggregation and formula calculation
From: Aaron Morton
Date: Mon, 13 Dec 2010 09:01:58 +1300
To: user@cassandra.apache.org

Nice email Dan. I would also add: if you are still in the initial stages, take a look at Hadoop+Pig. If your source data is write-once, read-many, it may be a better fit, but then you would also need to calculate the aggregates and store them somewhere.

So Cassandra *may* be just what you want. The ability to keep large amounts of data online with high performance, and to remove other servers from your stack, is a definite plus.

Aaron

On 11/12/2010, at 7:01 PM, Dan Hendry wrote:

> Perhaps other, more experienced and reputable contributors to this list can comment, but to be frank: Cassandra is probably not for you (at least for now). I personally feel Cassandra is one of the stronger NoSQL options out there and has the potential to become the de facto standard, but it's not quite there yet and does not inherently meet your requirements.
>
> To give you some background, I started experimenting with Cassandra as a personal project, to avoid having to look through days' worth of server logs (and because I thought it was cool). The project ballooned and has become my organization's primary metrics and analytics platform, which currently processes 200 million+ events/records per day. I doubt any traditional database solution could have performed as well as Cassandra, but the development and operations process has not been without severe growing pains.
>
>> 1. Storing several million data records per day (each record will be a
>> few KB in size) without any data loss.
>
> Absolutely, no problems on this front. A cluster of moderately beefy servers will handle this with no complaints. As long as you are careful to avoid hotspots in your data distribution, Cassandra truly is damn near linearly scalable with hardware.
>
>> 2. Aggregation of certain fields in the stored records, like Avg
>> across time period.
>
> Cassandra cannot do this on its own (by design, and for good reason). There have been efforts to add support for higher-level data processing languages (such as Pig and Hive), but they are not out-of-the-box solutions and are, in my experience, difficult to get working properly. I ended up writing my own data processing / report generation framework that works ridiculously well for my particular case. In relation to your requirements, calculating averages across fields would probably have to be implemented manually (and executed as a periodic, automated task). Although non-trivial, this isn't quite as bad as you might think.
>
>> 3. Using certain existing fields to calculate new values on the fly
>> and store it too.
>
> Not quite sure what you are asking here. To go back to the last point: to calculate anything new, you are probably going to have to load all the records on which that calculation depends into a separate process/server.
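[Editor's note: the client-side pattern Dan describes, load the relevant records into a separate process and compute the aggregate there, might look roughly like the sketch below. Plain Python; `fetch_records` is a hypothetical stand-in for a real Cassandra client read, stubbed with fixed data so the example is self-contained.]

```python
# Sketch of client-side aggregation: pull the relevant records out of
# Cassandra into your own process, then compute the aggregate there.

def fetch_records(hour_key):
    # Hypothetical stand-in for a real Cassandra client read (e.g. a
    # column slice for one row); stubbed with fixed call durations so
    # the sketch is runnable without a cluster.
    return [{"call_duration": d} for d in (120, 45, 300, 95)]

def average_duration(hour_key):
    records = fetch_records(hour_key)
    if not records:
        return 0.0
    return sum(r["call_duration"] for r in records) / float(len(records))

print(average_duration("2010-12-10T16"))  # 140.0
```

In a real deployment this would typically run as the periodic, automated task Dan mentions, writing its result back into another column family.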
> Generally, I would say Cassandra isn't particularly good at 'on the fly' data aggregation tasks (certainly not to the extent an SQL database is). To be fair, that's also not what it is explicitly designed for or advertised to do well.
>
>> 4. We were wondering if pre-aggregation was a good choice (calculating
>> aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
>> need ad-hoc aggregation, does Cassandra support that over this amount
>> of data?
>
> Cassandra is GREAT for accessing/storing/retrieving/post-processing anything that can be pre-computed. If you have been doing any amount of reading, you will likely have heard that in SQL you model your data, while in Cassandra (and most other NoSQL databases) you model your queries (sorry for ripping off whoever said this originally). If there is one thing/concept I have learned about Cassandra, it is this: pre-compute (or asynchronously compute) anything you possibly can, and don't be afraid to write a ridiculous amount to the Cassandra database. In terms of ad-hoc aggregation, there is no nice, simple scripting language for Cassandra data processing (e.g. SQL). That said, you can do most things pretty quickly with a bit of code. Consider that loading a few hundred to a few thousand records (< 3k) can be pretty quick (< 100 ms, often < 10 ms, particularly if they are cached). Our organization basically uses the following approach: 'use Cassandra for generating continuous, 10-second-accuracy time series reports, but MySQL and a production DB replica for any ad-hoc single-value report the boss wants NOW'.
>
> Based on what you have described, it sounds like you are thinking about your problem from a SQL-like point of view: store data once, then query/filter/aggregate it in multiple different ways to obtain useful information. If possible, try to leverage the power of Cassandra and store the data in efficient, per-query pre-optimized forms.
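[Editor's note: the pre-compute advice above maps directly onto Arun's question 4 (1 min / 5 min / 15 min pre-aggregation): at write time, each event is folded into one running aggregate per bucket size, so reading an average later is a single key lookup. A minimal sketch of that bucketing, in plain Python with an in-memory dict standing in for the Cassandra writes; all names are hypothetical.]

```python
from collections import defaultdict

# Bucket sizes in seconds: 1 min, 5 min, 15 min.
BUCKET_SIZES = (60, 300, 900)

# In-memory stand-in for a rollup column family:
# (bucket_size, bucket_start_ts) -> (event_count, value_sum)
rollups = defaultdict(lambda: (0, 0.0))

def record_event(ts, value):
    # At write time, fold the event into one running aggregate per
    # bucket size; a real system would issue Cassandra writes here.
    for size in BUCKET_SIZES:
        key = (size, ts - ts % size)
        count, total = rollups[key]
        rollups[key] = (count + 1, total + value)

def bucket_average(size, bucket_start):
    # Reading a pre-aggregated average is now a simple key lookup.
    count, total = rollups[(size, bucket_start)]
    return total / count if count else 0.0

record_event(1292000000, 10.0)
record_event(1292000030, 20.0)
```

The trade-off is the one Dan names: you write a ridiculous amount (one write per event per bucket size), in exchange for reads that never scan raw records.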
> For example, I can imagine the average call duration being an important parameter in a system analyzing call data records. Instead of storing all the information about a call in one place, store the call duration in a separate column family, with a row for each hour and each column holding a single integer call duration (the column name being a TimeUUID). My metrics system does something similar to this and loads batches of 15,000 records (a column slice) in < 200 ms. By parallelizing across 10 threads loading from different rows, I can compute the average, the standard deviation, and a factor roughly meaning 'how close to Gaussian' for 1 million records in < 5 seconds.
>
> To reiterate, Cassandra is not the solution if you are looking for 'Database: I command thee to give me the average of field x.' That said, I have found its overall data-processing capabilities to be reasonably impressive.
>
> Dan
>
> -----Original Message-----
> From: Arun Cherian [mailto:archerian@gmail.com]
> Sent: December-10-10 16:43
> To: user@cassandra.apache.org
> Subject: Cassandra for Ad-hoc Aggregation and formula calculation
>
> Hi,
>
> I have been reading up on Cassandra for the past few weeks and I am
> highly impressed by the features it offers. At work, we are starting
> work on a product that will handle several million CDRs (Call Data
> Records, each of which can basically be thought of as a line in a
> .CSV file) per day. We will have to store the data and perform
> aggregations and calculations on them. A few veteran RDBMS admin
> friends (we are a small .NET shop; we don't have any in-house DB
> talent) recommended Infobright and NoSQL to us, hence my search. I
> was wondering if Cassandra is a good fit for:
>
> 1. Storing several million data records per day (each record will be a
> few KB in size) without any data loss.
> 2. Aggregation of certain fields in the stored records, like Avg
> across time period.
> 3. Using certain existing fields to calculate new values on the fly
> and store it too.
> 4. We were wondering if pre-aggregation was a good choice (calculating
> aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
> need ad-hoc aggregation, does Cassandra support that over this amount
> of data?
>
> Thanks,
> Arun
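[Editor's note: the parallel aggregation Dan describes above (ten threads loading different rows, then computing the average and standard deviation over the combined values) can be sketched as below. Plain Python; `load_row` is a hypothetical stub replacing the per-row Cassandra slice read, returning deterministic fake data so the sketch runs without a cluster.]

```python
import math
from concurrent.futures import ThreadPoolExecutor

def load_row(row_key):
    # Hypothetical stand-in for a per-row column slice read from
    # Cassandra; deterministic fake data keeps the sketch runnable.
    return [float((row_key * 31 + i) % 100) for i in range(1000)]

def parallel_stats(row_keys, workers=10):
    # Load rows concurrently (the loads would be I/O bound against a
    # real cluster), then one sequential pass for mean and std dev.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = [v for row in pool.map(load_row, row_keys) for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

mean, stddev = parallel_stats(range(10))
```

The parallelism only helps the loading stage; the statistics pass itself is cheap once the values are in memory, which matches Dan's "1 million records in < 5 seconds" observation.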