Subject: Re: Cassandra for Ad-hoc Aggregation and formula calculation
From: Aaron Morton
Date: Mon, 13 Dec 2010 09:01:58 +1300
To: user@cassandra.apache.org

Nice email Dan. I would also add: if you are still in the initial stages, take a look at Hadoop+Pig. If your source data is write-once, read-many, it may be a better fit, but then you would also need to calculate the aggregates and store them somewhere.

So Cassandra *may* be just what you want. The ability to keep large amounts of data online with high performance, and to remove other servers from your stack, is a definite plus.

Aaron

On 11/12/2010, at 7:01 PM, Dan Hendry wrote:

> Perhaps other, more experienced and reputable contributors to this list can comment, but to be frank: Cassandra is probably not for you (at least for now). I personally feel Cassandra is one of the stronger NoSQL options out there and has the potential to become the de facto standard, but it's not quite there yet and does not inherently meet your requirements.
>
> To give you some background, I started experimenting with Cassandra as a personal project, to avoid having to look through days' worth of server logs (and because I thought it was cool). The project ballooned and has become my organization's primary metrics and analytics platform, which currently processes 200 million+ events/records per day. I doubt any traditional database solution could have performed as well as Cassandra, but the development and operations process has not been without severe growing pains.
>
>> 1. Storing several million data records per day (each record will be a
>> few KB in size) without any data loss.
>
> Absolutely, no problems on this front. A cluster of moderately beefy servers will handle this with no complaints. As long as you are careful to avoid hotspots in your data distribution, Cassandra truly is damn near linearly scalable with hardware.
>
>> 2. Aggregation of certain fields in the stored records, like Avg
>> across time period.
>
> Cassandra cannot do this on its own (by design, and for good reason). There have been efforts to add support for higher-level data processing languages (such as Pig and Hive), but they are not out-of-the-box solutions and are, in my experience, difficult to get working properly. I ended up writing my own data processing / report generation framework that works ridiculously well for my particular case. In relation to your requirements, calculating averages across fields would probably have to be implemented manually (and executed as a periodic, automated task). Although non-trivial, this isn't quite as bad as you might think.
>
>> 3. Using certain existing fields to calculate new values on the fly
>> and store it too.
>
> Not quite sure what you are asking here. To go back to the last point: to calculate anything new, you are probably going to have to load all the records on which that calculation depends into a separate process/server.
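[Editor's note: the client-side pattern Dan describes, load the relevant records into a separate process and compute the aggregate there, might look roughly like the sketch below. Plain Python; `fetch_records` is a hypothetical stand-in for a real Cassandra client read, stubbed with fixed data so the example is self-contained.]

```python
# Sketch of client-side aggregation: pull the relevant records out of
# Cassandra into your own process, then compute the aggregate there.

def fetch_records(hour_key):
    # Hypothetical stand-in for a real Cassandra client read (e.g. a
    # column slice for one row); stubbed with fixed call durations so
    # the sketch is runnable without a cluster.
    return [{"call_duration": d} for d in (120, 45, 300, 95)]

def average_duration(hour_key):
    records = fetch_records(hour_key)
    if not records:
        return 0.0
    return sum(r["call_duration"] for r in records) / float(len(records))

print(average_duration("2010-12-10T16"))  # 140.0
```

In a real deployment this would typically run as the periodic, automated task Dan mentions, writing its result back into another column family.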
> Generally, I would say Cassandra isn't particularly good at 'on the fly' data aggregation tasks (certainly not to the extent an SQL database is). To be fair, that's also not what it is explicitly designed for or advertised to do well.
>
>> 4. We were wondering if pre-aggregation was a good choice (calculating
>> aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
>> need ad-hoc aggregation, does Cassandra support that over this amount
>> of data?
>
> Cassandra is GREAT for accessing/storing/retrieving/post-processing anything that can be pre-computed. If you have been doing any amount of reading, you will likely have heard that in SQL you model your data, while in Cassandra (and most other NoSQL databases) you model your queries (sorry for ripping off whoever said this originally). If there is one thing/concept I have learned about Cassandra, it is this: pre-compute (or asynchronously compute) anything you possibly can, and don't be afraid to write a ridiculous amount to the Cassandra database. In terms of ad-hoc aggregation, there is no nice, simple scripting language for Cassandra data processing (e.g. SQL). That said, you can do most things pretty quickly with a bit of code. Consider that loading a few hundred to a few thousand records (< 3k) can be pretty quick (< 100 ms, often < 10 ms, particularly if they are cached). Our organization basically uses the following approach: 'use Cassandra for generating continuous, 10-second-accuracy time series reports, but MySQL and a production DB replica for any ad-hoc single-value report the boss wants NOW'.
>
> Based on what you have described, it sounds like you are thinking about your problem from a SQL-like point of view: store data once, then query/filter/aggregate it in multiple different ways to obtain useful information. If possible, try to leverage the power of Cassandra and store the data in efficient, per-query pre-optimized forms.
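[Editor's note: the pre-compute advice above maps directly onto Arun's question 4 (1 min / 5 min / 15 min pre-aggregation): at write time, each event is folded into one running aggregate per bucket size, so reading an average later is a single key lookup. A minimal sketch of that bucketing, in plain Python with an in-memory dict standing in for the Cassandra writes; all names are hypothetical.]

```python
from collections import defaultdict

# Bucket sizes in seconds: 1 min, 5 min, 15 min.
BUCKET_SIZES = (60, 300, 900)

# In-memory stand-in for a rollup column family:
# (bucket_size, bucket_start_ts) -> (event_count, value_sum)
rollups = defaultdict(lambda: (0, 0.0))

def record_event(ts, value):
    # At write time, fold the event into one running aggregate per
    # bucket size; a real system would issue Cassandra writes here.
    for size in BUCKET_SIZES:
        key = (size, ts - ts % size)
        count, total = rollups[key]
        rollups[key] = (count + 1, total + value)

def bucket_average(size, bucket_start):
    # Reading a pre-aggregated average is now a simple key lookup.
    count, total = rollups[(size, bucket_start)]
    return total / count if count else 0.0

record_event(1292000000, 10.0)
record_event(1292000030, 20.0)
```

The trade-off is the one Dan names: you write a ridiculous amount (one write per event per bucket size), in exchange for reads that never scan raw records.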
> For example, I can imagine the average call duration being an important parameter in a system analyzing call data records. Instead of storing all the information about a call in one place, store the call duration in a separate column family, with a row for each hour and each column holding a single integer call duration (the column name being a TimeUUID). My metrics system does something similar to this and loads batches of 15,000 records (a column slice) in < 200 ms. By parallelizing across 10 threads loading from different rows, I can compute the average, the standard deviation, and a factor roughly meaning 'how close to Gaussian' for 1 million records in < 5 seconds.
>
> To reiterate, Cassandra is not the solution if you are looking for 'Database: I command thee to give me the average of field x.' That said, I have found its overall data-processing capabilities to be reasonably impressive.
>
> Dan
>
> -----Original Message-----
> From: Arun Cherian [mailto:archerian@gmail.com]
> Sent: December-10-10 16:43
> To: user@cassandra.apache.org
> Subject: Cassandra for Ad-hoc Aggregation and formula calculation
>
> Hi,
>
> I have been reading up on Cassandra for the past few weeks and I am
> highly impressed by the features it offers. At work, we are starting
> work on a product that will handle several million CDRs (Call Data
> Records, each of which can basically be thought of as a line in a
> .CSV file) per day. We will have to store the data and perform
> aggregations and calculations on them. A few veteran RDBMS admin
> friends (we are a small .NET shop; we don't have any in-house DB
> talent) recommended Infobright and NoSQL to us, hence my search. I
> was wondering if Cassandra is a good fit for:
>
> 1. Storing several million data records per day (each record will be a
> few KB in size) without any data loss.
> 2. Aggregation of certain fields in the stored records, like Avg
> across time period.
> 3. Using certain existing fields to calculate new values on the fly
> and store it too.
> 4. We were wondering if pre-aggregation was a good choice (calculating
> aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
> need ad-hoc aggregation, does Cassandra support that over this amount
> of data?
>
> Thanks,
> Arun
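[Editor's note: the parallel aggregation Dan describes above (ten threads loading different rows, then computing the average and standard deviation over the combined values) can be sketched as below. Plain Python; `load_row` is a hypothetical stub replacing the per-row Cassandra slice read, returning deterministic fake data so the sketch runs without a cluster.]

```python
import math
from concurrent.futures import ThreadPoolExecutor

def load_row(row_key):
    # Hypothetical stand-in for a per-row column slice read from
    # Cassandra; deterministic fake data keeps the sketch runnable.
    return [float((row_key * 31 + i) % 100) for i in range(1000)]

def parallel_stats(row_keys, workers=10):
    # Load rows concurrently (the loads would be I/O bound against a
    # real cluster), then one sequential pass for mean and std dev.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = [v for row in pool.map(load_row, row_keys) for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

mean, stddev = parallel_stats(range(10))
```

The parallelism only helps the loading stage; the statistics pass itself is cheap once the values are in memory, which matches Dan's "1 million records in < 5 seconds" observation.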