incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Morton <aa...@thelastpickle.com>
Subject Re: Cassandra for Ad-hoc Aggregation and formula calculation
Date Sun, 12 Dec 2010 20:01:58 GMT
Nice email Dan.

I would also add if you are still in the initial stages take a look at Hadoop+Pig. If your
source data is write once read many it may be a better fit, but then you would also need to
calculate the aggregates and store them somewhere. 

So Cassandra *may* be just what you want. The ability to keep large amounts of data online
with a high performance, and remove other servers form your stack is a definite plus.

Aaron 
On 11/12/2010, at 7:01 PM, Dan Hendry <dan.hendry.junk@gmail.com> wrote:

> Perhaps other, more experienced and reputable contributors to this list can comment but
to be frank: Cassandra is probably not for you (at least for now). I personally feel Cassandra
is one of the stronger NoSQL options out there and has the potential to become the defacto
standard; but its not quite there yet and does not inherently meet your requirements.
> 
> To give you some background, I started experimenting with Cassandra as a personal project
to avoid having to look through days worth of server logs (and because I thought it was cool).
The project ballooned and has become my organizations primary metrics and analytics platform
which currently processes 200 million+ events/records per day. I doubt any traditional database
solution could have performed as well as Cassandra but the development and operations process
has not been without severe growing pains. 
> 
>> 1. Storing several million data records per day (each record will be a
>> few KB in size) without any data loss.
> 
> Absolutely, no problems on this front. A cluster of moderately beefy servers will handle
this with no complaints. As long as you are careful to avoid hotspots in your data distribution,
Cassandra truly is damn near linearly scalable with hardware. 
> 
>> 2. Aggregation of certain fields in the stored records, like Avg
>> across time period.
> 
> Cassandra cannot do this on its own (by design and for good reason). There have been
efforts to add support for higher level data processing languages (such as pig and hive) but
they are not 'out of the box solutions' and in my experience, difficult to get working properly.
I ended up writing my own data processing/report generation framework that works ridiculously
well for my particular case. In relation to your requirements, calculating averages across
fields would probably have to be implemented manually (and executed as a periodic, automated
task). Although non-trivial this isn’t quite as bad as you might think.
> 
>> 3. Using certain existing fields to calculate new values on the fly
>> and store it too.
> 
> Not quite sure what you are asking here. To go back to the last point to calculate anything
new, you are probably going to have to load all the records on which that calculation depends
into a separate process/server. Generally, I would say Cassandra isn’t particularly good
at 'on the fly' data aggregation tasks (certainly not at all to the extent an SQL database
is). To be fair, thats also not what it is explicitly designed for or advertised to do well.

> 
>> 4. We were wondering if pre-aggregation was a good choice (calculating
>> aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
>> need ad-hoc aggregation, does Cassandra support that over this amount
>> of data?
> 
> Cassandra is GREAT for accessing/storing/retrieving/post-processing anything that can
be pre-computed. If you have been doing any amount of reading, you will likely have heard
that in SQL you model data, in Cassandra (and most other NoSQL databases) you model your queries
(sorry for ripping off whoever said this originally). If there is one thing/concept I can
say that I have learned about Cassandra is pre-compute (or asynchronously compute) anything
you possibly can and don’t be afraid to write a ridiculous amount to the Cassandra database.
In terms of ad-hoc aggregation, there is no nice simple scripting language for Cassandra data
processing (eg SQL). That said, you can do most things pretty quick with a bit of code. Consider
that loading a few hundred to a few thousand record (< 3k) can be pretty quick (< 100
ms, often < 10 ms particularly if they are cached). Our organization basically uses the
following approach: 'use Cassandra for generating continuous 10 second accuracy time series
reports but MySQL and a production DB replica for any ad-hoc single value report the boss
wants NOW'.
> 
> 
> Based on what you have described, it sounds like you are thinking about your problem
from a SQL-like point of view: store data once then query/filter/aggregate it in multiple
different ways to obtain useful information. If possible try to leverage the power of Cassandra
and store it in efficient and per-query pre-optimized forms. For example, I can imagine the
average call duration being an important parameter in a system analyzing call data records.
Instead of storing all the information about a call in one place, store the 'call duration'
in a separate column family, each row containing a single integer representing call duarations
for a given hour (column name being the TimeUUID). My metrics system does something similar
to this and loads batches of 15,000 records (column slice) in < 200 ms. By parallelizing
across 10 threads loading from different rows, I can process the average, standard deviation
and a factor roughly meaning 'how close to Gaussian' for 1 million records in < 5 seconds.

> 
> To reiterate, Cassandra is not the solution if you are looking for 'Database: I command
thee to give me the average of field x.' That said, I have found its overall data-processing
capabilities to be reasonably impressive.
> 
> Dan
> 
> -----Original Message-----
> From: Arun Cherian [mailto:archerian@gmail.com] 
> Sent: December-10-10 16:43
> To: user@cassandra.apache.org
> Subject: Cassandra for Ad-hoc Aggregation and formula calculation
> 
> Hi,
> 
> I have been reading up on Cassandra for the past few weeks and I am
> highly impressed by the features it offers. At work, we are starting
> work on a product that will handle several million CDR (Call Data
> Record, basically can be thought of as a .CSV file) per day. We will
> have to store the data, and perform aggregations and calculations on
> them. A few veteran RDBMS admin friends (we are a small .NET shop, we
> don't have any in-house DB talent) recommended Infobright and noSQL to
> us, and hence my search. I was wondering if Cassandra is a good fit
> for
> 
> 1. Storing several million data records per day (each record will be a
> few KB in size) without any data loss.
> 2. Aggregation of certain fields in the stored records, like Avg
> across time period.
> 3. Using certain existing fields to calculate new values on the fly
> and store it too.
> 4. We were wondering if pre-aggregation was a good choice (calculating
> aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
> need ad-hoc aggregation, does Cassandra support that over this amount
> of data?
> 
> Thanks,
> Arun
> No virus found in this incoming message.
> Checked by AVG - www.avg.com 
> Version: 9.0.872 / Virus Database: 271.1.1/3307 - Release Date: 12/10/10 02:37:00
> 

Mime
View raw message