incubator-cassandra-user mailing list archives

From "Dan Hendry" <dan.hendry.j...@gmail.com>
Subject RE: Cassandra for Ad-hoc Aggregation and formula calculation
Date Sat, 11 Dec 2010 06:01:31 GMT
Perhaps other, more experienced and reputable contributors to this list can comment, but to
be frank: Cassandra is probably not for you (at least for now). I personally feel Cassandra
is one of the stronger NoSQL options out there and has the potential to become the de facto
standard, but it's not quite there yet and does not inherently meet your requirements.

To give you some background, I started experimenting with Cassandra as a personal project
to avoid having to look through days' worth of server logs (and because I thought it was cool).
The project ballooned and has become my organization's primary metrics and analytics platform,
which currently processes 200 million+ events/records per day. I doubt any traditional database
solution could have performed as well as Cassandra, but the development and operations process
has not been without severe growing pains.

> 1. Storing several million data records per day (each record will be a
> few KB in size) without any data loss.

Absolutely, no problems on this front. A cluster of moderately beefy servers will handle this
with no complaints. As long as you are careful to avoid hotspots in your data distribution,
Cassandra truly is damn near linearly scalable with hardware. 

> 2. Aggregation of certain fields in the stored records, like Avg
> across time period.

Cassandra cannot do this on its own (by design and for good reason). There have been efforts
to add support for higher level data processing languages (such as Pig and Hive) but they
are not 'out of the box' solutions and, in my experience, are difficult to get working properly.
I ended up writing my own data processing/report generation framework that works ridiculously
well for my particular case. In relation to your requirements, calculating averages across
fields would probably have to be implemented manually (and executed as a periodic, automated
task). Although non-trivial, this isn’t quite as bad as you might think.
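
To give a rough idea of what such a periodic task boils down to, here is a minimal Python
sketch. It assumes the records have already been read out of Cassandra into (timestamp, value)
pairs; average_by_period and the sample data are made up for illustration and are not part
of my framework or of any Cassandra client.

```python
from collections import defaultdict
from statistics import mean


def average_by_period(records, period_seconds):
    """Group (timestamp, value) pairs into fixed periods and average each one.

    `records` stands in for rows already loaded from the database
    (an assumption for this sketch; no Cassandra client is used here).
    """
    buckets = defaultdict(list)
    for timestamp, value in records:
        # Round the timestamp down to the start of its period.
        bucket_start = timestamp - (timestamp % period_seconds)
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical call durations (seconds) keyed by epoch-style timestamps.
records = [(0, 30), (1800, 90), (3600, 60), (5400, 120)]
hourly = average_by_period(records, 3600)
```

A scheduled job would run something like this over the most recent slice of data and write
the resulting averages back to their own column family.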

> 3. Using certain existing fields to calculate new values on the fly
> and store it too.

Not quite sure what you are asking here. To go back to the last point: to calculate anything
new, you are probably going to have to load all the records on which that calculation depends
into a separate process/server. Generally, I would say Cassandra isn’t particularly good
at 'on the fly' data aggregation tasks (certainly not at all to the extent an SQL database
is). To be fair, that’s also not what it is explicitly designed for or advertised to do well.


> 4. We were wondering if pre-aggregation was a good choice (calculating
> aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
> need ad-hoc aggregation, does Cassandra support that over this amount
> of data?

Cassandra is GREAT for accessing/storing/retrieving/post-processing anything that can be pre-computed.
If you have been doing any amount of reading, you will likely have heard that in SQL you model
data, while in Cassandra (and most other NoSQL databases) you model your queries (sorry for ripping
off whoever said this originally). If there is one thing I have learned about Cassandra, it is
this: pre-compute (or asynchronously compute) anything you possibly can and don’t
be afraid to write a ridiculous amount to the Cassandra database. In terms of ad-hoc aggregation,
there is no nice, simple scripting language for Cassandra data processing (i.e. no equivalent of SQL). That said,
you can do most things pretty quickly with a bit of code. Consider that loading a few hundred
to a few thousand records (< 3k) can be pretty quick (< 100 ms, often < 10 ms, particularly
if they are cached). Our organization basically uses the following approach: 'use Cassandra
for generating continuous 10 second accuracy time series reports but MySQL and a production
DB replica for any ad-hoc single value report the boss wants NOW'.
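
As an illustration of the 'write a ridiculous amount' idea applied to your 1 min / 5 min /
15 min question, here is a small Python sketch. RollupStore, WINDOWS and the in-memory
dictionary are hypothetical stand-ins for whatever column families you would actually use;
how the running totals are stored and updated safely under concurrent writers in Cassandra
is deliberately left out.

```python
from collections import defaultdict

# Pre-aggregation windows, in seconds (taken from the 1/5/15 minute question).
WINDOWS = {"1min": 60, "5min": 300, "15min": 900}


class RollupStore:
    """In-memory stand-in for pre-aggregated storage (an assumption, not Cassandra)."""

    def __init__(self):
        # (window name, bucket start) -> [running sum, running count]
        self.totals = defaultdict(lambda: [0, 0])

    def record(self, timestamp, value):
        # Every incoming value is written once per window: more writes now,
        # so that later reads become a single cheap lookup.
        for name, width in WINDOWS.items():
            key = (name, timestamp - timestamp % width)
            entry = self.totals[key]
            entry[0] += value
            entry[1] += 1

    def average(self, window, bucket_start):
        entry = self.totals.get((window, bucket_start))
        if entry is None:
            return None
        total, count = entry
        return total / count


store = RollupStore()
for ts, duration in [(10, 40), (70, 80), (310, 120)]:
    store.record(ts, duration)
```

With this layout, store.average("5min", 0) is a single lookup no matter how many raw
records fell into that window, which is the whole point of modelling the query rather
than the data.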


Based on what you have described, it sounds like you are thinking about your problem from
a SQL-like point of view: store data once, then query/filter/aggregate it in multiple different
ways to obtain useful information. If possible, try to leverage the power of Cassandra and
store it in efficient, per-query pre-optimized forms. For example, I can imagine the average
call duration being an important parameter in a system analyzing call data records. Instead
of storing all the information about a call in one place, store the 'call duration' in a separate
column family, each row holding the calls for a given hour and each column holding a single
integer call duration (the column name being a TimeUUID). My metrics system does something
similar to this and loads batches of 15,000 records (column slice) in < 200 ms. By parallelizing
across 10 threads loading from different rows, I can process the average, standard deviation and
a factor roughly meaning 'how close to Gaussian' for 1 million records in < 5 seconds.
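
The shape of that parallel computation can be sketched in Python as follows. This is not my
actual code: FAKE_ROWS and load_row are hypothetical stand-ins for a column slice read from
Cassandra, and the row keys are invented. Each worker reduces one row to (count, sum, sum of
squares), and the partial results are merged at the end.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the column family: row key (one hour) -> call durations.
FAKE_ROWS = {
    "2010-12-10T14": [2, 4, 4, 4],
    "2010-12-10T15": [5, 5, 7, 9],
}


def load_row(row_key):
    """Placeholder for a column-slice read; a real version would query the database."""
    return FAKE_ROWS[row_key]


def summarize_row(row_key):
    """Reduce one row to (count, sum, sum of squares) so results can be merged."""
    values = load_row(row_key)
    return len(values), sum(values), sum(v * v for v in values)


def mean_and_stddev(row_keys, workers=10):
    """Load rows on a thread pool and combine their partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize_row, row_keys))
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / count
    # Population variance via E[x^2] - E[x]^2; clamp tiny negative rounding errors.
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)


avg, stddev = mean_and_stddev(sorted(FAKE_ROWS))
```

Threads suit this because the time is spent waiting on reads rather than on arithmetic. The
sum-of-squares shortcut can lose precision when values are large relative to their spread,
so treat it as a sketch rather than a numerically careful implementation.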

To reiterate, Cassandra is not the solution if you are looking for 'Database: I command thee
to give me the average of field x.' That said, I have found its overall data-processing capabilities
to be reasonably impressive.

Dan

-----Original Message-----
From: Arun Cherian [mailto:archerian@gmail.com] 
Sent: December-10-10 16:43
To: user@cassandra.apache.org
Subject: Cassandra for Ad-hoc Aggregation and formula calculation

Hi,

I have been reading up on Cassandra for the past few weeks and I am
highly impressed by the features it offers. At work, we are starting
work on a product that will handle several million CDR (Call Data
Record, basically can be thought of as a .CSV file) per day. We will
have to store the data, and perform aggregations and calculations on
them. A few veteran RDBMS admin friends (we are a small .NET shop, we
don't have any in-house DB talent) recommended Infobright and noSQL to
us, and hence my search. I was wondering if Cassandra is a good fit
for

1. Storing several million data records per day (each record will be a
few KB in size) without any data loss.
2. Aggregation of certain fields in the stored records, like Avg
across time period.
3. Using certain existing fields to calculate new values on the fly
and store it too.
4. We were wondering if pre-aggregation was a good choice (calculating
aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
need ad-hoc aggregation, does Cassandra support that over this amount
of data?

Thanks,
Arun