cassandra-user mailing list archives

From Chris Gerken <>
Subject Re: General questions about Cassandra
Date Fri, 17 Feb 2012 16:59:06 GMT
In response to an offline question…

There are two usage patterns for Cassandra column families, static and dynamic.  With both
approaches you store objects of a given type into a column family.

With static usage the object type you're persisting has a single key and each row in the column
family maps to a single object.  The value of an object's key is stored in the row key and
each of the object's properties is stored in a column whose name is the name of the property
and whose value is the property value.  A row has as many columns as the object has non-null
property values.  This usage is very much like a traditional relational database table.
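
As a rough illustration, the static mapping can be sketched in plain Python with dicts (the object shape and all names here are made up for the example, not any real Cassandra API):

```python
# Static usage sketch: one row per object, one column per non-null property.

def to_static_row(obj: dict) -> tuple:
    """Map an object to (row_key, columns), skipping null properties."""
    key = obj["id"]
    columns = {name: value for name, value in obj.items()
               if name != "id" and value is not None}
    return key, columns

user = {"id": "user42", "name": "Ada", "email": None, "city": "Austin"}
row_key, columns = to_static_row(user)
# The null "email" property gets no column at all.
```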

With dynamic usage the object type to be persisted has two keys (I'll get to composite keys
in a bit).  With this approach the value of an object's primary key is stored as a row key
and the entire object is stored in a single column whose name is the value of the object's
secondary key and whose value is the entire object (serialized into a ByteBuffer). This results
in persisting potentially many objects in a single row.  All of those objects have the same
primary key, and there are as many columns as there are objects sharing that key.
An example of this approach is a time series column family in which each row holds weather
readings for a different city and each column in a row holds all of the weather observations
for that city at a certain time.  The timestamp is used as a column name and an object holding
all the observations is serialized and stored in the corresponding column value.
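
A minimal sketch of that dynamic, time-series layout, again in plain Python — pickle stands in for the ByteBuffer serialization, and the `store`/`persist` names are invented for illustration:

```python
import pickle

# Dynamic usage sketch: row key = primary key (city), column name =
# secondary key (timestamp), column value = the whole serialized object.
store = {}  # row key -> {column name: serialized object}

def persist(city: str, timestamp: int, observations: dict) -> None:
    row = store.setdefault(city, {})
    row[timestamp] = pickle.dumps(observations)  # stand-in for a ByteBuffer

persist("Austin", 1329494400, {"temp_f": 71, "humidity": 0.40})
persist("Austin", 1329498000, {"temp_f": 73, "humidity": 0.38})

# One row per city, one column per observation time:
reading = pickle.loads(store["Austin"][1329494400])
```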

Cassandra is a really powerful database, but it excels performance-wise with reading and writing
time series data stored using a dynamic column family.

There are variations of the above patterns.  For example, you can use composite types to
define a row key or column name made up of the values of multiple keys.
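
A rough sketch of that idea, modeling a composite column name as an ordered tuple (illustrative only — in Cassandra the ordering would come from the column family's comparator, not application code):

```python
# Composite-name sketch: combine several key values into one ordered tuple.
row = {}

def put(composite_name: tuple, value) -> None:
    row[composite_name] = value

put(("sensor-1", 1329494400), 71)
put(("sensor-1", 1329498000), 73)
put(("sensor-2", 1329494400), 65)

# Columns sort by the composite's components in order, so a slice on the
# first component groups all of one sensor's readings together:
sensor1 = sorted(k for k in row if k[0] == "sensor-1")
```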

I gave a presentation on the topic of Cassandra patterns recently to the Austin Cassandra
Meetup.  You can find my charts in the Meetup archives or posted to my box at the LinkedIn
site below… or contact me offline.

To bring this back to the original question: asking for the ability to apply a Java method
to selected rows makes sense for static column families, but I think the more general need
is to be able to apply a Java method to selected persisted objects in a column family, regardless
of static or dynamic usage.  While I'm on my soapbox, I think this requirement applies to
Pig support as well.
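
To make that concrete, here is a hedged sketch of such a persistence layer — all names (`objects_in`, `apply_to_objects`) are invented for illustration, and the column families are plain dicts standing in for real storage:

```python
# Sketch: a thin layer that hides the static/dynamic difference and
# applies a function to logical persisted objects, not physical rows.

def objects_in(cf: dict, dynamic: bool):
    """Yield objects from a column family: one per column (dynamic)
    or one per row (static)."""
    for row_key, columns in cf.items():
        if dynamic:
            for col_name, obj in columns.items():
                yield obj
        else:
            yield dict(columns, id=row_key)

def apply_to_objects(cf: dict, dynamic: bool, fn):
    return [fn(obj) for obj in objects_in(cf, dynamic)]

static_cf = {"user42": {"name": "Ada"}}
dynamic_cf = {"Austin": {1329494400: {"temp_f": 71},
                         1329498000: {"temp_f": 73}}}

names = apply_to_objects(static_cf, False, lambda o: o["name"])
temps = apply_to_objects(dynamic_cf, True, lambda o: o["temp_f"])
```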


Chris Gerken

On Feb 17, 2012, at 10:07 AM, Chris Gerken wrote:

> Don,
> That's a good idea, but you have to be careful not to preclude the use of dynamic column
families (e.g. CF's with time series-like schemas) which is what Cassandra's best at.  The
right approach is to build your own "ORM"/persistence layer (or generate one with some tools)
that can hide the API differences between static and dynamic CF's.  Once you're there, hadoop
and Pig both come very close to what you're asking for.
> In other words, you should be asking for a means to apply a Java method to selected objects
(not rows) that are persisted in a Cassandra column family.
> thx
> - Chris
> Chris Gerken
> 512.587.5261
> On Feb 17, 2012, at 9:35 AM, Don Smith wrote:
>> Are there plans to build in some sort of map-reduce framework into Cassandra and
CQL?   It seems that users should be able to apply a Java method to selected rows in parallel
on the distributed Cassandra JVMs.   I believe Solandra uses such an integration.
>> Don
>> ________________________________________
>> From: Alessio Cecchi []
>> Sent: Friday, February 17, 2012 4:42 AM
>> To:
>> Subject: General questions about Cassandra
>> Hi,
>> we have developed software that stores logs from mail servers in MySQL,
>> but for huge environments we are developing a version that stores this
>> data in HBase. Raw logs are first normalized once a day, so the output
>> is like this:
>> username, date of login, IP address, protocol
>> username, date of login, IP address, protocol
>> username, date of login, IP address, protocol
>> [...]
>> and afterwards inserted into the database.
>> As I was saying, for huge installations (from 1 to 10 million logins
>> per day, kept for 12 months) we are working with HBase, but I would also
>> consider Cassandra.
>> The advantage of HBase is MapReduce, which makes searching the logs very
>> fast by splitting the "query" concurrently across multiple hosts.
>> Queries will be launched from a web interface (there will be few requests per
>> day) and the search keys are user and time range.
>> But Cassandra seems less complex to manage and simpler to run, so I want
>> to evaluate it instead of HBase.
>> My question is: can Cassandra also split a "query" over the cluster like
>> MapReduce? Reading online, Cassandra seems fast at inserting data but
>> slower than HBase at "querying". Is it really so?
>> We do not want to install Hadoop on top of Cassandra.
>> Any suggestion is welcome :-)
>> --
>> Alessio Cecchi is:
>> @ ILS ->
>> on LinkedIn ->
>> Assistenza Sistemi GNU/Linux ->
>> @ PLUG ->  ex-Presidente, adesso senatore a vita,
>> @ LOLUG ->  Socio
