Subject: Re: General questions about Cassandra
From: Chris Gerken <chrisgerken@mindspring.com>
Date: Fri, 17 Feb 2012 10:59:06 -0600
To: user@cassandra.apache.org

In response to an offline question...

There are two usage patterns for Cassandra column families: static and
dynamic.  With both approaches you store objects of a given type in a
column family.

With static usage, the object type you're persisting has a single key, and
each row in the column family maps to a single object.  The value of an
object's key is stored in the row key, and each of the object's properties
is stored in a column whose name is the property name and whose value is
the property value.  A row has as many columns as the object has non-null
property values.  This usage is very much like traditional relational
database usage.
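
A minimal sketch of the static layout, using plain Python dicts as a
stand-in for a column family.  This is not a Cassandra client API; the
function name and the sample fields are made up for illustration, and it
assumes the key itself is not repeated as a column.

```python
# In-memory model of a static column family: row key -> {column name: value}.
# Plain dicts stand in for Cassandra here; no client library is involved.

def to_static_row(obj, key_field):
    """Return (row_key, columns) for one object, skipping null properties."""
    row_key = obj[key_field]
    columns = {
        name: value
        for name, value in obj.items()
        if name != key_field and value is not None
    }
    return row_key, columns

users = {}  # the hypothetical "users" column family

row_key, columns = to_static_row(
    {"username": "alice", "email": "alice@example.com", "phone": None},
    key_field="username",
)
users[row_key] = columns  # one object per row
print(users)
```

The null "phone" property produces no column at all, which is why the
column count of a row tracks the number of non-null properties.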

With dynamic usage, the object type to be persisted has two keys (I'll get
to composite keys in a bit).  With this approach the value of an object's
primary key is stored as the row key, and the entire object is stored in a
single column whose name is the value of the object's secondary key and
whose value is the object itself (serialized into a ByteBuffer).  This
results in potentially many objects being persisted in a single row.  All
of those objects have the same primary key, and there are as many columns
as there are objects with that primary key.  An example of this approach
is a time series column family in which each row holds weather readings
for a different city and each column in a row holds all of the weather
observations for that city at a certain time.  The timestamp is used as
the column name, and an object holding all the observations is serialized
and stored in the corresponding column value.
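
The weather example can be sketched the same way.  Again this is an
in-memory model built on dicts, not a Cassandra client API: the function
names, the JSON serialization (standing in for "serialized into a
ByteBuffer"), and the sample readings are all assumptions made for
illustration.

```python
import json
from collections import defaultdict

# Dynamic column family model:
# row key (city) -> {column name (timestamp): serialized observation object}
weather = defaultdict(dict)

def store_reading(cf, city, timestamp, observations):
    """Serialize the whole object into one column named by the timestamp."""
    cf[city][timestamp] = json.dumps(observations, sort_keys=True).encode("utf-8")

def readings_between(cf, city, start, end):
    """Return (timestamp, object) pairs for one city, ordered by column name."""
    row = cf.get(city, {})
    return [(ts, json.loads(row[ts])) for ts in sorted(row) if start <= ts <= end]

store_reading(weather, "Austin", 1329498000, {"temp_f": 61, "humidity": 40})
store_reading(weather, "Austin", 1329501600, {"temp_f": 64, "humidity": 38})
store_reading(weather, "Dallas", 1329498000, {"temp_f": 58, "humidity": 45})

print(readings_between(weather, "Austin", 1329500000, 1329510000))
```

Both Austin readings land in the same row, one column each, and a time
range query becomes a slice over the column names of a single row.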

Cassandra is a really powerful database in general, but where it really
excels performance-wise is reading and writing time series data stored
in a dynamic column family.

There are variations on the above patterns.  For example, you can use
composite types to define a row key or column name that is made up of
the values of multiple keys.
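
As a rough illustration of composite keys, Python tuples can model a row
key or column name built from several values.  Tuples compare component
by component, which loosely mirrors how a composite column name is
ordered.  The names below are hypothetical and nothing here is Cassandra
API.

```python
# row key -> {composite column name: value}; tuples model composite types.
readings = {}

def put(cf, row_key, column_name, value):
    """Store one value under a (possibly composite) row key and column name."""
    cf.setdefault(row_key, {})[column_name] = value

def slice_by_first_component(cf, row_key, first_component):
    """Return the columns of a row whose composite name starts with a value."""
    row = cf.get(row_key, {})
    return [(name, row[name]) for name in sorted(row) if name[0] == first_component]

# Composite row key: (country, city).  Composite column name: (timestamp, station).
put(readings, ("US", "Austin"), (1329501600, "station-2"), 64)
put(readings, ("US", "Austin"), (1329498000, "station-7"), 61)
put(readings, ("US", "Austin"), (1329498000, "station-2"), 60)

print(slice_by_first_component(readings, ("US", "Austin"), 1329498000))
```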

I recently gave a presentation on Cassandra patterns to the Austin
Cassandra Meetup.  You can find my charts there in the archives or posted
to my box at the LinkedIn site below... or contact me offline.

To bring this back to the original question: asking for the ability to
apply a Java method to selected rows makes sense for static column
families, but I think the more general need is to be able to apply a
Java method to selected persisted objects in a column family, regardless
of static or dynamic usage.  While I'm on my soapbox, I think this
requirement applies to Pig support as well.

thx

Chris Gerken

chrisgerken@mindspring.com
512.587.5261
http://www.linkedin.com/in/chgerken


On Feb 17, 2012, at 10:07 AM, Chris Gerken wrote:

> Don,
>
> That's a good idea, but you have to be careful not to preclude the use
> of dynamic column families (e.g. CFs with time series-like schemas),
> which is what Cassandra's best at.  The right approach is to build your
> own "ORM"/persistence layer (or generate one with some tools) that can
> hide the API differences between static and dynamic CFs.  Once you're
> there, Hadoop and Pig both come very close to what you're asking for.
>
> In other words, you should be asking for a means to apply a Java method
> to selected objects (not rows) that are persisted in a Cassandra column
> family.
>
> thx
>
> - Chris
>
> Chris Gerken
>
> chrisgerken@mindspring.com
> 512.587.5261
> http://www.linkedin.com/in/chgerken
>
>
> On Feb 17, 2012, at 9:35 AM, Don Smith wrote:
>
>> Are there plans to build some sort of map-reduce framework into
>> Cassandra and CQL?  It seems that users should be able to apply a Java
>> method to selected rows in parallel on the distributed Cassandra JVMs.
>> I believe Solandra uses such an integration.
>>
>> Don
>> ________________________________________
>> From: Alessio Cecchi [alessio@skye.it]
>> Sent: Friday, February 17, 2012 4:42 AM
>> To: user@cassandra.apache.org
>> Subject: General questions about Cassandra
>>
>> Hi,
>>
>> we have developed software that stores logs from mail servers in MySQL,
>> but for huge environments we are developing a version that stores this
>> data in HBase. Once a day, raw logs are first normalized, so the output
>> looks like this:
>>
>> username,date of login, IP Address, protocol
>> username,date of login, IP Address, protocol
>> username,date of login, IP Address, protocol
>> [...]
>>
>> and then inserted into the database.
>>
>> As I was saying, for huge installations (from 1 to 10 million logins
>> per day, kept for 12 months) we are working with HBase, but I would also
>> consider Cassandra.
>>
>> The advantage of HBase is MapReduce, which makes searching the logs very
>> fast by splitting the "query" concurrently across multiple hosts.
>>
>> Queries will be launched from a web interface (there will be few
>> requests per day) and the search keys are user and time range.
>>
>> But Cassandra seems less complex to manage and simpler to run, so I want
>> to evaluate it instead of HBase.
>>
>> My question is: can Cassandra also split a "query" over the cluster like
>> MapReduce? From what I read online, Cassandra seems fast at inserting
>> data but slower than HBase at "querying". Is that really so?
>>
>> We do not want to install Hadoop over Cassandra.
>>
>> Any suggestion is welcome :-)
>>
>> --
>> Alessio Cecchi is:
>> @ ILS ->  http://www.linux.it/~alessice/
>> on LinkedIn ->  http://www.linkedin.com/in/alessice
>> Assistenza Sistemi GNU/Linux ->  http://www.cecchi.biz/
>> @ PLUG ->  ex-Presidente, adesso senatore a vita, http://www.prato.linux.it
>> @ LOLUG ->  Socio http://www.lolug.net
>>
>