Date: Sat, 18 Feb 2012 09:51:52 +0100
From: francesco.tangari.inf@gmail.com
To: user@cassandra.apache.org
Message-ID: <2AF5CC64A6BB43209B4911B3E82DB334@gmail.com>
In-Reply-To: <5FD6573F-C775-4554-97DF-E499C074E999@mindspring.com>
References: <4F3E4B4E.8000406@skye.it> <083A8A9C-AAB5-4807-AB1D-21362E5B6890@mindspring.com> <5FD6573F-C775-4554-97DF-E499C074E999@mindspring.com>
Subject: Re: General questions about Cassandra
X-Mailer: sparrow 1.5 (build 1043)
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="4f3f66a8_643c9869_298"

--4f3f66a8_643c9869_298
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

I suppose he should buy
http://shop.oreilly.com/product/0636920010852.do to get an idea of what
Cassandra can and can't do. That's my personal opinion.

--
francesco.tangari.inf@gmail.com
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Friday, 17 February 2012, at 17:59, Chris Gerken wrote:

> In response to an offline question...
>
> There are two usage patterns for Cassandra column families, static and
> dynamic.
> With both approaches you store objects of a given type into a column
> family.
>
> With static usage, the object type you're persisting has a single key
> and each row in the column family maps to a single object. The value of
> an object's key is stored in the row key, and each of the object's
> properties is stored in a column whose name is the name of the property
> and whose value is the property value. A row has as many columns as
> the object has non-null property values. This usage is very much like
> traditional relational database usage.
>
> With dynamic usage, the object type to be persisted has two keys (I'll
> get to composite keys in a bit). With this approach the value of an
> object's primary key is stored as the row key, and the entire object is
> stored in a single column whose name is the value of the object's
> secondary key and whose value is the entire object (serialized into a
> ByteBuffer). This results in persisting potentially many objects in a
> single row. All of those objects have the same primary key, and there
> are as many columns as there are objects with that primary key. An
> example of this approach is a time series column family in which each
> row holds weather readings for a different city and each column in a
> row holds all of the weather observations for that city at a certain
> time. The timestamp is used as the column name, and an object holding
> all the observations is serialized and stored in the corresponding
> column value.
>
> Cassandra is a really powerful database, and it excels performance-wise
> at reading and writing time series data stored using a dynamic column
> family.
>
> There are variations of the above patterns. For example, you can use
> composite types to define a row key or column name that is made up of
> the values of multiple keys.
>
> I gave a presentation on the topic of Cassandra patterns recently to
> the Austin Cassandra Meetup.
> You can find my charts there in the archives or posted to my box at
> the LinkedIn site below... or contact me offline.
>
> To bring this back to the original question: asking for the ability to
> apply a Java method to selected rows makes sense for static column
> families, but I think the more general need is to be able to apply a
> Java method to selected persisted objects in a column family,
> regardless of static or dynamic usage. While I'm on my soapbox, I think
> this requirement applies to Pig support as well.
>
> thx
>
> Chris Gerken
>
> chrisgerken@mindspring.com
> 512.587.5261
> http://www.linkedin.com/in/chgerken
>
> On Feb 17, 2012, at 10:07 AM, Chris Gerken wrote:
>
> > Don,
> >
> > That's a good idea, but you have to be careful not to preclude the
> > use of dynamic column families (e.g. CFs with time series-like
> > schemas), which is what Cassandra's best at. The right approach is to
> > build your own "ORM"/persistence layer (or generate one with some
> > tools) that can hide the API differences between static and dynamic
> > CFs. Once you're there, Hadoop and Pig both come very close to what
> > you're asking for.
> >
> > In other words, you should be asking for a means to apply a Java
> > method to selected objects (not rows) that are persisted in a
> > Cassandra column family.
> >
> > thx
> >
> > - Chris
> >
> > Chris Gerken
> >
> > chrisgerken@mindspring.com
> > 512.587.5261
> > http://www.linkedin.com/in/chgerken
> >
> > On Feb 17, 2012, at 9:35 AM, Don Smith wrote:
> >
> > > Are there plans to build some sort of map-reduce framework into
> > > Cassandra and CQL? It seems that users should be able to apply a
> > > Java method to selected rows in parallel on the distributed
> > > Cassandra JVMs. I believe Solandra uses such an integration.
> > >
> > > Don
> > > ________________________________________
> > > From: Alessio Cecchi [alessio@skye.it]
> > > Sent: Friday, February 17, 2012 4:42 AM
> > > To: user@cassandra.apache.org
> > > Subject: General questions about Cassandra
> > >
> > > Hi,
> > >
> > > we have developed software that stores logs from mail servers in
> > > MySQL, but for huge environments we are developing a version that
> > > stores this data in HBase. Once a day, the raw logs are first
> > > normalized, so the output looks like this:
> > >
> > > username,date of login, IP Address, protocol
> > > username,date of login, IP Address, protocol
> > > username,date of login, IP Address, protocol
> > > [...]
> > >
> > > and then inserted into the database.
> > >
> > > As I was saying, for huge installations (from 1 to 10 million
> > > logins per day, kept for 12 months) we are working with HBase, but
> > > I would also consider Cassandra.
> > >
> > > The advantage of HBase is MapReduce, which makes searching the logs
> > > very fast by splitting the "query" concurrently across multiple
> > > hosts.
> > >
> > > Queries will be launched from a web interface (there will be few
> > > requests per day), and the search keys are user and time range.
> > >
> > > But Cassandra seems less complex to manage and simpler to run, so I
> > > want to evaluate it instead of HBase.
> > >
> > > My question is: can Cassandra also split a "query" over the cluster
> > > like MapReduce? From what I read online, Cassandra seems fast at
> > > inserting data but slower than HBase at "querying". Is it really so?
> > >
> > > We do not want to install Hadoop over Cassandra.
> > >
> > > Any suggestion is welcome :-)
> > >
> > > --
> > > Alessio Cecchi is:
> > > @ ILS -> http://www.linux.it/~alessice/
> > > on LinkedIn -> http://www.linkedin.com/in/alessice
> > > GNU/Linux systems support -> http://www.cecchi.biz/
> > > @ PLUG -> former president, now senator for life, http://www.prato.linux.it
> > > @ LOLUG -> member, http://www.lolug.net

--4f3f66a8_643c9869_298
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
--4f3f66a8_643c9869_298--
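
For readers who want something concrete, here is a minimal sketch of the static and dynamic usage patterns described in this thread, applied to the mail-login records from the original question. It is only an illustration under stated assumptions: plain Python dictionaries stand in for column families, no real Cassandra client API is used, and every name in it (`static_cf`, `dynamic_cf`, `insert_login`, `slice_columns`, the sample users and addresses) is hypothetical. Values are kept as tuples rather than serialized into a byte buffer.

```python
from bisect import bisect_left, bisect_right

# Static usage: one row per object, one column per non-null property.
# Row key = the object's key; column name = property name.
static_cf = {
    "alice": {"email": "alice@example.org", "city": "Austin"},
}

# Dynamic usage: one row per primary key, one column per secondary key.
# Here row key = username and column name = login timestamp (ISO 8601
# strings, so lexical order equals chronological order). The column value
# holds the whole record for that login.
dynamic_cf = {}


def insert_login(cf, username, timestamp, ip, protocol):
    """Store one login as a single column in the user's row."""
    row = cf.setdefault(username, {})
    row[timestamp] = (ip, protocol)


def slice_columns(cf, username, start, end):
    """Return (timestamp, value) pairs with start <= timestamp <= end."""
    row = cf.get(username, {})
    # Cassandra keeps the columns of a row sorted by name; a dict does
    # not, so this stand-in sorts the names before slicing.
    names = sorted(row)
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    return [(name, row[name]) for name in names[lo:hi]]


insert_login(dynamic_cf, "alice", "2012-02-17T09:35:00", "192.0.2.10", "imap")
insert_login(dynamic_cf, "alice", "2012-02-17T10:07:00", "192.0.2.11", "pop3")
insert_login(dynamic_cf, "alice", "2012-02-18T08:52:00", "192.0.2.10", "imap")
insert_login(dynamic_cf, "bob", "2012-02-17T12:00:00", "198.51.100.7", "smtp")

# Search by user and time range: all of alice's logins on 17 Feb 2012.
feb17 = slice_columns(
    dynamic_cf, "alice", "2012-02-17T00:00:00", "2012-02-17T23:59:59"
)
```

With the username as row key and the timestamp as column name, a search by user and time range touches a single row and a contiguous run of columns, which is the time series layout the thread describes as Cassandra's strong point.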