Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of francesco.tangari.inf@gmail.com
 designates 74.125.83.44 as permitted sender)
Received-SPF: pass (google.com: domain of francesco.tangari.inf@gmail.com
 designates 10.213.19.130 as permitted sender) client-ip=10.213.19.130;
Date: Sat, 18 Feb 2012 09:51:52 +0100
From: francesco.tangari.inf@gmail.com
To: user@cassandra.apache.org
Message-ID: <2AF5CC64A6BB43209B4911B3E82DB334@gmail.com>
In-Reply-To: <5FD6573F-C775-4554-97DF-E499C074E999@mindspring.com>
References: <4F3E4B4E.8000406@skye.it>
 <E43E8BF30ABE584D8E982C1D5E2315FD010D5397@mbx025-e1-nj-2.exch025.domain.local>
 <083A8A9C-AAB5-4807-AB1D-21362E5B6890@mindspring.com>
 <5FD6573F-C775-4554-97DF-E499C074E999@mindspring.com>
Subject: Re: General questions about Cassandra
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="4f3f66a8_643c9869_298"

--4f3f66a8_643c9869_298
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

I suppose he should buy http://shop.oreilly.com/product/0636920010852.do to get an idea of what Cassandra can and can't do. That's just my personal opinion.

-- 
francesco.tangari.inf@gmail.com
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Friday, 17 February 2012, at 17:59, Chris Gerken wrote:

> In response to an offline question…
>
> There are two usage patterns for Cassandra column families: static and dynamic. With both approaches you store objects of a given type in a column family.
>
> With static usage, the object type you're persisting has a single key, and each row in the column family maps to a single object. The value of an object's key is stored in the row key, and each of the object's properties is stored in a column whose name is the name of the property and whose value is the property value. A row has as many columns as the object has non-null property values. This usage is very much like traditional relational database usage.
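
A minimal sketch of the static layout described above, using plain Java maps rather than a real Cassandra client. The class name StaticColumnFamilySketch and its methods are made up for illustration; the only point is that each row holds one object and that a null property produces no column.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for a static column family: row key -> (column name -> value).
// This is NOT a Cassandra client API; it only illustrates the layout.
class StaticColumnFamilySketch {
    private final Map<String, Map<String, String>> rows = new TreeMap<>();

    // Store one object per row; each non-null property becomes one column.
    void put(String rowKey, Map<String, String> properties) {
        Map<String, String> columns = new TreeMap<>();
        for (Map.Entry<String, String> property : properties.entrySet()) {
            if (property.getValue() != null) {
                columns.put(property.getKey(), property.getValue());
            }
        }
        rows.put(rowKey, columns);
    }

    // Return the columns of one row, or null if the row does not exist.
    Map<String, String> get(String rowKey) {
        return rows.get(rowKey);
    }

    public static void main(String[] args) {
        StaticColumnFamilySketch users = new StaticColumnFamilySketch();
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("name", "Alice");
        properties.put("email", "alice@example.com");
        properties.put("phone", null); // null property: no column is written
        users.put("user-1", properties);
        System.out.println(users.get("user-1"));
    }
}
```

The row for "user-1" ends up with two columns, not three, which matches the "as many columns as non-null property values" rule.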
>
> With dynamic usage, the object type to be persisted has two keys (I'll get to composite keys in a bit). With this approach the value of an object's primary key is stored as the row key, and the entire object is stored in a single column whose name is the value of the object's secondary key and whose value is the entire object (serialized into a ByteBuffer). This results in potentially many objects being persisted in a single row. All of those objects have the same primary key, and there are as many columns as there are objects with that primary key. An example of this approach is a time-series column family in which each row holds weather readings for a different city and each column in a row holds all of the weather observations for that city at a certain time. The timestamp is used as the column name, and an object holding all the observations is serialized and stored in the corresponding column value.
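
The weather example can be sketched the same way. This is only an in-memory illustration under stated assumptions: the class DynamicColumnFamilySketch, its methods, and the use of a UTF-8 string as the "serialized object" are all hypothetical, and no Cassandra API is involved. It shows one row per city, one column per timestamp, and a range read over the sorted column names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for a dynamic (time-series) column family:
// row key (city) -> sorted columns (timestamp -> serialized observations).
class DynamicColumnFamilySketch {
    private final Map<String, NavigableMap<Long, ByteBuffer>> rows = new HashMap<>();

    // Each observation object becomes ONE column in the city's row.
    void put(String city, long timestamp, String observations) {
        rows.computeIfAbsent(city, k -> new TreeMap<>())
            .put(timestamp, ByteBuffer.wrap(observations.getBytes(StandardCharsets.UTF_8)));
    }

    // Read all columns of one row whose name lies in [from, to], in timestamp order.
    NavigableMap<Long, String> slice(String city, long from, long to) {
        NavigableMap<Long, String> result = new TreeMap<>();
        NavigableMap<Long, ByteBuffer> columns = rows.get(city);
        if (columns == null) {
            return result;
        }
        for (Map.Entry<Long, ByteBuffer> column : columns.subMap(from, true, to, true).entrySet()) {
            // duplicate() keeps the stored buffer's position untouched
            String value = StandardCharsets.UTF_8.decode(column.getValue().duplicate()).toString();
            result.put(column.getKey(), value);
        }
        return result;
    }

    public static void main(String[] args) {
        DynamicColumnFamilySketch weather = new DynamicColumnFamilySketch();
        weather.put("Austin", 1000L, "temp=20;humidity=40");
        weather.put("Austin", 2000L, "temp=22;humidity=38");
        weather.put("Austin", 3000L, "temp=25;humidity=35");
        System.out.println(weather.slice("Austin", 1500L, 3000L));
    }
}
```

Because the columns within a row are kept sorted by name, a "city plus time range" read touches a single row and one contiguous run of columns.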
>
> Cassandra is a really powerful database, and it particularly excels, performance-wise, at reading and writing time-series data stored in a dynamic column family.
>
> There are variations of the above patterns. For example, you can use composite types to define a row key or column name that is made up of the values of multiple keys.
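
As a rough illustration of a composite column name, the sketch below combines two key values (a timestamp and a sensor name, both hypothetical) into one name and sorts columns by comparing the components in order. It is plain Java, not Cassandra's own composite type, and the class and field names are assumptions made for the example.

```java
import java.util.TreeMap;

class CompositeNameSketch {
    // Hypothetical composite column name made of two key values.
    static final class Name implements Comparable<Name> {
        final long timestamp;
        final String sensor;

        Name(long timestamp, String sensor) {
            this.timestamp = timestamp;
            this.sensor = sensor;
        }

        // Compare component by component: timestamp first, then sensor.
        @Override
        public int compareTo(Name other) {
            int byTime = Long.compare(timestamp, other.timestamp);
            return byTime != 0 ? byTime : sensor.compareTo(other.sensor);
        }

        @Override
        public String toString() {
            return timestamp + ":" + sensor;
        }
    }

    // One row whose columns are keyed by the composite name.
    // TreeMap orders and looks up keys with compareTo only.
    static TreeMap<Name, String> sampleRow() {
        TreeMap<Name, String> row = new TreeMap<>();
        row.put(new Name(2000L, "temp"), "22");
        row.put(new Name(1000L, "temp"), "20");
        row.put(new Name(1000L, "humidity"), "40");
        return row;
    }

    public static void main(String[] args) {
        System.out.println(sampleRow().keySet());
    }
}
```

With this ordering, all columns for one timestamp sit next to each other in the row, so a range over the first component still reads a contiguous run of columns.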
>
> I gave a presentation on the topic of Cassandra patterns recently to the Austin Cassandra Meetup. You can find my charts there in the archives or posted to my box at the LinkedIn site below, or contact me offline.
>
> To bring this back to the original question: asking for the ability to apply a Java method to selected rows makes sense for static column families, but I think the more general need is to be able to apply a Java method to selected persisted objects in a column family, regardless of static or dynamic usage. While I'm on my soapbox, I think this requirement applies to Pig support as well.
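
One way to picture "apply a Java method to selected objects, not rows" is a small persistence-layer interface that hides whether an object came from a whole row (static) or from a single column (dynamic). Everything below is a hypothetical sketch: ObjectSource, ApplyToObjectsSketch, and the use of strings as stand-ins for persisted objects are assumptions, and the maps stand in for column families.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical persistence-layer view: a column family seen as a sequence of objects.
interface ObjectSource<T> {
    Iterable<T> objects();
}

class ApplyToObjectsSketch {
    // Apply `method` to every object that passes `selected`, whatever the layout.
    static <T, R> List<R> apply(ObjectSource<T> source, Predicate<T> selected, Function<T, R> method) {
        List<R> results = new ArrayList<>();
        for (T object : source.objects()) {
            if (selected.test(object)) {
                results.add(method.apply(object));
            }
        }
        return results;
    }

    // Static layout, simplified to one object per row (row key -> object).
    static ObjectSource<String> fromStaticRows(Map<String, String> rows) {
        return () -> rows.values();
    }

    // Dynamic layout: many objects per row, one per column (row key -> column name -> object).
    static ObjectSource<String> fromDynamicRows(Map<String, Map<Long, String>> rows) {
        return () -> {
            List<String> all = new ArrayList<>();
            for (Map<Long, String> columns : rows.values()) {
                all.addAll(columns.values());
            }
            return all;
        };
    }
}
```

The caller of apply never sees rows or columns, only objects, which is the distinction drawn in the paragraph above.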
>
> thx
>
> Chris Gerken
>
> chrisgerken@mindspring.com
> 512.587.5261
> http://www.linkedin.com/in/chgerken
>
> On Feb 17, 2012, at 10:07 AM, Chris Gerken wrote:
>
> > Don,
> >
> > That's a good idea, but you have to be careful not to preclude the use of dynamic column families (e.g. CFs with time-series-like schemas), which is what Cassandra's best at. The right approach is to build your own "ORM"/persistence layer (or generate one with some tools) that can hide the API differences between static and dynamic CFs. Once you're there, Hadoop and Pig both come very close to what you're asking for.
> >
> > In other words, you should be asking for a means to apply a Java method to selected objects (not rows) that are persisted in a Cassandra column family.
> >
> > thx
> >
> > - Chris
> >
> > Chris Gerken
> >
> > chrisgerken@mindspring.com
> > 512.587.5261
> > http://www.linkedin.com/in/chgerken
> >
> > > On Feb 17, 2012, at 9:35 AM, Don Smith wrote:
> > >
> > > Are there plans to build some sort of MapReduce framework into Cassandra and CQL? It seems that users should be able to apply a Java method to selected rows in parallel on the distributed Cassandra JVMs. I believe Solandra uses such an integration.
> > >
> > > Don
> > > ________________________________________
> > > From: Alessio Cecchi [alessio@skye.it]
> > > Sent: Friday, February 17, 2012 4:42 AM
> > > To: user@cassandra.apache.org
> > > Subject: General questions about Cassandra
> > >
> > > Hi,
> > >
> > > We have developed software that stores logs from mail servers in MySQL,
> > > but for huge environments we are developing a version that stores this
> > > data in HBase. Once a day, the raw logs are first normalized, so the
> > > output looks like this:
> > >
> > > username,date of login, IP Address, protocol
> > > username,date of login, IP Address, protocol
> > > username,date of login, IP Address, protocol
> > > [...]
> > >
> > > and are then inserted into the database.
> > >
> > > As I was saying, for huge installations (from 1 to 10 million logins
> > > per day, kept for 12 months) we are working with HBase, but I would
> > > also like to consider Cassandra.
> > >
> > > The advantage of HBase is MapReduce, which makes searching the logs very
> > > fast by splitting the "query" concurrently across multiple hosts.
> > >
> > > Queries will be launched from a web interface (there will be few
> > > requests per day), and the search keys are user and time range.
> > >
> > > But Cassandra seems less complex to manage and simpler to run, so I want
> > > to evaluate it instead of HBase.
> > >
> > > My question is: can Cassandra also split a "query" over the cluster like
> > > MapReduce? From what I have read online, Cassandra seems fast at
> > > inserting data but slower than HBase at querying. Is that really so?
> > >
> > > We do not want to install Hadoop on top of Cassandra.
> > >
> > > Any suggestions are welcome :-)
> > >
> > > --
> > > Alessio Cecchi is:
> > > @ ILS -> http://www.linux.it/~alessice/
> > > on LinkedIn -> http://www.linkedin.com/in/alessice
> > > GNU/Linux Systems Support -> http://www.cecchi.biz/
> > > @ PLUG -> former President, now senator for life, http://www.prato.linux.it
> > > @ LOLUG -> Member http://www.lolug.net


--4f3f66a8_643c9869_298--