From: Chris Gerken
Date: Fri, 17 Feb 2012 10:59:06 -0600
To: user@cassandra.apache.org
Subject: Re: General questions about Cassandra

In response to an offline question…

There are two usage patterns for Cassandra column families, static and dynamic. With both approaches you store objects of a given type into a column family.

With static usage the object type you're persisting has a single key and each row in the column family maps to a single object. The value of an object's key is stored in the row key and each of the object's properties is stored in a column whose name is the name of the property and whose value is the property value. There are the same number of columns in a row as there are non-null property values. This usage is very much like traditional relational database usage.

With dynamic usage the object type to be persisted has two keys (I'll get to composite keys in a bit). With this approach the value of an object's primary key is stored as a row key and the entire object is stored in a single column whose name is the value of the object's secondary key and whose value is the entire object (serialized into a ByteBuffer). This results in persisting potentially many objects in a single row. All of those objects have the same primary key and there are as many columns as there are objects with the same primary key. An example of this approach is a time series column family in which each row holds weather readings for a different city and each column in a row holds all of the weather observations for that city at a certain time.
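
To make the two layouts concrete, here is a minimal sketch in Python that models a column family as a plain nested dictionary. It only illustrates how rows and columns are laid out in memory; it is not the Cassandra client API. The helper names (store_static, store_dynamic, slice_columns), the sample user, and the weather values are all made up for the example, and JSON stands in for the ByteBuffer serialization.

```python
import json

# In-memory stand-in for a column family: {row_key: {column_name: value}}.
# This models the layout only; it is NOT the Cassandra client API.


def store_static(cf, key, obj):
    """Static usage: one row per object, one column per non-null property."""
    cf[key] = {name: value for name, value in obj.items() if value is not None}


def store_dynamic(cf, primary_key, secondary_key, obj):
    """Dynamic usage: one column per object, named by its secondary key."""
    row = cf.setdefault(primary_key, {})
    # The whole object is serialized into the column value (JSON bytes here,
    # as an assumed substitute for a serialized ByteBuffer).
    row[secondary_key] = json.dumps(obj, sort_keys=True).encode("utf-8")


def slice_columns(cf, primary_key, start, end):
    """Return (column_name, object) pairs with start <= column_name <= end.

    Cassandra keeps the columns of a row sorted by name; this sketch
    simply sorts at read time.
    """
    row = cf.get(primary_key, {})
    return [
        (name, json.loads(row[name].decode("utf-8")))
        for name in sorted(row)
        if start <= name <= end
    ]


# Static: the row key is the object's key, the columns are its properties.
users = {}
store_static(users, "u42", {"name": "Ann", "email": None, "city": "Austin"})

# Dynamic: the row key is the city, the column name is the timestamp.
weather = {}
store_dynamic(weather, "Austin", 1329498000, {"temp_f": 61, "humidity": 40})
store_dynamic(weather, "Austin", 1329501600, {"temp_f": 64, "humidity": 38})
store_dynamic(weather, "Dallas", 1329498000, {"temp_f": 58, "humidity": 45})
```

In this model, reading a time range for one city touches a single row and a contiguous run of its columns: slice_columns(weather, "Austin", 1329500000, 1329510000) returns only the second Austin reading.
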
The timestamp is used as a column name and an object holding all the observations is serialized and stored in the corresponding column value.

Cassandra is a really powerful database, and it excels performance-wise at reading and writing time series data stored in a dynamic column family.

There are variations of the above patterns. You can use composite types to define a row key or column name that is made up of the values of multiple keys, for example.

I gave a presentation on the topic of Cassandra patterns recently to the Austin Cassandra Meetup. You can find my charts there in the archives or posted to my box at the linkedin site below… or contact me offline.

To bring this back to the original question: asking for the ability to apply a Java method to selected rows makes sense for static column families, but I think the more general need is to be able to apply a Java method to selected persisted objects in a column family regardless of static or dynamic usage. While I'm on my soapbox, I think this requirement applies to Pig support as well.

thx

Chris Gerken

chrisgerken@mindspring.com
512.587.5261
http://www.linkedin.com/in/chgerken

On Feb 17, 2012, at 10:07 AM, Chris Gerken wrote:

> Don,
>
> That's a good idea, but you have to be careful not to preclude the use of dynamic column families (e.g. CFs with time series-like schemas), which is what Cassandra's best at. The right approach is to build your own "ORM"/persistence layer (or generate one with some tools) that can hide the API differences between static and dynamic CFs. Once you're there, Hadoop and Pig both come very close to what you're asking for.
>
> In other words, you should be asking for a means to apply a Java method to selected objects (not rows) that are persisted in a Cassandra column family.
>
> thx
>
> - Chris
>
> Chris Gerken
>
> chrisgerken@mindspring.com
> 512.587.5261
> http://www.linkedin.com/in/chgerken
>
> On Feb 17, 2012, at 9:35 AM, Don Smith wrote:
>
>> Are there plans to build some sort of map-reduce framework into Cassandra and CQL? It seems that users should be able to apply a Java method to selected rows in parallel on the distributed Cassandra JVMs. I believe Solandra uses such an integration.
>>
>> Don
>> ________________________________________
>> From: Alessio Cecchi [alessio@skye.it]
>> Sent: Friday, February 17, 2012 4:42 AM
>> To: user@cassandra.apache.org
>> Subject: General questions about Cassandra
>>
>> Hi,
>>
>> we have developed software that stores logs from mail servers in MySQL,
>> but for huge environments we are developing a version that stores this
>> data in HBase. Raw logs are first normalized once a day, so the output
>> looks like this:
>>
>> username,date of login, IP Address, protocol
>> username,date of login, IP Address, protocol
>> username,date of login, IP Address, protocol
>> [...]
>>
>> and is then inserted into the database.
>>
>> As I was saying, for huge installations (from 1 to 10 million logins
>> per day, kept for 12 months) we are working with HBase, but I would also
>> consider Cassandra.
>>
>> The advantage of HBase is MapReduce, which makes searching the logs very
>> fast by splitting the "query" concurrently across multiple hosts.
>>
>> Queries will be launched from a web interface (there will be few requests per
>> day) and the search keys are user and time range.
>>
>> But Cassandra seems less complex to manage and simpler to run, so I want
>> to evaluate it instead of HBase.
>>
>> My question is, can Cassandra also split a "query" over the cluster like
>> MapReduce? From what I read online, Cassandra seems fast at inserting data but
>> slower than HBase at querying. Is that really so?
>>
>> We do not want to install Hadoop on top of Cassandra.
>>
>> Any suggestion is welcome :-)
>>
>> --
>> Alessio Cecchi is:
>> @ ILS -> http://www.linux.it/~alessice/
>> on LinkedIn -> http://www.linkedin.com/in/alessice
>> GNU/Linux Systems Support -> http://www.cecchi.biz/
>> @ PLUG -> former President, now senator for life, http://www.prato.linux.it
>> @ LOLUG -> Member http://www.lolug.net