Date: Wed, 31 Aug 2011 11:00:48 +0530
Subject: Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS
From: Tharindu Mathew <mccloud35@gmail.com>
To: user@pig.apache.org
Cc: user@cassandra.apache.org, user@hive.apache.org

Thanks Jeremy. These will be really useful.

On Wed, Aug 31, 2011 at 12:12 AM, Jeremy Hanna wrote:
> I've tried to help out with some UDFs and references that help with our
> use case: https://github.com/jeromatron/pygmalion/
>
> There are some Brisk docs on Pig as well that might be helpful:
> http://www.datastax.com/docs/0.8/brisk/about_pig
>
> On Aug 30, 2011, at 1:30 PM, Tharindu Mathew wrote:
>
> > Thanks Jeremy for your response. That gives me some encouragement that
> > I might be on the right track.
> >
> > I think I need to try out more stuff before coming to a conclusion on
> > Brisk.
> >
> > For Pig operations over Cassandra, I could only find
> > http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there
> > any other resources that you can point me to? There seems to be a lack
> > of samples on this subject.
> >
> > On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna
> > <jeremy.hanna1234@gmail.com> wrote:
> > FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to
> > potentially move to Brisk because of the simplicity of operations there.
> >
> > Not sure what you mean about the true power of Hadoop. In my mind the
> > true power of Hadoop is the ability to parallelize jobs and send each
> > task to where the data resides. HDFS exists to enable that.
> > Brisk is just another HDFS-compatible implementation. If you're
> > already storing your data in Cassandra and are looking to use Hadoop
> > with it, then I would seriously consider using Brisk.
> >
> > That said, Cassandra with Hadoop works fine.
> >
> > On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
> >
> > > Hi Eric,
> > >
> > > Thanks for your response.
> > >
> > > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa wrote:
> > >
> > >> Hi Tharindu, try having a look at Brisk
> > >> (http://www.datastax.com/products/brisk); it integrates Hadoop with
> > >> Cassandra and is shipped with Hive for SQL analysis. You can then
> > >> install Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of
> > >> Hadoop in order to enable data import/export between Hadoop and
> > >> MySQL. Does this sound ok to you?
> > >>
> > > These do sound ok. But I was looking at using something from Apache
> > > itself.
> > >
> > > Brisk sounds nice, but I feel that disregarding HDFS and totally
> > > switching to Cassandra is not the right thing to do. Just my opinion
> > > there. I feel we are not using the true power of Hadoop then.
> > >
> > > I feel Pig has more integration with Cassandra, so I might take a
> > > look there.
> > >
> > > Whichever I choose, I will contribute the code back to the Apache
> > > projects I use. Here's a sample data analysis I do with my language.
> > > Maybe there is no generic way to do what I want to do.
> > >
> > > <get name="NodeId">
> > >   <index name="ServerName" start="" end=""/>
> > >   <!--<index name="nodeId" start="AS" end="FB"/>-->
> > >   <!--<groupBy index="nodeId"/>-->
> > >   <granularity index="timeStamp" type="hour"/>
> > > </get>
> > >
> > > <lookup name="Event"/>
> > >
> > > <aggregate>
> > >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeResult" indexRow="allKeys"/>
> > >
> > > <log/>
> > >
> > > <get name="NodeResult">
> > >   <index name="ServerName" start="" end=""/>
> > >   <groupBy index="ServerName"/>
> > > </get>
> > >
> > > <aggregate>
> > >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeAccumilator" indexRow="allKeys"/>
> > >
> > > <log/>
> > >
> > >> 2011/8/29 Tharindu Mathew <mccloud35@gmail.com>
> > >>
> > >>> Hi,
> > >>>
> > >>> I have an already running system where I define a simple data flow
> > >>> (using a simple custom data flow language) and configure jobs to
> > >>> run against stored data.
> > >>> I use Quartz to schedule and run these jobs, and the data exists
> > >>> on various data stores (mainly Cassandra, but some data exists in
> > >>> RDBMS like MySQL as well).
> > >>>
> > >>> Thinking about scalability and the already existing support for
> > >>> standard data flow languages in the form of Pig and HiveQL, I plan
> > >>> to move my system to Hadoop.
> > >>>
> > >>> I've seen some efforts on the integration of Cassandra and Hadoop.
> > >>> I've been reading up and am still contemplating how to make this
> > >>> change.
> > >>>
> > >>> It would be great to hear the recommended approach for doing this
> > >>> on Hadoop with the integration of Cassandra and other RDBMS. For
> > >>> example, a sample task that already runs on the system is "once
> > >>> every hour, get rows from column family X, aggregate data in
> > >>> columns A, B and C, write back to column family Y, and enter
> > >>> details of the last aggregated row into a table in MySQL".
> > >>>
> > >>> Thanks in advance.
> > >>>
> > >>> --
> > >>> Regards,
> > >>>
> > >>> Tharindu
> > >>
> > >> --
> > >> *Eric Djatsa Yota*
> > >> *Double-degree MSc student in Computer Science Engineering and
> > >> Communication Networks,
> > >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> > >> *Intern at AMADEUS S.A.S. Sophia Antipolis*
> > >> djatsaedy@gmail.com
> > >> *Tel : 0601791859*
> > >
> > > --
> > > Regards,
> > >
> > > Tharindu
> >
> > --
> > Regards,
> >
> > Tharindu

--
Regards,

Tharindu
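For concreteness, the hourly task described in the thread (and the <aggregate> rules in the data-flow sample: CUMULATIVE measures are summed, AVG measures averaged, with hourly granularity on timeStamp) can be sketched as below. This is a minimal illustration only, not any real API: the column family is stubbed with a plain list of dicts, `aggregate_hourly` is an invented name, and the write-back to column family Y and the MySQL bookkeeping are left out.

```python
from collections import defaultdict

# Measure rules taken from the sample's <aggregate> block:
# CUMULATIVE measures are summed, AVG measures are averaged.
CUMULATIVE = ("RequestCount", "ResponseCount")
AVERAGED = ("MaximumResponseTime",)

def aggregate_hourly(rows):
    # Bucket rows by hour of the timeStamp index (epoch seconds),
    # mirroring <granularity index="timeStamp" type="hour"/>.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["timeStamp"] // 3600].append(row)

    results = {}
    for hour, group in buckets.items():
        out = {m: sum(r[m] for r in group) for m in CUMULATIVE}
        for m in AVERAGED:
            out[m] = sum(r[m] for r in group) / len(group)
        results[hour] = out
    return results

# Stand-in for rows read from column family X.
rows = [
    {"timeStamp": 0,    "RequestCount": 10, "ResponseCount": 9, "MaximumResponseTime": 120},
    {"timeStamp": 1800, "RequestCount": 5,  "ResponseCount": 5, "MaximumResponseTime": 80},
    {"timeStamp": 3700, "RequestCount": 2,  "ResponseCount": 2, "MaximumResponseTime": 50},
]
print(aggregate_hourly(rows))
```

In Pig or Hive the same shape becomes a GROUP BY on the hour bucket with SUM/AVG per measure, which is why the thread's custom DSL maps naturally onto those languages.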
