Subject: Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS
From: Jeremy Hanna
Date: Tue, 30 Aug 2011 13:42:11 -0500
To: user@cassandra.apache.org
Cc: user@pig.apache.org, user@hive.apache.org

I've tried to help out with some UDFs and references that help with our use case: https://github.com/jeromatron/pygmalion/

There are also some Brisk docs on Pig that might be helpful: http://www.datastax.com/docs/0.8/brisk/about_pig
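For a concrete starting point, here is a minimal sketch of loading a column family into Pig with CassandraStorage and pulling named columns out with a pygmalion-style UDF. The jar paths, keyspace and column family names, and the exact UDF package are from memory and may differ across versions, so treat it as illustrative only:

-- Illustrative sketch: jar paths, names and the UDF package are assumptions.
-- CassandraStorage picks up its connection settings from the environment
-- (PIG_INITIAL_ADDRESS, PIG_RPC_PORT, PIG_PARTITIONER in the 0.8-era contrib).
register /path/to/apache-cassandra.jar;
register /path/to/pygmalion.jar;

define FromCassandraBag org.pygmalion.udf.FromCassandraBag();

-- CassandraStorage presents each row as (key, columns:{(name, value)}).
rows = LOAD 'cassandra://MyKeyspace/Users'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Flatten the columns of interest into plain fields.
users = FOREACH rows GENERATE
            key,
            FLATTEN(FromCassandraBag('first_name,last_name', columns))
                AS (first_name:chararray, last_name:chararray);

DUMP users;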
On Aug 30, 2011, at 1:30 PM, Tharindu Mathew wrote:

> Thanks Jeremy for your response. That gives me some encouragement that I might be on the right track.
>
> I think I need to try out more stuff before coming to a conclusion on Brisk.
>
> For Pig operations over Cassandra, I could only find http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there any other resources that you can point me to? There seems to be a lack of samples on this subject.
>
> On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna wrote:
> FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to potentially move to Brisk because of the simplicity of operations there.
>
> Not sure what you mean about the true power of Hadoop. In my mind the true power of Hadoop is the ability to parallelize jobs and send each task to where the data resides. HDFS exists to enable that. Brisk is just another HDFS-compatible implementation. If you're already storing your data in Cassandra and are looking to use Hadoop with it, then I would seriously consider using Brisk.
>
> That said, Cassandra with Hadoop works fine.
>
> On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
>
> > Hi Eric,
> >
> > Thanks for your response.
> >
> > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa wrote:
> >
> >> Hi Tharindu, try having a look at Brisk (http://www.datastax.com/products/brisk); it integrates Hadoop with Cassandra and is shipped with Hive for SQL analysis. You can then install Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in order to enable data import/export between Hadoop and MySQL.
> >> Does this sound ok to you?
> >>
> > These do sound ok. But I was looking at using something from Apache itself.
> >
> > Brisk sounds nice, but I feel that disregarding HDFS and totally switching to Cassandra is not the right thing to do. Just my opinion there. I feel we are not using the true power of Hadoop then.
> >
> > I feel Pig has more integration with Cassandra, so I might take a look there.
> >
> > Whichever I choose, I will contribute the code back to the Apache projects I use. Here's a sample data analysis I do with my language. Maybe there is no generic way to do what I want to do.
> >
> >> 2011/8/29 Tharindu Mathew
> >>
> >>> Hi,
> >>>
> >>> I have an already running system where I define a simple data flow (using a simple custom data flow language) and configure jobs to run against stored data. I use Quartz to schedule and run these jobs, and the data exists on various data stores (mainly Cassandra, but some data exists in RDBMS like MySQL as well).
> >>>
> >>> Thinking about scalability and the existing support for standard data flow languages in the form of Pig and HiveQL, I plan to move my system to Hadoop.
> >>>
> >>> I've seen some efforts on the integration of Cassandra and Hadoop. I've been reading up and am still contemplating how to make this change.
> >>>
> >>> It would be great to hear the recommended approach for doing this on Hadoop with the integration of Cassandra and other RDBMS. For example, a sample task that already runs on the system is "once every hour, get rows from column family X, aggregate data in columns A, B and C, write back to column family Y, and enter details of the last aggregated row into a table in MySQL".
> >>>
> >>> Thanks in advance.
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> Tharindu
> >>>
> >>
> >>
> >> --
> >> Eric Djatsa Yota
> >> Double degree MSc student in Computer Science Engineering and Communication Networks
> >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)
> >> Intern at AMADEUS S.A.S Sophia Antipolis
> >> djatsaedy@gmail.com
> >> Tel : 0601791859
> >>
> >
> >
> > --
> > Regards,
> >
> > Tharindu
>
>
> --
> Regards,
>
> Tharindu
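To make the hourly example in the original message a bit more concrete, here is a rough Pig sketch of "get rows from column family X, aggregate columns A, B and C, write back to column family Y, and record the last aggregated row in MySQL", which could be kicked off from Quartz or cron each hour. The keyspace, column and table names, the (key, {(name, value)}) store schema for CassandraStorage, and the piggybank DBStorage constructor arguments are all assumptions rather than tested code:

-- Illustrative sketch only: names are made up and the DBStorage arguments
-- should be checked against your piggybank version.
register /path/to/apache-cassandra.jar;
register /path/to/pygmalion.jar;
register /path/to/piggybank.jar;
register /path/to/mysql-connector-java.jar;

define FromCassandraBag org.pygmalion.udf.FromCassandraBag();

rows = LOAD 'cassandra://MyKeyspace/X'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Pull columns A, B and C out of each row as numeric fields.
vals = FOREACH rows GENERATE
           key,
           FLATTEN(FromCassandraBag('A,B,C', columns)) AS (a:long, b:long, c:long);

-- Aggregate across all rows for this hourly run.
grouped = GROUP vals ALL;
agg = FOREACH grouped GENERATE
          'hourly_summary' AS rowkey,
          SUM(vals.a) AS a_sum, SUM(vals.b) AS b_sum, SUM(vals.c) AS c_sum;

-- Write the summary back to column family Y; on store, CassandraStorage
-- expects tuples of (row key, bag of (column name, value)).
to_y = FOREACH agg GENERATE
           rowkey,
           TOBAG(TOTUPLE('a_sum', a_sum), TOTUPLE('b_sum', b_sum), TOTUPLE('c_sum', c_sum));
STORE to_y INTO 'cassandra://MyKeyspace/Y'
      USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Record the last aggregated row in MySQL via piggybank's DBStorage
-- (assumed arguments: driver, JDBC URL, user, password, insert statement).
STORE agg INTO 'ignored' USING org.apache.pig.piggybank.storage.DBStorage(
      'com.mysql.jdbc.Driver', 'jdbc:mysql://dbhost/analytics', 'dbuser', 'dbpass',
      'INSERT INTO last_aggregated (row_key, a_sum, b_sum, c_sum) VALUES (?, ?, ?, ?)');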