Date: Wed, 31 Aug 2011 11:00:48 +0530
Subject: Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS
From: Tharindu Mathew <mccloud35@gmail.com>
To: user@pig.apache.org
Cc: user@cassandra.apache.org, user@hive.apache.org

Thanks Jeremy. These will be really useful.

On Wed, Aug 31, 2011 at 12:12 AM, Jeremy Hanna wrote:
> I've tried to help out with some UDFs and references that help with our
> use case: https://github.com/jeromatron/pygmalion/
>
> There are some Brisk docs on Pig as well that might be helpful:
> http://www.datastax.com/docs/0.8/brisk/about_pig
>
> On Aug 30, 2011, at 1:30 PM, Tharindu Mathew wrote:
>
> > Thanks Jeremy for your response. That gives me some encouragement that
> > I might be on the right track.
> >
> > I think I need to try out more stuff before coming to a conclusion on
> > Brisk.
> >
> > For Pig operations over Cassandra, I could only find
> > http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there
> > any other resources that you can point me to? There seems to be a lack
> > of samples on this subject.
> >
> > On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna
> > <jeremy.hanna1234@gmail.com> wrote:
> > FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to
> > potentially move to Brisk because of the simplicity of operations there.
> >
> > Not sure what you mean about the true power of Hadoop. In my mind the
> > true power of Hadoop is the ability to parallelize jobs and send each
> > task to where the data resides. HDFS exists to enable that.
> > Brisk is just another HDFS-compatible implementation. If you're
> > already storing your data in Cassandra and are looking to use Hadoop
> > with it, then I would seriously consider using Brisk.
> >
> > That said, Cassandra with Hadoop works fine.
> >
> > On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
> >
> > > Hi Eric,
> > >
> > > Thanks for your response.
> > >
> > > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa wrote:
> > >
> > >> Hi Tharindu, try having a look at Brisk
> > >> (http://www.datastax.com/products/brisk); it integrates Hadoop with
> > >> Cassandra and is shipped with Hive for SQL analysis. You can then
> > >> install Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of
> > >> Hadoop in order to enable data import/export between Hadoop and
> > >> MySQL. Does this sound ok to you?
> > >>
> > > These do sound ok. But I was looking at using something from Apache
> > > itself.
> > >
> > > Brisk sounds nice, but I feel that disregarding HDFS and totally
> > > switching to Cassandra is not the right thing to do. Just my opinion
> > > there. I feel we are not using the true power of Hadoop then.
> > >
> > > I feel Pig has more integration with Cassandra, so I might take a
> > > look there.
> > >
> > > Whichever I choose, I will contribute the code back to the Apache
> > > projects I use. Here's a sample data analysis I do with my language.
> > > Maybe there is no generic way to do what I want to do.
> > >
> > > <get name="NodeId">
> > >   <index name="ServerName" start="" end=""/>
> > >   <!--<index name="nodeId" start="AS" end="FB"/>-->
> > >   <!--<groupBy index="nodeId"/>-->
> > >   <granularity index="timeStamp" type="hour"/>
> > > </get>
> > >
> > > <lookup name="Event"/>
> > >
> > > <aggregate>
> > >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeResult" indexRow="allKeys"/>
> > >
> > > <log/>
> > >
> > > <get name="NodeResult">
> > >   <index name="ServerName" start="" end=""/>
> > >   <groupBy index="ServerName"/>
> > > </get>
> > >
> > > <aggregate>
> > >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeAccumilator" indexRow="allKeys"/>
> > >
> > > <log/>
> > >
> > >> 2011/8/29 Tharindu Mathew <mccloud35@gmail.com>
> > >>
> > >>> Hi,
> > >>>
> > >>> I have an already running system where I define a simple data flow
> > >>> (using a simple custom data flow language) and configure jobs to
> > >>> run against stored data.
> > >>> I use Quartz to schedule and run these jobs, and the data exists
> > >>> on various data stores (mainly Cassandra, but some data exists in
> > >>> RDBMS like MySQL as well).
> > >>>
> > >>> Thinking about scalability and the already existing support for
> > >>> standard data flow languages in the form of Pig and HiveQL, I plan
> > >>> to move my system to Hadoop.
> > >>>
> > >>> I've seen some efforts on the integration of Cassandra and Hadoop.
> > >>> I've been reading up and am still contemplating how to make this
> > >>> change.
> > >>>
> > >>> It would be great to hear the recommended approach for doing this
> > >>> on Hadoop with the integration of Cassandra and other RDBMS. For
> > >>> example, a sample task that already runs on the system is "once
> > >>> every hour, get rows from column family X, aggregate data in
> > >>> columns A, B and C, write back to column family Y, and enter
> > >>> details of the last aggregated row into a table in MySQL".
> > >>>
> > >>> Thanks in advance.
> > >>>
> > >>> --
> > >>> Regards,
> > >>>
> > >>> Tharindu
> > >>
> > >> --
> > >> *Eric Djatsa Yota*
> > >> *Double-degree MSc student in Computer Science Engineering and
> > >> Communication Networks,
> > >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> > >> *Intern at AMADEUS S.A.S. Sophia Antipolis*
> > >> djatsaedy@gmail.com
> > >> *Tel : 0601791859*
> > >
> > > --
> > > Regards,
> > >
> > > Tharindu
> >
> > --
> > Regards,
> >
> > Tharindu

--
Regards,

Tharindu
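For concreteness, the hourly task described in the thread (and the <aggregate> rules in the data-flow sample: CUMULATIVE measures are summed, AVG measures averaged, with hourly granularity on timeStamp) can be sketched as below. This is a minimal illustration only, not any real API: the column family is stubbed with a plain list of dicts, `aggregate_hourly` is an invented name, and the write-back to column family Y and the MySQL bookkeeping are left out.

```python
from collections import defaultdict

# Measure rules taken from the sample's <aggregate> block:
# CUMULATIVE measures are summed, AVG measures are averaged.
CUMULATIVE = ("RequestCount", "ResponseCount")
AVERAGED = ("MaximumResponseTime",)

def aggregate_hourly(rows):
    # Bucket rows by hour of the timeStamp index (epoch seconds),
    # mirroring <granularity index="timeStamp" type="hour"/>.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["timeStamp"] // 3600].append(row)

    results = {}
    for hour, group in buckets.items():
        out = {m: sum(r[m] for r in group) for m in CUMULATIVE}
        for m in AVERAGED:
            out[m] = sum(r[m] for r in group) / len(group)
        results[hour] = out
    return results

# Stand-in for rows read from column family X.
rows = [
    {"timeStamp": 0,    "RequestCount": 10, "ResponseCount": 9, "MaximumResponseTime": 120},
    {"timeStamp": 1800, "RequestCount": 5,  "ResponseCount": 5, "MaximumResponseTime": 80},
    {"timeStamp": 3700, "RequestCount": 2,  "ResponseCount": 2, "MaximumResponseTime": 50},
]
print(aggregate_hourly(rows))
```

In Pig or Hive the same shape becomes a GROUP BY on the hour bucket with SUM/AVG per measure, which is why the thread's custom DSL maps naturally onto those languages.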
