cassandra-user mailing list archives

From Tharindu Mathew <mcclou...@gmail.com>
Subject Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS
Date Wed, 31 Aug 2011 05:30:48 GMT
Thanks Jeremy. These will be really useful.

On Wed, Aug 31, 2011 at 12:12 AM, Jeremy Hanna
<jeremy.hanna1234@gmail.com> wrote:

> I've tried to help out with some UDFs and references that help with our use
> case: https://github.com/jeromatron/pygmalion/
>
> There are some Brisk docs on Pig as well that might be helpful:
> http://www.datastax.com/docs/0.8/brisk/about_pig
>
> On Aug 30, 2011, at 1:30 PM, Tharindu Mathew wrote:
>
> > Thanks Jeremy for your response. That gives me some encouragement that I
> > might be on the right track.
> >
> > I think I need to try out more stuff before coming to a conclusion on
> > Brisk.
> >
> > For Pig operations over Cassandra, I could only find
> > http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there any
> > other resources that you can point me to? There seems to be a lack of samples
> > on this subject.
> >
> > On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna
> > <jeremy.hanna1234@gmail.com> wrote:
> > FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to
> > potentially move to Brisk because of the simplicity of operations there.
> >
> > Not sure what you mean about the true power of Hadoop.  In my mind the
> > true power of Hadoop is the ability to parallelize jobs and send each task
> > to where the data resides.  HDFS exists to enable that.  Brisk is just
> > another HDFS-compatible implementation.  If you're already storing your data
> > in Cassandra and are looking to use Hadoop with it, then I would seriously
> > consider using Brisk.
> >
> > That said, Cassandra with Hadoop works fine.
> >
> > On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
> >
> > > Hi Eric,
> > >
> > > Thanks for your response.
> > >
> > > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <djatsaedy@gmail.com> wrote:
> > >
> > >> Hi Tharindu, try having a look at Brisk
> > >> (http://www.datastax.com/products/brisk). It integrates Hadoop with
> > >> Cassandra and is shipped with Hive for SQL analysis. You can then install
> > >> Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in order
> > >> to enable data import/export between Hadoop and MySQL.
> > >> Does this sound ok to you?
> > >
> > > These do sound ok, but I was looking at using something from Apache itself.
> > >
> > > Brisk sounds nice, but I feel that disregarding HDFS and totally switching
> > > to Cassandra is not the right thing to do. Just my opinion there. I feel we
> > > are not using the true power of Hadoop then.
> > >
> > > I feel Pig has more integration with Cassandra, so I might take a look
> > > there.
> > >
> > > Whichever I choose, I will contribute the code back to the Apache projects I
> > > use. Here's a sample data analysis I do with my language. Maybe there is no
> > > generic way to do what I want to do.
> > >
> > >
> > >
> > > <get name="NodeId">
> > > <index name="ServerName" start="" end=""/>
> > > <!--<index name="nodeId" start="AS" end="FB"/>-->
> > > <!--<groupBy index="nodeId"/>-->
> > > <granularity index="timeStamp" type="hour"/>
> > > </get>
> > >
> > > <lookup name="Event"/>
> > >
> > > <aggregate>
> > > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeResult" indexRow="allKeys"/>
> > >
> > > <log/>
> > >
> > > <get name="NodeResult">
> > > <index name="ServerName" start="" end=""/>
> > > <groupBy index="ServerName"/>
> > > </get>
> > >
> > > <aggregate>
> > > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeAccumilator" indexRow="allKeys"/>
> > >
> > > <log/>
> > >
> > >
> > >> 2011/8/29 Tharindu Mathew <mccloud35@gmail.com>
> > >>
> > >>> Hi,
> > >>>
> > >>> I have an already running system where I define a simple data flow (using
> > >>> a simple custom data flow language) and configure jobs to run against stored
> > >>> data. I use Quartz to schedule and run these jobs, and the data exists on
> > >>> various data stores (mainly Cassandra, but some data exists in an RDBMS such
> > >>> as MySQL as well).
> > >>>
> > >>> Thinking about scalability and the existing support for standard data
> > >>> flow languages in the form of Pig and HiveQL, I plan to move my system to
> > >>> Hadoop.
> > >>>
> > >>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
> > >>> been reading up and am still contemplating how to make this change.
> > >>>
> > >>> It would be great to hear the recommended approach for doing this on Hadoop
> > >>> with the integration of Cassandra and RDBMSs. For example, a sample
> > >>> task that already runs on the system is "once every hour, get rows from
> > >>> column family X, aggregate data in columns A, B and C and write back to
> > >>> column family Y, and enter details of the last aggregated row into a table in
> > >>> MySQL".
> > >>>
> > >>> Thanks in advance.
> > >>>
> > >>> --
> > >>> Regards,
> > >>>
> > >>> Tharindu
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> *Eric Djatsa Yota*
> > >> *Double degree MsC Student in Computer Science Engineering and
> > >> Communication Networks
> > >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> > >> *Intern at AMADEUS S.A.S Sophia Antipolis*
> > >> djatsaedy@gmail.com
> > >> *Tel : 0601791859*
> > >>
> > >>
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Tharindu
> >
> >
> >
> >
> > --
> > Regards,
> >
> > Tharindu
>
>
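
As a rough illustration of the sample data flow quoted above (hourly
granularity on timeStamp, then aggregation of RequestCount, ResponseCount and
MaximumResponseTime per ServerName), here is a minimal Python sketch. It is not
Pig, Hive or Brisk code and it does not talk to Cassandra: plain dicts stand in
for rows, the function names are made up for this example, and it assumes that
CUMULATIVE means a sum and AVG an arithmetic mean, which the thread does not
confirm.

```python
from collections import defaultdict


def hour_bucket(ts):
    """Truncate a UNIX timestamp (in seconds) to the start of its hour."""
    return ts - ts % 3600


def aggregate_hourly(events):
    """Group events by (ServerName, hour) and aggregate the three measures.

    Assumption: CUMULATIVE is treated as a sum, AVG as an arithmetic mean.
    Each event is a dict standing in for a row read from a column family.
    """
    groups = defaultdict(list)
    for event in events:
        key = (event["ServerName"], hour_bucket(event["timeStamp"]))
        groups[key].append(event)

    results = {}
    for key, rows in groups.items():
        results[key] = {
            "RequestCount": sum(r["RequestCount"] for r in rows),
            "ResponseCount": sum(r["ResponseCount"] for r in rows),
            "MaximumResponseTime": (
                sum(r["MaximumResponseTime"] for r in rows) / len(rows)
            ),
        }
    return results


# Hypothetical sample rows; in the real system these would come from Cassandra.
sample_events = [
    {"ServerName": "s1", "timeStamp": 0, "RequestCount": 2,
     "ResponseCount": 1, "MaximumResponseTime": 10.0},
    {"ServerName": "s1", "timeStamp": 1800, "RequestCount": 3,
     "ResponseCount": 3, "MaximumResponseTime": 20.0},
    {"ServerName": "s1", "timeStamp": 3600, "RequestCount": 5,
     "ResponseCount": 4, "MaximumResponseTime": 30.0},
]

hourly = aggregate_hourly(sample_events)
```

In a Pig or Hive job the same shape would roughly correspond to grouping on
(ServerName, hour) and applying SUM and AVG to each group; writing the result
back to a column family and recording the last aggregated row in MySQL are left
out of this sketch.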


-- 
Regards,

Tharindu
