incubator-cassandra-user mailing list archives

From Jeremy Hanna <>
Subject Re: cassandra data to hadoop.
Date Sat, 24 Dec 2011 20:05:54 GMT
We're using cdh3 right now and are in the process of testing out brisk/datastax enterprise.
I know some people who copy data from Cassandra to a completely separate Hadoop
cluster, in EC2 at least.  It just takes a bit longer because of the intermachine copy.  That's
the better approach if you want to keep the two completely separate.  Alternatively, as described
in the wiki link, you can run task trackers and data nodes on your cassandra nodes, but you'll
need to split your cassandra nodes into two virtual DCs: one for analytic nodes and one
for realtime nodes (also described in the wiki link).  We actually use cdh3/cassandra
0.8 in that setup right now and it's working for us.
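
To illustrate the two-virtual-DC grouping, here is a minimal sketch, not our actual
configuration. It assumes the cluster uses PropertyFileSnitch, which reads node placement
from cassandra-topology.properties as ip=DC:RACK lines. The function name, DC names,
rack name and IP addresses are all hypothetical.

```python
# Hypothetical helper: render cassandra-topology.properties content
# (PropertyFileSnitch format, "ip=DC:RACK") for two virtual datacenters.
def topology_properties(realtime_nodes, analytics_nodes, rack="RAC1"):
    lines = []
    # Nodes serving realtime traffic go in one virtual DC.
    for ip in realtime_nodes:
        lines.append(f"{ip}=realtime:{rack}")
    # Nodes that also run task trackers / data nodes go in the other.
    for ip in analytics_nodes:
        lines.append(f"{ip}=analytics:{rack}")
    # Nodes not listed explicitly fall back to the realtime DC (an assumption).
    lines.append(f"default=realtime:{rack}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example addresses are made up.
    print(topology_properties(["10.0.0.1", "10.0.0.2"], ["10.0.1.1"]))
```

In a layout like this, the Hadoop daemons would run only on the analytics nodes, and the
keyspace would presumably be replicated to both DCs so jobs read from local replicas.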

I don't think flume/hue would need much tweaking to work with brisk/datastax enterprise.
We found one patch we needed to have applied to make the oozie configuration better, but
you'd really have to try it out before you know for sure.

Hope that helps - if you have any other specific questions, ask away.


On Dec 24, 2011, at 1:20 AM, ravikumar visweswara wrote:

> Jeremy,
> We use the cloudera distribution for our hadoop cluster, and it may not be possible for us
> to migrate to brisk quickly because of flume/hue dependencies. Did you successfully pull the
> data from an independent cassandra cluster and dump it into a completely disconnected hadoop
> cluster? It would be really helpful if you could elaborate on how to achieve this.
> -R
> On Fri, Dec 23, 2011 at 9:28 AM, Jeremy Hanna <> wrote:
> We do this all the time.  Take a look at
> for some details - you can use mapreduce or pig to get data out of cassandra.  If it's going
> to a separate hadoop cluster, I don't think you'd need to co-locate task trackers or data
> nodes on your cassandra nodes - it would just need to copy over the network though.  We also
> use oozie for job scheduling, fwiw.
> On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:
> > Hello All,
> >
> > I need to dump cassandra data to a hadoop cluster for further analytics.
> > A lot of other relevant data which is not present in cassandra is already available in hdfs
> > for analysis. Both are independent clusters right now.
> > Is there a suggested way to get the data periodically or continuously to HDFS from
> > cassandra? Any ideas or references will be very helpful for me.
> >
> > Thanks and Regards
> > R
