incubator-cassandra-user mailing list archives

From Johan Oskarsson <jo...@oskarsson.nu>
Subject Re: Cassandra and hadoop?
Date Wed, 17 Mar 2010 08:49:04 GMT
Hi Matteo,

* Hadoop MapReduce can talk to Cassandra and process its data just as 
other input formats let it process data from HDFS. But I would not 
recommend treating Cassandra as a first-class replacement for HDFS; they 
are two very different beasts. It will most likely always be a lot faster 
to let MapReduce read data from HDFS. If you are going to run many jobs 
over the same data from Cassandra, I would recommend first running a 
MapReduce job that just copies the data to HDFS.

* The data is fetched from Cassandra using Thrift, so you don't have to 
run the Hadoop nodes on the same machines as Cassandra.

* The input format will try to read from the local node if possible.
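
To illustrate the last point, here is a minimal Python sketch of the 
"prefer the local node" idea. It is not the actual Cassandra input format 
code; pick_endpoint, local_host and replicas are hypothetical names, and 
the fallback to the first listed replica is an assumption made for the 
example.

```python
from typing import Sequence


def pick_endpoint(local_host: str, replicas: Sequence[str]) -> str:
    """Return the node to read a split from, preferring the local one.

    Hypothetical helper: `replicas` stands for the hosts that hold the
    data for one input split, `local_host` for the host running the task.
    """
    if not replicas:
        raise ValueError("split has no replica endpoints")
    # Read locally when this host holds a replica of the split's data.
    if local_host in replicas:
        return local_host
    # Otherwise fall back to a remote replica (assumed: the first listed).
    return replicas[0]
```

With replicas cass-1, cass-2 and cass-3, a task running on cass-2 would 
read locally, while a task on a Hadoop-only machine would read remotely 
from cass-1.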

/Johan

Matteo Caprari wrote:
> Hi.
> 
> I've tried the mapreduce example in 0.6 contrib/wordcount and it
> worked very well.
> 
> I have a shallow understanding of both worlds, so pardon my questions:
> 
> Is the integration with hadoop just 'semantic' (ie map/reduce api is
> only used as query abstraction) or is
> it 'structural' (ie cassandra can 'talk to hadoop' and replace HDFS as
> input source)?
> 
> In practice:
> - If I want to run a distributed mapreduce job on cassandra, does my
> cassandra cluster have to be a hadoop cluster as well?
> - do I get data locality optimization: I reckon cassandra can in
> principle figure out where it is best to execute a
> SlicePredicate/Mapper,
> but to do so it should take over some of the responsibilities of
> hadoop's jobtracker. Does it?
> 
> Thanks.

