cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "HadoopSupport" by jeremyhanna
Date Mon, 09 Apr 2012 12:58:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=47&rev2=48

Comment:
Removing Brisk from cluster config as it will now only confuse people.

  <<Anchor(ClusterConfig)>>
  
  == Cluster Configuration ==
- The simplest way to configure your cluster to run Cassandra with Hadoop is to use Brisk,
the open-source packaging of Cassandra with Hadoop.  That will start the `JobTracker` and
`TaskTracker` processes for you.  It also uses CFS, an HDFS-compatible distributed filesystem
built on Cassandra that removes the need for Hadoop `NameNode` and `DataNode` processes.
For details, see the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]]
and [[http://github.com/riptano/brisk|code]].
+ If you would like to configure a Cassandra cluster yourself so that Hadoop may operate over
its data, it's best to overlay a Hadoop cluster on your Cassandra nodes.  You'll want to
have a separate server for your Hadoop `NameNode`/`JobTracker`.  Then install a Hadoop `TaskTracker`
on each of your Cassandra nodes.  That will allow the `JobTracker` to assign tasks to the
Cassandra nodes that contain data for those tasks.  Also install a Hadoop `DataNode` on each
Cassandra node, since Hadoop requires a distributed filesystem in which to store dependency
jars, static data, and intermediate results.
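To make this concrete, here is a minimal job-driver sketch for reading Cassandra data on such
an overlaid cluster.  It assumes the Cassandra 1.0-era `org.apache.cassandra.hadoop` API
(`ColumnFamilyInputFormat` and `ConfigHelper`); the host, keyspace, and column family names
are hypothetical, and the initial address only needs to point at one reachable Cassandra node.

{{{
import java.nio.ByteBuffer;
import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cassandra-overlay-example");
        job.setJarByClass(CassandraJobDriver.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        Configuration conf = job.getConfiguration();
        // Any live Cassandra node will do; the input format then produces
        // splits whose preferred locations are the nodes owning the data,
        // so the JobTracker can schedule map tasks locally.
        ConfigHelper.setInputInitialAddress(conf, "cassandra-node-1");  // hypothetical host
        ConfigHelper.setInputRpcPort(conf, "9160");                     // default Thrift port
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");  // hypothetical

        // Which columns each map task should receive per row.
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList(ByteBuffer.wrap("name".getBytes())));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        // job.setMapperClass(...), job.setReducerClass(...), output setup, etc.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
}}}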
  
- Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may
operate over its data, it's best to overlay a Hadoop cluster on your Cassandra nodes.  You'll
want to have a separate server for your Hadoop `NameNode`/`JobTracker`.  Then install a Hadoop
`TaskTracker` on each of your Cassandra nodes.  That will allow the `JobTracker` to assign
tasks to the Cassandra nodes that contain data for those tasks.  Also install a Hadoop `DataNode`
on each Cassandra node, since Hadoop requires a distributed filesystem in which to store
dependency jars, static data, and intermediate results.
- 
- The nice thing about having a `TaskTracker` on every node is that you get data locality
and your analytics engine scales with your data. You also never need to shuttle your data
around once you've performed analytics on it: you simply output to Cassandra, and you can
access that data with high random-read performance.
+ The nice thing about having a `TaskTracker` on every node is that you get data locality
and your analytics engine scales with your data. You also never need to shuttle your data
around once you've performed analytics on it: you simply output to Cassandra, and you can
access that data with high random-read performance. Note that Cassandra implements the
same interface as HDFS to achieve data locality.
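For the write-back path, a sketch along these lines (again assuming the 1.0-era
`org.apache.cassandra.hadoop` API, with hypothetical host and column family names) points a
job's output at Cassandra instead of HDFS:

{{{
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraOutputConfig {
    /** Direct a job's output to a Cassandra column family. */
    public static void configure(Job job) {
        Configuration conf = job.getConfiguration();
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
        ConfigHelper.setOutputInitialAddress(conf, "cassandra-node-1");  // hypothetical host
        ConfigHelper.setOutputRpcPort(conf, "9160");                     // default Thrift port
        ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setOutputColumnFamily(conf, "MyKeyspace", "MyResults");  // hypothetical
        // Reducers then emit (ByteBuffer rowKey, List<Mutation>) pairs, which
        // ColumnFamilyOutputFormat applies to Cassandra as batched mutations.
    }
}
}}}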
  
  A note on speculative execution: you may want to disable speculative execution for your
Hadoop jobs that either read from or write to Cassandra.  It isn't required, but it can help
reduce unnecessary load.
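Continuing the driver sketch above, speculative execution can be switched off with the Hadoop
0.20/1.x property names (later releases renamed them to `mapreduce.map.speculative` and
`mapreduce.reduce.speculative`):

{{{
Configuration conf = job.getConfiguration();
// Redundant speculative attempts would issue duplicate reads/writes
// against Cassandra, so turn them off for both map and reduce phases.
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
}}}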
  
