cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "HadoopSupport" by JonathanEllis
Date Wed, 28 Sep 2011 21:17:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=43&rev2=44

  == Overview ==
  Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] functionality against
Cassandra's data store.  Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]],
[[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]].
  
- [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop distribution
called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]])
However this code is no longer going to be maintained by !DataStax.  Future !DataStax development
of Brisk is now part of a pay-for offering.
+ [[http://datastax.com|DataStax]] open-sourced a Cassandra-based Hadoop distribution called
Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]])
Brisk is now part of [[http://www.datastax.com/products/enterprise|DataStax Enterprise]] and
is no longer maintained as a standalone project.
  
  [[#Top|Top]]
  
@@ -35, +35 @@

              ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
  As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml.
See the `README` in the `word_count` and `pig` contrib modules for more details.
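  
  For reference, here is a minimal job-setup sketch against the 0.7-era `ConfigHelper` API as
used in `word_count`; the keyspace, column family, and column name are hypothetical placeholders,
and the authoritative version lives in the `word_count` contrib module:
  
  {{{
  import java.nio.ByteBuffer;
  import java.util.Arrays;
  
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
  import org.apache.cassandra.hadoop.ConfigHelper;
  import org.apache.cassandra.thrift.SlicePredicate;
  
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");
  job.setInputFormatClass(ColumnFamilyInputFormat.class);
  
  // Where and what to read (hypothetical cluster and schema names).
  ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
  ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
  ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
  ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
  
  // Restrict each row to the single column the job cares about.
  SlicePredicate predicate = new SlicePredicate()
          .setColumn_names(Arrays.asList(ByteBuffer.wrap("text".getBytes())));
  ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}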
- 
  
  ==== Output To Cassandra ====
  As of 0.7, Cassandra includes a basic mechanism for outputting data to Cassandra.  The
`contrib/word_count` example in 0.7 contains two reducers - one that outputs data to the
filesystem and one that outputs data to Cassandra (the default) using this new mechanism.  See
that example in the latest release for details.
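  
  As a hedged sketch of what the Cassandra-bound reducer looks like under the 0.7-era API (the
column name and value encoding here are hypothetical; `contrib/word_count` has the authoritative
version):
  
  {{{
  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.util.Collections;
  import java.util.List;
  
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.cassandra.thrift.Column;
  import org.apache.cassandra.thrift.ColumnOrSuperColumn;
  import org.apache.cassandra.thrift.Mutation;
  
  // Emits one Mutation per word, keyed by row key, for ColumnFamilyOutputFormat.
  public class ReducerToCassandra
          extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>>
  {
      @Override
      public void reduce(Text word, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException
      {
          int sum = 0;
          for (IntWritable val : values)
              sum += val.get();
  
          Column c = new Column();
          c.setName(ByteBuffer.wrap("count".getBytes()));  // hypothetical column name
          c.setValue(ByteBuffer.wrap(String.valueOf(sum).getBytes()));
          c.setTimestamp(System.currentTimeMillis());
  
          ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
          cosc.setColumn(c);
          Mutation m = new Mutation();
          m.setColumn_or_supercolumn(cosc);
  
          // Row key is the word itself; ColumnFamilyOutputFormat applies the mutations.
          context.write(ByteBuffer.wrap(word.getBytes(), 0, word.getLength()),
                        Collections.singletonList(m));
      }
  }
  }}}
  The job would also set `ColumnFamilyOutputFormat` as its output format and configure the
output keyspace/column family through `ConfigHelper`.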
@@ -73, +72 @@

  
  == Oozie ==
  [[http://incubator.apache.org/oozie/|Oozie]], the open-source workflow engine originally
from Yahoo!, can be used with Cassandra/Hadoop.  Cassandra configuration information needs
to go into the Oozie action configuration like so:
+ 
  {{{
  <property>
      <name>cassandra.thrift.address</name>
@@ -99, +99 @@

      <value>${cassandraRangeBatchSize}</value>
  </property>
  }}}
- Note that with Oozie you can specify values outright like the partitioner here, or via variable
that is typically found in the properties file.
- One other item of note is that Oozie assumes that it can detect a filemarker for successful
completion of the job.  This means that when writing to Cassandra with, for example, Pig,
the Pig script will succeed but the Oozie job that called it will fail because filemarkers
aren't written to Cassandra.  So when you write to Cassandra with Hadoop, specify this property
to avoid that check.  Oozie will still get completion updates from a callback from the job
tracker, but it just won't look for the filemarker.
+ Note that with Oozie you can specify values outright, like the partitioner here, or via a
variable that is typically found in the properties file. One other item of note is that Oozie
assumes that it can detect a filemarker for successful completion of the job.  This means that
when writing to Cassandra with, for example, Pig, the Pig script will succeed but the Oozie
job that called it will fail because filemarkers aren't written to Cassandra.  So when you
write to Cassandra with Hadoop, specify the following property to avoid that check.  Oozie
will still get completion updates from a callback from the job tracker, but it just won't
look for the filemarker.
+ 
  {{{
  <property>
      <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
      <value>false</value>
  </property>
  }}}
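  
  For completeness, values for `${...}` variables like `${cassandraRangeBatchSize}` above
typically live in the `job.properties` file submitted with the Oozie workflow; a minimal sketch
(the values, and any key other than `cassandraRangeBatchSize`, are hypothetical):
  
  {{{
  # hypothetical values; keys must match the ${...} variables in the action configuration
  cassandraThriftAddress=10.0.0.1
  cassandraRangeBatchSize=1024
  }}}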
- 
  [[#Top|Top]]
  
  <<Anchor(ClusterConfig)>>
  
  == Cluster Configuration ==
- 
  The simplest way to configure your cluster to run Cassandra with Hadoop is to use Brisk,
the open-source packaging of Cassandra with Hadoop.  That will start the `JobTracker` and
`TaskTracker` processes for you.  It also uses CFS, an HDFS-compatible distributed filesystem
built on Cassandra that removes the need for Hadoop `NameNode` and `DataNode` processes.
 For details, see the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]]
and [[http://github.com/riptano/brisk|code]].
  
  Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may
operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes.  You'll
want to have a separate server for your Hadoop `NameNode`/`JobTracker`.  Then install a Hadoop
`TaskTracker` on each of your Cassandra nodes.  That will allow the `JobTracker` to assign
tasks to the Cassandra nodes that contain data for those tasks.  Also install a Hadoop `DataNode`
on each Cassandra node.  Hadoop requires a distributed filesystem in which to store dependency
jars, static data, and intermediate results.
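  
  As an illustrative sketch, each Cassandra/`TaskTracker` node's Hadoop configuration would
then point back at the dedicated server; the host name and ports here are hypothetical:
  
  {{{
  <!-- mapred-site.xml on each Cassandra node -->
  <property>
      <name>mapred.job.tracker</name>
      <value>hadoop-master:8021</value>
  </property>
  
  <!-- core-site.xml: points DataNodes and job clients at the NameNode -->
  <property>
      <name>fs.default.name</name>
      <value>hdfs://hadoop-master:8020</value>
  </property>
  }}}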
@@ -136, +134 @@

  
  == Troubleshooting ==
  If you are running into timeout exceptions, you might need to tweak one or both of these
settings:
+ 
   * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this
depending on your data.  This can be specified either in your Hadoop configuration or programmatically
via `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize` (see the sketch after this list).
   * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis`
in `storage-conf.xml`).  The RPC timeout governs communication between nodes, not between the
client and the cluster.  It can be increased to reduce the chance of timing out.
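  
  For example, lowering the batch size programmatically (1024 is a hypothetical value):
  
  {{{
  // lower the range batch size before submitting the job
  org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1024);
  }}}
  For the second setting, the equivalent change is a larger `rpc_timeout_in_ms` value in
`cassandra.yaml`.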
  
  If you still see timeout exceptions with resultant failed jobs and/or blacklisted tasktrackers,
there are settings that can give Cassandra more latitude before failing the jobs.  An example
of usage (in either the job configuration or the tasktracker's mapred-site.xml):
+ 
  {{{
  <property>
    <name>mapred.max.tracker.failures</name>
@@ -161, +161 @@

  The settings normally default to 4 each, but some find that too conservative.  If you set
them too low, you might have blacklisted tasktrackers and failed jobs because of occasional
timeout exceptions.  If you set them too high, jobs that would otherwise fail quickly take
a long time to fail, sacrificing efficiency.  Keep in mind that raising these values can simply
mask a problem.  It may be that you always want these settings to be higher when operating
against Cassandra.  However, if you run into these exceptions too frequently, there may be
a problem with your Cassandra or Hadoop configuration.
  
  If you are seeing inconsistent data coming back, consider the consistency level that you
are reading and writing at.  The two relevant properties are:
+ 
   * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
   * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
+ 
  Also, the Hadoop integration uses range scans underneath, which do not perform read repair.
However, reading at !ConsistencyLevel.QUORUM will reconcile differences among the nodes read.
See the ReadRepair section as well as the !ConsistencyLevel section of the [[http://wiki.apache.org/cassandra/API|API]]
page for more details.
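  
  For example, to read and write at quorum from a Hadoop job, using the property names above
(the assumption here is that the values are !ConsistencyLevel enum names):
  
  {{{
  <property>
      <name>cassandra.consistencylevel.read</name>
      <value>QUORUM</value>
  </property>
  <property>
      <name>cassandra.consistencylevel.write</name>
      <value>QUORUM</value>
  </property>
  }}}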
  
  [[#Top|Top]]
