cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "HadoopSupport" by jeremyhanna
Date Sat, 16 Oct 2010 16:29:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by jeremyhanna.
The comment on this change is: Trying to update the hadoop support page with more recent info
+ more structure for linking..
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=14&rev2=15

--------------------------------------------------

+ <<Anchor(Top)>>
+ 
+ == Contents ==
+  * [[#Overview|Overview]]
+  * [[#MapReduce|MapReduce Support]]
+  * [[#Pig|Pig Support]]
+  * [[#Hive|Hive Support]]
+ 
+ <<Anchor(Overview)>>
+ 
  == Overview ==
- Cassandra version 0.6 and later enable certain Hadoop functionality against Cassandra's
data store.  Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]]
and [[http://hadoop.apache.org/pig/|Pig]].
+ Cassandra version 0.6 and later enable certain Hadoop functionality against Cassandra's
data store.  Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]],
[[http://hadoop.apache.org/pig/|Pig]] and [[http://hive.apache.org/|Hive]].
+ 
+ [[#Top|Top]]
+ 
+ <<Anchor(MapReduce)>>
  
  == MapReduce ==
+ 
+ ==== Input from Cassandra ====
- While writing output to Cassandra has always been possible by implementing certain interfaces
from the Hadoop library, version 0.6 of Cassandra added support for retrieving data from Cassandra.
 Cassandra 0.6 adds implementations of [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]],
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]],
and [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader]]
so that Hadoop [[http://hadoop.apache.org/mapreduce/|MapReduce]] jobs can retrieve data from
Cassandra.  For an example of how this works, see the contrib/word_count example in 0.6 or
later.  Cassandra rows or row  fragments (that is, pairs of key + `SortedMap`  of columns)
are input to Map tasks for  processing by your job, as specified by a `SlicePredicate`  that
describes which columns to fetch from each row.
+ Cassandra 0.6 and later add support for retrieving data from Cassandra.  This is based
on implementations of [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]],
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]],
and [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader]]
so that Hadoop MapReduce jobs can retrieve data from Cassandra.  For an example of how this
works, see the contrib/word_count example in 0.6 or later.  Cassandra rows or row fragments
(that is, pairs of key + `SortedMap` of columns) are input to Map tasks for processing by
your job, as specified by a `SlicePredicate` that describes which columns to fetch from each
row.
  
  Here's how this looks in the word_count example, which selects just one  configurable columnName
from each row:
  
@@ -13, +29 @@

              SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
              ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
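
For context, the job setup around those two lines looks roughly like the following. This is a sketch, not verbatim contrib code: the `ColumnFamilyInputFormat` class, the `ConfigHelper.setColumnFamily` signature, and the `Keyspace1`/`Standard1` names are assumptions based on the word_count example and should be checked against the contrib/word_count source in your release.

{{{
import java.util.Arrays;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(getConf(), "wordcount");
// Read input from Cassandra rather than from HDFS files
job.setInputFormatClass(ColumnFamilyInputFormat.class);
// Tell the input format which keyspace/column family to read
ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
// Fetch only the one column we care about from each row
SlicePredicate predicate = new SlicePredicate()
        .setColumn_names(Arrays.asList(columnName.getBytes()));
ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
}}}

With this in place, each Map task receives a row key plus the `SortedMap` of columns selected by the predicate.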
- Cassandra's splits are location-aware (this is the nature of the Hadoop [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]]
design).  Cassandra  gives the Hadoop [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobTracker.html|JobTracker]]
a list of locations with each split of data.  That way, the !JobTracker can try to preserve
data locality when  assigning tasks to [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TaskTracker.html|TaskTracker]]s.
 Therefore, when using Hadoop alongside  Cassandra, it is best to have a !TaskTracker running
on the same node as  the Cassandra nodes, if data locality while processing is desired and
to  minimize copying data between Cassandra and Hadoop nodes.
+ Cassandra's splits are location-aware (this is the nature of the Hadoop [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]]
design).  Cassandra  gives the Hadoop [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobTracker.html|JobTracker]]
a list of locations with each split of data.  That way, the !JobTracker can try to preserve
data locality when assigning tasks to [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TaskTracker.html|TaskTracker]]s.
 Therefore, when using Hadoop alongside Cassandra, it is best to have a !TaskTracker running
on each Cassandra node.
  
- As of 0.7, there will be a basic mechanism included in Cassandra for  outputting data to
cassandra.  See [[https://issues.apache.org/jira/browse/CASSANDRA-1101|CASSANDRA-1101]]  for
details.
+ As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml.
See the READMEs in the word_count and pig contrib modules for more details.
+ 
+ ==== Output To Cassandra ====
+ 
+ As of 0.7, there is a basic mechanism included in Cassandra for outputting data to Cassandra.
 The contrib/word_count example in 0.7 contains two reducers - one for outputting data to
the filesystem (default) and one to output data to Cassandra using this new mechanism.  See
that example in the latest release for details.
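
In outline, the output side of such a job might be wired up as below. This is a hedged sketch only: `ColumnFamilyOutputFormat` and `ConfigHelper.setOutputColumnFamily` are assumed names, and the reducer's exact key/value types vary by release, so consult the Cassandra-writing reducer in contrib/word_count for the real signatures.

{{{
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(getConf(), "wordcount");
// Write reducer output to Cassandra instead of the filesystem
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
// The reducer then emits a row key plus a list of thrift Mutation
// objects, which the output format batches and sends to the cluster.
}}}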
+ 
+ ==== Hadoop Streaming ====
+ 
+ As of 0.7, there is support for [[http://hadoop.apache.org/common/docs/r0.20.0/streaming.html|Hadoop
Streaming]].  For examples on how to use Streaming with Cassandra, see the contrib section
of the Cassandra source.  The relevant tickets are [[https://issues.apache.org/jira/browse/CASSANDRA-1368|CASSANDRA-1368]]
and [[https://issues.apache.org/jira/browse/CASSANDRA-1497|CASSANDRA-1497]].
+ 
+ ==== Troubleshooting ====
  
  Releases before 0.6.2/0.7 are affected by a small resource leak: connections are not
released properly, which may cause jobs to fail. Depending on your local setup you may hit
this issue; a workaround is to raise the limit of open file descriptors for the process
(e.g. in linux/bash using `ulimit -n 32000`).  The error will be reported on the hadoop job
side as a thrift !TimedOutException.
  
@@ -30, +56 @@

  {{{
               ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
  }}}
+ 
+ [[#Top|Top]]
+ 
+ <<Anchor(Pig)>>
+ 
  == Pig ==
- Cassandra 0.6 also adds support for [[http://hadoop.apache.org/pig/|Pig]] with its own implementation
of [[http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]].
 This allows Pig queries to be run against data stored in Cassandra.  For an example of this,
see the contrib/pig example in 0.6 and later.
+ Cassandra 0.6+ also adds support for [[http://hadoop.apache.org/pig/|Pig]] with its own
implementation of [[http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]].
 This allows Pig queries to be run against data stored in Cassandra.  For an example of this,
see the contrib/pig example in 0.6 and later.
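
A Pig query against Cassandra data typically starts from a load expression along these lines. This is a hypothetical sketch: the `cassandra://` URI scheme, the `CassandraStorage` function name, and the `(key, columns)` schema are assumptions to be verified against the contrib/pig README and scripts in your release.

{{{
grunt> rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
              AS (key, columns: bag {T: tuple(name, value)});
grunt> cols = FOREACH rows GENERATE FLATTEN(columns);
}}}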
+ 
+ [[#Top|Top]]
+ 
+ <<Anchor(Hive)>>
  
  == Hive ==
- Work is being done to add Hive support - see [[https://issues.apache.org/jira/browse/HIVE-1434|HIVE-1434]]
and [[https://issues.apache.org/jira/browse/CASSANDRA-913|CASSANDRA-913]]
+ Work is being finalized to add support for Hive - see [[https://issues.apache.org/jira/browse/HIVE-1434|HIVE-1434]].
  
+ [[#Top|Top]]
+ 
