cassandra-commits mailing list archives

From "Antonio Piccolboni (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3134) Patch Hadoop Streaming Source to Support Cassandra IO
Date Tue, 27 Mar 2012 17:34:26 GMT


Antonio Piccolboni commented on CASSANDRA-3134:

Hi, I am developing a package for R (rmr) that depends on streaming and uses typedbytes. I
was wondering whether the discussion here is still heading toward having two separate
streaming jars. For other data stores, such as HBase, one just has to change the inputformat
option (to fm.last.hbase.mapred.TypedBytesTableInputFormat) and supply additional options
to select tables and columns, but the jar is the same. Having a jar for each data store
would seem to create a lot of duplication and would not take advantage of the pluggable
IO that streaming offers. Or am I missing something?
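
To illustrate the point about the pluggable inputformat, here is a rough Python sketch of how a
wrapper library might assemble the streaming command line so that only the inputformat class
changes between data stores. The function name, the jar path, and the input/output/mapper/reducer
values are placeholders for illustration only; the table and column selection options mentioned
above are left out because their exact names are not given here.

```python
def streaming_argv(streaming_jar, input_format, input_path, output_path,
                   mapper, reducer):
    """Build an argv list for a Hadoop Streaming job (hypothetical helper).

    The same streaming jar is used for every data store; only the
    -inputformat class name differs.
    """
    return [
        "hadoop", "jar", streaming_jar,
        "-io", "typedbytes",            # talk typedbytes to the mapper/reducer
        "-inputformat", input_format,   # pluggable input side
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


# Placeholder values, not taken from any real deployment.
argv = streaming_argv(
    streaming_jar="/path/to/hadoop-streaming.jar",
    input_format="fm.last.hbase.mapred.TypedBytesTableInputFormat",
    input_path="placeholder_input",
    output_path="placeholder_output",
    mapper="mapper.py",
    reducer="reducer.py",
)
print(" ".join(argv))
```

The list could be handed to subprocess.call as is; a store-specific jar would only change the
first argument, which is the duplication being questioned.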
> Patch Hadoop Streaming Source to Support Cassandra IO
> -----------------------------------------------------
>                 Key: CASSANDRA-3134
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Hadoop
>            Reporter: Brandyn White
>            Priority: Minor
>              Labels: hadoop, hadoop_examples_streaming
>   Original Estimate: 504h
>  Remaining Estimate: 504h
> (text is a repost from [CASSANDRA-1497|])
> I'm the author of the Hadoopy Python library and I'm
interested in taking another stab at streaming support. Hadoopy and Dumbo both use the TypedBytes
format that is in CDH for communication with the streaming jar. A simple way to get this to
work is to modify the streaming code (make hadoop-cassandra-streaming.jar) so that it uses the
same TypedBytes communication with streaming programs, but the actual job IO uses the
Cassandra IO. The user would have the exact same streaming interface, but would specify
the keyspace, etc. using environment variables.
> The benefits of this are
> 1. Easy implementation: Take the Cloudera-patched version of streaming, change the
IO, and add environment variable reading.
> 2. Only Client side: As the streaming jar is included in the job, no server side changes
are required.
> 3. Simple maintenance: If the Hadoop Cassandra interface changes, then this would require
the same simple fixup as any other Hadoop job.
> 4. The TypedBytes format supports all of the necessary Cassandra types (
> 5. Compatible with existing streaming libraries: Hadoopy and Dumbo would only need to
know the path of this new streaming jar.
> 6. No need for avro
> The negatives of this are
> 1. Duplicative code: This would be a dupe and patch of the streaming jar. This can be
stored itself as a patch.
> 2. I'd have to check, but this solution should work on a stock Hadoop (cluster side). It
requires TypedBytes (client side), which can be included in the jar.
> I can code this up but I wanted to get some feedback from the community first.
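
For readers unfamiliar with the TypedBytes format mentioned in the description, the following is
a minimal Python sketch of how a few scalar values are serialized: a one-byte type code followed
by a big-endian payload. It covers only bytes, boolean, int, long, double, and string; the
function names are made up for this example and it is not the code used by Hadoopy or Dumbo.

```python
import struct


def encode_typedbytes(value):
    """Encode one scalar as typedbytes: 1-byte type code + big-endian payload."""
    if isinstance(value, bool):            # check bool before int (bool is an int)
        return struct.pack(">bb", 2, int(value))
    if isinstance(value, bytes):           # code 0: length-prefixed raw bytes
        return struct.pack(">bi", 0, len(value)) + value
    if isinstance(value, int):
        if -2**31 <= value < 2**31:        # code 3: 32-bit integer
            return struct.pack(">bi", 3, value)
        return struct.pack(">bq", 4, value)  # code 4: 64-bit long
    if isinstance(value, float):           # code 6: 64-bit double
        return struct.pack(">bd", 6, value)
    if isinstance(value, str):             # code 7: length-prefixed UTF-8 string
        data = value.encode("utf-8")
        return struct.pack(">bi", 7, len(data)) + data
    raise TypeError("unsupported type in this sketch: %r" % type(value))


def encode_pair(key, value):
    """A streaming record is the encoded key followed by the encoded value."""
    return encode_typedbytes(key) + encode_typedbytes(value)


record = encode_pair("row1", 42)
```

A streaming program would write such records to stdout and read them back from stdin, which is
why the same client library can work regardless of which data store backs the job.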
