cassandra-commits mailing list archives

From "Brandyn White (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-1497) Add input support for Hadoop Streaming
Date Sat, 03 Sep 2011 17:12:10 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096726#comment-13096726
] 

Brandyn White commented on CASSANDRA-1497:
------------------------------------------

I am certainly interested in streaming and have talked to others who are.  I'm the author
of the Hadoopy http://bwhite.github.com/hadoopy/ Python library and I'm interested in taking
another stab at streaming support.  Hadoopy and Dumbo both use the TypedBytes format that
is in CDH for communication with the streaming jar.  A simple way to get this to work is to
modify the streaming code (producing a hadoop-cassandra-streaming.jar) so that it uses the same
TypedBytes communication with streaming programs, while the actual job IO goes through the
Cassandra IO.  The user would have the exact same streaming interface, but would specify the
keyspace, etc. using environment variables.
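
As a rough illustration of the environment-variable configuration described above, here is a
minimal Python sketch.  In the proposal the reading would happen inside the patched streaming
jar, so this only shows the shape of the contract.  The variable names (CASSANDRA_KEYSPACE,
CASSANDRA_COLUMN_FAMILY, CASSANDRA_HOST, CASSANDRA_RPC_PORT), the CassandraInputConfig type
and the defaults are all assumptions made for the example, not an agreed interface.

```python
import os
from typing import Mapping, NamedTuple, Optional


class CassandraInputConfig(NamedTuple):
    """Hypothetical job settings taken from the environment."""
    keyspace: str
    column_family: str
    host: str
    rpc_port: int


def load_config(env: Optional[Mapping[str, str]] = None) -> CassandraInputConfig:
    """Build the config from environment variables (names are assumptions)."""
    if env is None:
        env = os.environ
    required = ("CASSANDRA_KEYSPACE", "CASSANDRA_COLUMN_FAMILY")
    missing = [name for name in required if name not in env]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return CassandraInputConfig(
        keyspace=env["CASSANDRA_KEYSPACE"],
        column_family=env["CASSANDRA_COLUMN_FAMILY"],
        # Assumed defaults: a local node on the standard Thrift RPC port.
        host=env.get("CASSANDRA_HOST", "localhost"),
        rpc_port=int(env.get("CASSANDRA_RPC_PORT", "9160")),
    )
```

Failing early on a missing keyspace or column family keeps a misconfigured job from starting
at all, rather than failing later on the cluster.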

The benefits of this are
1. Easy implementation: Take the Cloudera-patched version of streaming, change the IO, and
add environment variable reading.
2. Client side only: As the streaming jar is included in the job, no server side changes are
required.
3. Simple maintenance: If the Hadoop Cassandra interface changes, then this would require
the same simple fixup as any other Hadoop job.
4. The TypedBytes format supports all of the necessary Cassandra types (https://issues.apache.org/jira/browse/HADOOP-5450)
5. Compatible with existing streaming libraries: Hadoopy and Dumbo would only need to know
the path of this new streaming jar
6. No need for Avro
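
To make the TypedBytes point concrete, here is a minimal Python sketch of how a value is
framed on the wire: a one-byte type code followed by the payload, with big-endian lengths
and integers.  It covers only three of the type codes (0 for bytes, 3 for int, 7 for
string) and is not the Hadoopy or Dumbo implementation; the function names are made up for
the example.

```python
import io
import struct

# Subset of the TypedBytes type codes.
TYPE_BYTES = 0
TYPE_INT = 3
TYPE_STRING = 7


def write_value(out, value):
    """Write one value as: type code byte, then payload."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; not handled in this sketch.
        raise TypeError("bool is not supported in this sketch")
    if isinstance(value, bytes):
        out.write(struct.pack(">Bi", TYPE_BYTES, len(value)))
        out.write(value)
    elif isinstance(value, int):
        out.write(struct.pack(">Bi", TYPE_INT, value))
    elif isinstance(value, str):
        data = value.encode("utf-8")
        out.write(struct.pack(">Bi", TYPE_STRING, len(data)))
        out.write(data)
    else:
        raise TypeError("unsupported type: %r" % type(value))


def read_value(inp):
    """Read one value written by write_value; raise EOFError at end of input."""
    code_byte = inp.read(1)
    if not code_byte:
        raise EOFError("no more values")
    code = code_byte[0]
    if code == TYPE_INT:
        return struct.unpack(">i", inp.read(4))[0]
    if code in (TYPE_BYTES, TYPE_STRING):
        (length,) = struct.unpack(">i", inp.read(4))
        data = inp.read(length)
        return data if code == TYPE_BYTES else data.decode("utf-8")
    raise ValueError("type code %d is not handled in this sketch" % code)


# A streaming record is a key followed by a value, e.g. a row key and a column value.
buf = io.BytesIO()
write_value(buf, b"row1")
write_value(buf, "some column value")
buf.seek(0)
key, value = read_value(buf), read_value(buf)
```

Because row keys and column values can travel as raw bytes (code 0), binary Cassandra data
does not have to be squeezed into tab-separated text lines.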

The negatives of this are
1. Duplicated code: This would be a patched copy of the streaming jar, although the changes
themselves can be stored as a patch.
2. TypedBytes dependency: I'd have to check, but this solution should work on a stock Hadoop
(cluster side).  It does require TypedBytes (client side), which can be included in the jar.

I can code this up but I wanted to get some feedback from the community first.

> Add input support for Hadoop Streaming
> --------------------------------------
>
>                 Key: CASSANDRA-1497
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1497
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Hadoop
>            Reporter: Jeremy Hanna
>         Attachments: 0001-An-updated-avro-based-input-streaming-solution.patch
>
>
> related to CASSANDRA-1368 - create similar functionality for input streaming.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
