From: "Brandyn White (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Date: Sun, 4 Sep 2011 03:59:09 +0000 (UTC)
Message-ID: <1253544283.15482.1315108749830.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1656300432.15455.1315107309801.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (CASSANDRA-3134) Patch Hadoop Streaming Source to Support Cassandra IO

[
https://issues.apache.org/jira/browse/CASSANDRA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096811#comment-13096811 ]

Brandyn White commented on CASSANDRA-3134:
------------------------------------------

So the only requirement is that the Hadoop distribution has TypedBytes support. I personally use CDH, but I believe TypedBytes was accepted upstream in [Hadoop 0.21|http://hadoop.apache.org/common/docs/r0.21.0/changes.html], so this would work on vanilla 0.21 and on CDH 2/3.

> Patch Hadoop Streaming Source to Support Cassandra IO
> -----------------------------------------------------
>
>                 Key: CASSANDRA-3134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Hadoop
>            Reporter: Brandyn White
>            Priority: Minor
>              Labels: hadoop, hadoop_examples_streaming
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> (text is a repost from [CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])
> I'm the author of the Hadoopy (http://bwhite.github.com/hadoopy/) Python library and I'm interested in taking another stab at streaming support. Hadoopy and Dumbo both use the TypedBytes format that is in CDH for communication with the streaming jar. A simple way to get this to work is to modify the streaming code (making a hadoop-cassandra-streaming.jar) so that it uses the same TypedBytes communication with streaming programs, while the actual job IO uses the Cassandra IO. The user would have the exact same streaming interface, but would specify the keyspace, etc. using environment variables.
> The benefits of this are
> 1. Easy implementation: Take the Cloudera-patched version of streaming, change the IO, and add environment variable reading.
> 2. Client side only: As the streaming jar is included in the job, no server-side changes are required.
> 3. Simple maintenance: If the Hadoop Cassandra interface changes, then this would require the same simple fixup as any other Hadoop job.
> 4. The TypedBytes format supports all of the necessary Cassandra types (https://issues.apache.org/jira/browse/HADOOP-5450)
> 5. Compatible with existing streaming libraries: Hadoopy and Dumbo would only need to know the path of this new streaming jar.
> 6. No need for Avro
> The negatives of this are
> 1. Duplicative code: This would be a copy and patch of the streaming jar. The change itself can be stored as a patch.
> 2. I'd have to check, but this solution should work on a stock Hadoop (cluster side); it requires TypedBytes (client side), which can be included in the jar.
> I can code this up but I wanted to get some feedback from the community first.
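
For readers unfamiliar with the TypedBytes framing that the proposal relies on, the following is a minimal Python sketch of how a streaming program might encode and decode a few value types. It is not code from Hadoopy, Dumbo, or the proposed jar. The type codes (0 for bytes, 3 for int, 7 for string) and the big-endian, length-prefixed layout reflect my understanding of Hadoop's org.apache.hadoop.typedbytes package and should be checked against the Hadoop version in use. The function names write_typed and read_typed are hypothetical.

```python
import io
import struct

# Type codes as understood from Hadoop's typedbytes package (assumption: verify
# against your Hadoop version). Only three of the supported types are shown.
BYTES, INT, STRING = 0, 3, 7


def write_typed(out, value):
    """Write one value as <1-byte type code><big-endian payload>."""
    if isinstance(value, bytes):
        # Raw bytes: 4-byte unsigned length, then the bytes themselves.
        out.write(struct.pack(">BI", BYTES, len(value)))
        out.write(value)
    elif type(value) is int:
        # 32-bit signed integer (bool is deliberately excluded).
        out.write(struct.pack(">Bi", INT, value))
    elif isinstance(value, str):
        # String: 4-byte unsigned length, then UTF-8 encoded text.
        data = value.encode("utf-8")
        out.write(struct.pack(">BI", STRING, len(data)))
        out.write(data)
    else:
        raise TypeError("unsupported type: %r" % type(value))


def read_typed(inp):
    """Read one value written by write_typed."""
    head = inp.read(1)
    if not head:
        raise EOFError("no more typed bytes")
    code = head[0]
    if code == INT:
        return struct.unpack(">i", inp.read(4))[0]
    if code in (BYTES, STRING):
        (length,) = struct.unpack(">I", inp.read(4))
        data = inp.read(length)
        return data if code == BYTES else data.decode("utf-8")
    raise ValueError("unsupported type code: %d" % code)


if __name__ == "__main__":
    # A hypothetical (row key, column value) pair, as a streaming job might see it.
    buf = io.BytesIO()
    write_typed(buf, b"row1")
    write_typed(buf, "some column value")
    buf.seek(0)
    print(read_typed(buf), read_typed(buf))
```

Because every value carries its own type code and length, a key such as a raw Cassandra row key can pass through the streaming pipe without being treated as newline-delimited text, which is presumably why the proposal keeps this format between the jar and the streaming program.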