incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeremy Hanna <jeremy.hanna1...@gmail.com>
Subject Re: About python streaming using Cassandra as input
Date Mon, 09 May 2011 20:40:13 GMT
pig/hive/brisk are certainly great ways of doing mapreduce with cassandra.

I had written the patch to 1497 last Fall and it didn't quite work then.  I had meant to get
back to it, but since then I've changed jobs and have been really busy there.

I do like how the patch abstracts the CFIF/CFRR so that it could be pluggable with different
formats - like avro or others.  That would make it possible to plug it in more easily into
dumbo - https://github.com/klbostee/dumbo/wiki/ for instance.  It has an abstract parent that
does all of the Cassandra heavy lifting and each child only does things specific to avro or
etc.

If anyone wants to get the patch working against current 0.7-branch, I wouldn't mind answering
questions about it.  I just don't currently have time to rebase and get it working.  I would
only take the idea and structure from the patch and use as much of 0.7-branch code as possible.
 Doing that, it should be fairly straightforward as it just adds older mapred package method
support and abstracts out the Cassandra specific bits.  

On May 9, 2011, at 3:14 PM, Jonathan Ellis wrote:

> You'll have a lot more luck w/ pig or hive as a high-level hadoop
> client, than python.  Certainly until 1470 is done for real.
> 
> Brisk does the hadoop-on-cassandra integration for you:
> http://www.datastax.com/docs/0.8/brisk/about_brisk#key-features-of-brisk
> 
> On Mon, May 9, 2011 at 2:37 AM, Danhang Tang <danny@zugoservices.com> wrote:
>> Hi all,
>> 
>> I've been trying to apply this patch to Cassandra but ran into some errors.
>> https://issues.apache.org/jira/browse/CASSANDRA-1497
>> 
>> The comments said it's fixed for version 0.7.1. But I can't directly apply
>> it to this version. So I apply it manually to the java files in hadoop
>> package. Compiling was successful. But then when executing the
>> hadoop_streaming_input
>> I encountered a runtime error:
>> 
>> 11/05/06 17:27:21 WARN conf.Configuration: mapred.job.tracker is deprecated.
>> Instead, use mapreduce.jobtracker.address
>> 
>> packageJobJar: [./bin/../../../interface/avro/cassandra.avpr,
>> ./bin/mapper.py, ./bin/reducer.py,
>> /tmp/hadoop-radfactory/hadoop-unjar8363580286439315517/] []
>> /tmp/streamjob4200946905356051819.jar tmpDir=null
>> 
>> 11/05/06 17:27:23 INFO mapreduce.JobSubmitter: Cleaning up the staging area
>> hdfs://client1:9001/tmp/hadoop-root/mapred/staging/radfactory/.staging/job_201105051628_0015
>> 
>> Exception in thread "main" java.lang.InstantiationError:
>> org.apache.hadoop.mapreduce.JobContext
>> 
>> at
>> org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:138)
>> 
>> at
>> org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:428)
>> 
>> at
>> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:420)
>> 
>> at
>> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:338)
>> 
>> at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)
>> 
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:534)
>> 
>> at
>> org.apache.hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java:924)
>> 
>> at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:123)
>> 
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
>> 
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
>> 
>> at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
>> 
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> 
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> 
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> 
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> 
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
>> 
>> 
>> 
>> Any ideas?
>> 
>> Thanks,
>> 
>> Danny
>> 
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com


Mime
View raw message