hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Real Multiple Outputs for Hadoop -- is this implementation correct?
Date Fri, 13 Sep 2013 19:32:53 GMT
I took a very brief look, and the approach of using multiple
OutputCommitters (OCs), one per unique parent path from a task, seems
like the right thing to do. Nice work! Do consider contributing this if
it's working well for you :)
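
To make the "one committer per unique parent path" idea concrete, here is a
minimal sketch. It is not taken from RealMultipleOutputs and does not use the
Hadoop API: FakeCommitter is a hypothetical stand-in for an OutputCommitter,
and the class name, method names, and paths are assumptions made up for
illustration. It only shows the bookkeeping: output files that share a parent
directory share one committer, and all committers are committed when the task
finishes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerPathCommitters {
    /** Hypothetical stand-in for an OutputCommitter; it only records state. */
    public static final class FakeCommitter {
        public final String parentPath;
        public boolean committed = false;

        FakeCommitter(String parentPath) { this.parentPath = parentPath; }

        void commitTask() { committed = true; }
    }

    // One committer per unique parent directory seen by this task.
    private final Map<String, FakeCommitter> byParent = new LinkedHashMap<>();

    /** Parent directory of an absolute, '/'-separated path (assumed format). */
    public static String parentOf(String outputFile) {
        int slash = outputFile.lastIndexOf('/');
        return slash <= 0 ? "/" : outputFile.substring(0, slash);
    }

    /** Returns the committer for the file's parent, creating it on first use. */
    public FakeCommitter committerFor(String outputFile) {
        return byParent.computeIfAbsent(parentOf(outputFile), FakeCommitter::new);
    }

    /** Commits every committer created so far, e.g. at task cleanup. */
    public void commitAll() {
        for (FakeCommitter c : byParent.values()) {
            c.commitTask();
        }
    }

    public int committerCount() { return byParent.size(); }

    public static void main(String[] args) {
        PerPathCommitters registry = new PerPathCommitters();
        registry.committerFor("/out/a/part-r-00000");
        registry.committerFor("/out/a/part-r-00001"); // same parent, reused
        registry.committerFor("/out/b/part-r-00000"); // new parent, new committer
        registry.commitAll();
        System.out.println("committers: " + registry.committerCount());
    }
}
```

In a real job the stand-in would be replaced by an actual committer bound to
each parent path, and setup, commit, and abort would all need to be forwarded
to every one of them.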

On Sat, Sep 14, 2013 at 12:53 AM, Paul Houle <ontology2@gmail.com> wrote:
> Hey guys, I spent some time last week thinking about Hadoop before I wrote
> my own class, RealMultipleOutputs, that does something like what
> MultipleOutputs does, except that you can specify different HDFS paths for
> the different output streams. My pals were telling me to use Cascading or
> Pig if I want this functionality, but otherwise I was happy writing plain
> M/R jars.
> I wrote up the implementation here:
> https://github.com/paulhoule/infovore/wiki/Real-Multiple-Outputs-in-Hadoop
> And this works hand in hand with an abstraction layer that supports unit
> testing with Mockito:
> https://github.com/paulhoule/infovore/wiki/Unit-Testing-Hadoop-Mappers-and-Reducers
> Anyway, I'd appreciate anybody looking at this code and trying to poke
> holes in it. It runs OK on my tiny dev cluster on 1.0.4 and 1.1.2, and on
> AMZN EMR, but I am wondering if I missed something.

Harsh J
