crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: inconsistent grouping of map jobs
Date Wed, 20 Feb 2013 14:43:21 GMT
Hey Mike,

The code looks right to me. Let me whip up a test and see if I can
replicate it easily. Is there anything funky beyond what's in your snippet
that I should be aware of?

J


On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta <mike.barretta@gmail.com> wrote:

> I have a number of "tables" in HDFS, represented as folders containing
> SequenceFiles of serialized objects.  I'm trying to write a tool that will
> reassemble these objects and output each of the tables into its own CSV
> file.
>
> The wrinkle is that some of the "tables" hold objects with a list of
> related child objects.  Those child objects should get split out into their
> own table.
>
> Here is essentially what my loop looks like (in Groovy):
>
> //loop through each top-level table
> paths.each { path ->
>     def source = From.sequenceFile(new Path(path),
>         Writables.writables(ColumnKey.class),
>         Writables.writables(ColumnDataArrayWritable.class)
>     )
>
>     //read it in
>     def data = crunchPipeline.read(source)
>
>     //write it out
>     crunchPipeline.write(
>         data.parallelDo(new MyDoFn(path), Writables.strings()),
>         To.textFile("$path/csv")
>     )
>
>     //handle children using same PTable as parent
>     if (path == TABLE_MESSAGE_DATA) {
>         messageChildPaths.each {  childPath ->
>             crunchPipeline.write(
>                 data.parallelDo(new MyDoFn(childPath), Writables.strings()),
>                 To.textFile("$childPath/csv")
>             )
>         }
>     }
> }
>
> The parent and child jobs generally get grouped into a single map job, but
> most of the time only some of the child tables get included, which is to
> say, sometimes a child table does not get output.  There doesn't seem to
> be a pattern: sometimes all of them get included, sometimes only 1 or 2.
>
> Am I missing something? Is there a way to specify which jobs should be
> combined?
>
> Thanks,
> Mike
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
